ggerganov / llama.cpp

LLM inference in C/C++

Bug: Inference speed when built with HIPBLAS (gfx1100) is very slow, only 2~5 t/s #8186

Closed: Lookforworld closed this issue 3 months ago

Lookforworld commented 3 months ago

What happened?

I use a 7900 XTX and get only ~3 t/s when running qwen2-7b-instruct-q5_k_m.gguf with llama.cpp. Whether I set -ngl 1000 or -ngl 0, the GPU's VRAM usage stays very low, system RAM usage is high, and GPU utilization is 90%+ during inference. I tested gemma2 and WizardLM-2-13b.Q8_0 and hit the same issue, so I don't think it's a problem with the models; it looks like a general problem: the model is loaded into RAM instead of VRAM, and the GPU reads the model directly from RAM?

Build command: make -j10 GGML_HIPBLAS=1
Run command: ./llama-cli -m ../qwin2-7b-gguf/qwen2-7b-instruct-q6_k.gguf -n 512 -co -i -if -f prompts/chat-with-qwen.txt --in-prefix "<|im_start|>user\n" --in-suffix "<|im_end|>\n<|im_start|>assistant\n" -ngl 100 -fa
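
One way to confirm whether the weights actually end up in VRAM is to watch GPU memory while the model loads. A minimal sketch (not from the original report), assuming the stock rocm-smi tool that ships with ROCm:

# In a second terminal, refresh per-device VRAM usage once a second while
# llama-cli loads; a fully offloaded 7B Q6_K model should occupy roughly 5-6 GiB.
watch -n 1 rocm-smi --showmeminfo vram

# One-shot view of GPU utilization and memory-use percentage per device.
rocm-smi --showuse --showmemuse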

Name and Version

version: 3259 (e57dc620) built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

(base) deeper@deeper:~/Documents/llama.cpp$ ./llama-cli -m ../qwin2-7b-gguf/qwen2-7b-instruct-q6_k.gguf    -n 512 -co -i -if -f prompts/chat-with-qwen.txt   --in-prefix "<|im_start|>user\n"   --in-suffix "<|im_end|>\n<|im_start|>assistant\n"   -ngl 1000 -fa
Log start
main: build = 3259 (e57dc620)
main: built with cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0 for x86_64-linux-gnu
main: seed  = 1719574232
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from ../qwin2-7b-gguf/qwen2-7b-instruct-q6_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = qwen2-7b-instruct
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 3584
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 18944
llama_model_loader: - kv   6:                 qwen2.attention.head_count u32              = 28
llama_model_loader: - kv   7:              qwen2.attention.head_count_kv u32              = 4
llama_model_loader: - kv   8:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv   9:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  10:                          general.file_type u32              = 18
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  12:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  19:                    tokenizer.chat_template str              = {% for message in messages %}{% if lo...
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - kv  22:                      quantize.imatrix.file str              = ../Qwen2/gguf/qwen2-7b-imatrix/imatri...
llama_model_loader: - kv  23:                   quantize.imatrix.dataset str              = ../sft_2406.txt
llama_model_loader: - kv  24:             quantize.imatrix.entries_count i32              = 196
llama_model_loader: - kv  25:              quantize.imatrix.chunks_count i32              = 1937
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q6_K:  198 tensors
llm_load_vocab: special tokens cache size = 421
llm_load_vocab: token to piece cache size = 0.9352 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 7.62 B
llm_load_print_meta: model size       = 5.82 GiB (6.56 BPW) 
llm_load_print_meta: general.name     = qwen2-7b-instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  5532.43 MiB
llm_load_tensors:        CPU buffer size =   426.36 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size =  1792.00 MiB
llama_new_context_with_model: KV self size  = 1792.00 MiB, K (f16):  896.00 MiB, V (f16):  896.00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =     0.58 MiB
llama_new_context_with_model:      ROCm0 compute buffer size =   304.00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =    71.01 MiB
llama_new_context_with_model: graph nodes  = 875
llama_new_context_with_model: graph splits = 2
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

system_info: n_threads = 6 / 12 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
main: interactive mode on.
Input prefix: '<|im_start|>user
'
Input suffix: '<|im_end|>
<|im_start|>assistant
'
sampling: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 32768, n_batch = 2048, n_predict = 512, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

You are a helpful assistant.<|im_start|>user
hello  
<|im_end|>
<|im_start|>assistant
Hello! How can I assist you today?
<|im_start|>user

llama_print_timings:        load time =    4273.47 ms
llama_print_timings:      sample time =       0.71 ms /    10 runs   (    0.07 ms per token, 14144.27 tokens per second)
llama_print_timings: prompt eval time =    6727.18 ms /    16 tokens (  420.45 ms per token,     2.38 tokens per second)
llama_print_timings:        eval time =    3771.96 ms /     9 runs   (  419.11 ms per token,     2.39 tokens per second)
llama_print_timings:       total time =   12441.43 ms /    25 tokens
tristandruyen commented 3 months ago

Try it without -fa; in my testing it does not improve performance on ROCm and usually makes it slightly worse (though it shouldn't be nearly as bad as your results).

I get quite reasonable performance with my gfx1100 card (a Pro W7800). Here is some llama-bench output with models of various sizes. Sadly I don't have qwen2-7b downloaded to compare directly, but my Qwen 1.5 32B is faster than your qwen2-7b, which seems weird:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Pro W7800, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| gemma2 27B Q6_K                |  24.30 GiB |    28.41 B | CUDA       | 999 |      16 |         pp512 |    404.84 ± 0.46 |
| gemma2 27B Q6_K                |  24.30 GiB |    28.41 B | CUDA       | 999 |      16 |         tg512 |     15.73 ± 0.01 |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| gemma2 9B Q8_0                 |  11.66 GiB |    10.16 B | CUDA       | 999 |      16 |         pp512 |   1209.62 ± 2.94 |
| gemma2 9B Q8_0                 |  11.66 GiB |    10.16 B | CUDA       | 999 |      16 |         tg512 |     31.46 ± 0.02 |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| phi3 3B Q6_K                   |   2.92 GiB |     3.82 B | CUDA       | 999 |      16 |         pp512 |   2307.62 ± 8.44 |
| phi3 3B Q6_K                   |   2.92 GiB |     3.82 B | CUDA       | 999 |      16 |         tg512 |     78.00 ± 0.15 |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 70B IQ3_XXS - 3.0625 bpw |  25.58 GiB |    70.55 B | CUDA       | 999 |      16 |         pp512 |    126.48 ± 0.35 |
| llama 70B IQ3_XXS - 3.0625 bpw |  25.58 GiB |    70.55 B | CUDA       | 999 |      16 |         tg512 |     10.01 ± 0.10 |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | CUDA       | 999 |      16 |         pp512 |  1237.92 ± 12.16 |
| llama 8B Q6_K                  |   6.14 GiB |     8.03 B | CUDA       | 999 |      16 |         tg512 |     51.17 ± 0.09 |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 30B Q5_K - Small         |  22.08 GiB |    34.39 B | CUDA       | 999 |      16 |         pp512 |    297.87 ± 0.65 |
| llama 30B Q5_K - Small         |  22.08 GiB |    34.39 B | CUDA       | 999 |      16 |         tg512 |     15.19 ± 0.04 |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama ?B Q6_K                  |  24.84 GiB |    32.51 B | CUDA       | 999 |      16 |         pp512 |    365.29 ± 1.16 |
| llama ?B Q6_K                  |  24.84 GiB |    32.51 B | CUDA       | 999 |      16 |         tg512 |     14.15 ± 0.03 |

So it seems to me this is either a 7900 XTX-specific issue, or you compiled it differently? For me, VRAM is also used as expected.

benchmark script:

./llama-bench \
              -t 16 -ngl 999 -p 512 -n 512 -r 3 \
              -m ../models/gemma-2-27b-it-imat-Q6_K_L.gguf
./llama-bench \
              -t 16 -ngl 999 -p 512 -n 512 -r 3 \
              -m ../models/gemma-2-9b-it-imat-Q8_0_L.gguf
./llama-bench \
              -t 16 -ngl 999 -p 512 -n 512 -r 3 \
              -m ../models/phi-3-mini-4k-instruct-imat-Q6_K.gguf
./llama-bench \
              -t 16 -ngl 999 -p 512 -n 512 -r 3 \
              -m ../models/Meta-Llama-3-70B-Instruct-IQ3_XXS.gguf
./llama-bench \
              -t 16 -ngl 999 -p 512 -n 512 -r 3 \
              -m ../models/meta-llama-3-8b-instruct-imat-Q6_K.gguf
./llama-bench \
              -t 16 -ngl 999 -p 512 -n 512 -r 3 \
              -m ../models/yi-1.5-34b-chat-16k-imat-Q5_K_S.gguf
./llama-bench \
              -t 16 -ngl 999 -p 512 -n 512 -r 3 \
              -m ../models/qwen1.5-32b-chat-imat-Q6_K.gguf
Lookforworld commented 3 months ago

> Try it without -fa; in my testing it does not improve performance on ROCm and usually makes it slightly worse (though it shouldn't be nearly as bad as your results).

My build command: make -j10 GGML_HIPBLAS=1. I have tested it; running without -fa gives the same result as with it!
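
As a general aside (not from the original exchange), after changing anything in the ROCm install it's worth doing a clean rebuild so no object files from the previous toolchain are reused; a sketch using the repo's Makefile build:

# Wipe previous objects, then rebuild with the HIPBLAS backend.
make clean
make -j10 GGML_HIPBLAS=1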

tristandruyen commented 3 months ago

I've just downloaded qwen2-7b (q6_k) from here, rebuilt using exactly your make line, and get this:

| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| qwen2 ?B Q6_K                  |   5.82 GiB |     7.62 B | ROCm       | 999 |      16 |         pp512 |   1640.81 ± 6.91 |
| qwen2 ?B Q6_K                  |   5.82 GiB |     7.62 B | ROCm       | 999 |      16 |         tg512 |     54.75 ± 0.11 |

I've also run the exact line from your issue (with only the model path adapted) and get these timings:

llama_print_timings:        load time =    3101,30 ms
llama_print_timings:      sample time =       0,82 ms /    17 runs   (    0,05 ms per token, 20858,90 tokens per second)
llama_print_timings: prompt eval time =   23400,85 ms /    16 tokens ( 1462,55 ms per token,     0,68 tokens per second)
llama_print_timings:        eval time =     289,94 ms /    16 runs   (   18,12 ms per token,    55,18 tokens per second)
llama_print_timings:       total time =   28498,56 ms /    32 tokens

Something weird is going on...

Which ROCm version are you on? I'm on 6.0.2.

Full log:

```
./llama-cli -m ../../models_download/direct/qwen2-7b-instruct-imat-Q6_K.gguf -n 512 -co -i -if -f prompts/chat-with-qwen.txt --in-prefix "<|im_start|>user\n" --in-suffix "<|im_end|>\n<|im_start|>assistant\n" -ngl 100 -fa
Log start
main: build = 3260 (139cc621)
main: built with gcc (GCC) 13.3.0 for x86_64-unknown-linux-gnu
main: seed = 1719583145
llama_model_loader: loaded meta data with 25 key-value pairs and 339 tensors from ../../models_download/direct/qwen2-7b-instruct-imat-Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0: general.architecture str = qwen2
llama_model_loader: - kv   1: general.name str = Qwen2-7B-Instruct
llama_model_loader: - kv   2: qwen2.block_count u32 = 28
llama_model_loader: - kv   3: qwen2.context_length u32 = 32768
llama_model_loader: - kv   4: qwen2.embedding_length u32 = 3584
llama_model_loader: - kv   5: qwen2.feed_forward_length u32 = 18944
llama_model_loader: - kv   6: qwen2.attention.head_count u32 = 28
llama_model_loader: - kv   7: qwen2.attention.head_count_kv u32 = 4
llama_model_loader: - kv   8: qwen2.rope.freq_base f32 = 1000000,000000
llama_model_loader: - kv   9: qwen2.attention.layer_norm_rms_epsilon f32 = 0,000001
llama_model_loader: - kv  10: general.file_type u32 = 18
llama_model_loader: - kv  11: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv  12: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv  13: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  14: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  15: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  16: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv  17: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv  18: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv  19: tokenizer.chat_template str = {% for message in messages %}{% if lo...
llama_model_loader: - kv  20: general.quantization_version u32 = 2
llama_model_loader: - kv  21: quantize.imatrix.file str = /home/tristand/ai/models/Qwen2-7B-Ins...
llama_model_loader: - kv  22: quantize.imatrix.dataset str = /home/tristand/ai/tools/llama.cpp/cal...
llama_model_loader: - kv  23: quantize.imatrix.entries_count i32 = 196
llama_model_loader: - kv  24: quantize.imatrix.chunks_count i32 = 193
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q6_K:  198 tensors
llm_load_vocab: special tokens cache size = 421
llm_load_vocab: token to piece cache size = 0,9352 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 3584
llm_load_print_meta: n_head           = 28
llm_load_print_meta: n_head_kv        = 4
llm_load_print_meta: n_layer          = 28
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 7
llm_load_print_meta: n_embd_k_gqa     = 512
llm_load_print_meta: n_embd_v_gqa     = 512
llm_load_print_meta: f_norm_eps       = 0,0e+00
llm_load_print_meta: f_norm_rms_eps   = 1,0e-06
llm_load_print_meta: f_clamp_kqv      = 0,0e+00
llm_load_print_meta: f_max_alibi_bias = 0,0e+00
llm_load_print_meta: f_logit_scale    = 0,0e+00
llm_load_print_meta: n_ff             = 18944
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000,0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q6_K
llm_load_print_meta: model params     = 7,62 B
llm_load_print_meta: model size       = 5,82 GiB (6,56 BPW)
llm_load_print_meta: general.name     = Qwen2-7B-Instruct
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Pro W7800, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size = 0,30 MiB
llm_load_tensors: offloading 28 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 29/29 layers to GPU
llm_load_tensors:      ROCm0 buffer size = 5532,43 MiB
llm_load_tensors:        CPU buffer size =  426,36 MiB
.......................................................................................
llama_new_context_with_model: n_ctx      = 32768
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: freq_base  = 1000000,0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      ROCm0 KV buffer size = 1792,00 MiB
llama_new_context_with_model: KV self size = 1792,00 MiB, K (f16): 896,00 MiB, V (f16): 896,00 MiB
llama_new_context_with_model:  ROCm_Host  output buffer size =   0,58 MiB
llama_new_context_with_model:      ROCm0 compute buffer size = 304,00 MiB
llama_new_context_with_model:  ROCm_Host compute buffer size =  71,01 MiB
llama_new_context_with_model: graph nodes  = 875
llama_new_context_with_model: graph splits = 2
main: chat template example: <|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

system_info: n_threads = 8 / 16 | AVX = 0 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
main: interactive mode on.
Input prefix: '<|im_start|>user
'
Input suffix: '<|im_end|>
<|im_start|>assistant
'
sampling:
    repeat_last_n = 64, repeat_penalty = 1,000, frequency_penalty = 0,000, presence_penalty = 0,000
    top_k = 40, tfs_z = 1,000, top_p = 0,950, min_p = 0,050, typical_p = 1,000, temp = 0,800
    mirostat = 0, mirostat_lr = 0,100, mirostat_ent = 5,000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 32768, n_batch = 2048, n_predict = 512, n_keep = 0

== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - Press Return to return control to the AI.
 - To return control without starting a new line, end your input with '/'.
 - If you want to submit another line, end your input with '\'.

You are a helpful assistant.<|im_start|>user
hello
<|im_end|>
<|im_start|>assistant
Hello! It's nice to meet you. How can I assist you today?
<|im_start|>user

llama_print_timings:        load time =    3101,30 ms
llama_print_timings:      sample time =       0,82 ms /    17 runs   (    0,05 ms per token, 20858,90 tokens per second)
llama_print_timings: prompt eval time =   23400,85 ms /    16 tokens ( 1462,55 ms per token,     0,68 tokens per second)
llama_print_timings:        eval time =     289,94 ms /    16 runs   (   18,12 ms per token,    55,18 tokens per second)
llama_print_timings:       total time =   28498,56 ms /    32 tokens
```
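
For reference (not part of the original comment), two common ways to check which ROCm version is installed, assuming a standard packaged install under /opt/rocm:

cat /opt/rocm/.info/version                          # version string of the active ROCm install
apt show rocm-libs 2>/dev/null | grep -i '^version'  # package version on Ubuntu/Debian installs
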
Lookforworld commented 3 months ago

> Which ROCm version are you on? I'm on 6.0.2.

My env:

Package: rocm-libs
Version: 6.1.1.60101-90~22.04
Priority: optional
Section: devel
Maintainer: ROCm Dev Support <rocm-dev.support@amd.com>
Installed-Size: 13.3 kB
Depends: hipblas (= 2.1.0.60101-90~22.04), hipblaslt (= 0.7.0.60101-90~22.04), hipfft (= 1.0.14.60101-90~22.04), hipsolver (= 2.1.1.60101-90~22.04), hipsparse (= 3.0.1.60101-90~22.04), hiptensor (= 1.2.0.60101-90~22.04), miopen-hip (= 3.1.0.60101-90~22.04), half (= 1.12.0.60101-90~22.04), rccl (= 2.18.6.60101-90~22.04), rocalution (= 3.1.1.60101-90~22.04), rocblas (= 4.1.0.60101-90~22.04), rocfft (= 1.0.27.60101-90~22.04), rocrand (= 3.0.1.60101-90~22.04), hiprand (= 2.10.16.60101-90~22.04), rocsolver (= 3.25.0.60101-90~22.04), rocsparse (= 3.1.2.60101-90~22.04), rocm-core (= 6.1.1.60101-90~22.04), hipsparselt (= 0.1.0.60101-90~22.04), composablekernel-dev (= 1.1.0.60101-90~22.04), hipblas-dev (= 2.1.0.60101-90~22.04), hipblaslt-dev (= 0.7.0.60101-90~22.04), hipcub-dev (= 3.1.0.60101-90~22.04), hipfft-dev (= 1.0.14.60101-90~22.04), hipsolver-dev (= 2.1.1.60101-90~22.04), hipsparse-dev (= 3.0.1.60101-90~22.04), hiptensor-dev (= 1.2.0.60101-90~22.04), miopen-hip-dev (= 3.1.0.60101-90~22.04), rccl-dev (= 2.18.6.60101-90~22.04), rocalution-dev (= 3.1.1.60101-90~22.04), rocblas-dev (= 4.1.0.60101-90~22.04), rocfft-dev (= 1.0.27.60101-90~22.04), rocprim-dev (= 3.1.0.60101-90~22.04), rocrand-dev (= 3.0.1.60101-90~22.04), hiprand-dev (= 2.10.16.60101-90~22.04), rocsolver-dev (= 3.25.0.60101-90~22.04), rocsparse-dev (= 3.1.2.60101-90~22.04), rocthrust-dev (= 3.0.1.60101-90~22.04), rocwmma-dev (= 1.4.0.60101-90~22.04), hipsparselt-dev (= 0.1.0.60101-90~22.04)
Homepage: https://github.com/RadeonOpenCompute/ROCm
Download-Size: 1,060 B
APT-Sources: https://repo.radeon.com/rocm/apt/6.1.1 jammy/main amd64 Packages
Description: Radeon Open Compute (ROCm) Runtime software stack
eliranwong commented 3 months ago

@Lookforworld Did you try ROCm 6.1.3, which officially supports AMD Radeon™ 7000 series GPUs?

I just came across this post and did a quick test with ROCm 6.1.3 on dual AMD RX 7900 XTX cards. I made notes of the setup at: https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu/tree/main

It seems it works as expected:

Speed Test: CPU vs CPU+GPUx2

To test, I ran the same prompt What is machine learning? with the same model file mistral.gguf.

CPU only

Build from source:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(lscpu | grep '^Core(s)' | awk '{print $NF}')

To run:

./llama-cli -t $(lscpu | grep '^Core(s)' | awk '{print $NF}') --temp 0 -m ../mistral.gguf -p "What is machine learning?"

Output:

llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3917.87 MiB
...
...
...
llama_print_timings:        load time =    1602.50 ms
llama_print_timings:      sample time =      11.38 ms /   571 runs   (    0.02 ms per token, 50193.39 tokens per second)
llama_print_timings: prompt eval time =      81.22 ms /     6 tokens (   13.54 ms per token,    73.88 tokens per second)
llama_print_timings:        eval time =   24270.78 ms /   570 runs   (   42.58 ms per token,    23.49 tokens per second)
llama_print_timings:       total time =   24522.14 ms /   576 tokens

CPU + GPU x 2

Build from source:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make GGML_HIPBLAS=1 AMDGPU_TARGETS=gfx1100 -j$(lscpu | grep '^Core(s)' | awk '{print $NF}')

Remarks: Use GGML_HIPBLAS instead of LLAMA_HIPBLAS

To run:

./llama-cli -t $(lscpu | grep '^Core(s)' | awk '{print $NF}') --temp 0 -m ../mistral.gguf -p "What is machine learning?" -ngl 33

Output:

ggml_cuda_init: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
  Device 1: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size =    0.41 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      ROCm0 buffer size =  1989.53 MiB
llm_load_tensors:      ROCm1 buffer size =  1858.02 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
...
...
...
llama_print_timings:        load time =    3440.52 ms
llama_print_timings:      sample time =      17.78 ms /   952 runs   (    0.02 ms per token, 53540.30 tokens per second)
llama_print_timings: prompt eval time =      12.38 ms /     6 tokens (    2.06 ms per token,   484.46 tokens per second)
llama_print_timings:        eval time =    8928.70 ms /   951 runs   (    9.39 ms per token,   106.51 tokens per second)
llama_print_timings:       total time =    9119.94 ms /   957 tokens

Result

The difference is more than obvious.
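
As a side note (not part of the original comment): with two ROCm devices visible, llama.cpp spreads the offloaded layers across both cards by default. To compare against a single 7900 XTX, something like the following should work, assuming the standard HIP_VISIBLE_DEVICES variable and llama.cpp's --tensor-split option:

# Restrict llama.cpp to the first GPU only.
HIP_VISIBLE_DEVICES=0 ./llama-cli --temp 0 -m ../mistral.gguf -p "What is machine learning?" -ngl 33

# Keep both GPUs but set the split explicitly (here 50/50).
./llama-cli --temp 0 -m ../mistral.gguf -p "What is machine learning?" -ngl 33 --tensor-split 1,1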

eliranwong commented 3 months ago

Out of curiosity, I just tried the qwen2-7b-instruct-imat-Q6_K.gguf, used by @Lookforworld and @tristandruyen

./llama-bench -t $(lscpu | grep '^Core(s)' | awk '{print $NF}') -ngl 999 -p 512 -n 512 -r 3 -m ../qwen2-7b-instruct-imat-Q6_K.gguf

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
  Device 1: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| qwen2 ?B Q6_K                  |   5.82 GiB |     7.62 B | ROCm       | 999 |         pp512 |  3618.46 ± 11.05 |
| qwen2 ?B Q6_K                  |   5.82 GiB |     7.62 B | ROCm       | 999 |         tg512 |     85.75 ± 0.08 |

I also tried the gemma-2-27b-it-Q6_K_L.gguf, used by @tristandruyen

./llama-bench -t $(lscpu | grep '^Core(s)' | awk '{print $NF}') -ngl 999 -p 512 -n 512 -r 3 -m ../gemma-2-27b-it-Q6_K_L.gguf

  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
  Device 1: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | ---------------: |
| gemma2 27B Q6_K                |  24.30 GiB |    28.41 B | ROCm       | 999 |         pp512 |    972.14 ± 1.75 |
| gemma2 27B Q6_K                |  24.30 GiB |    28.41 B | ROCm       | 999 |         tg512 |     27.00 ± 0.02 |

For the issue of running qwen2-7b-instruct-imat-Q6_K.gguf with llama-cli, could it have something to do with the gguf file itself, rather than the 7900 XTX or llama.cpp?

Lookforworld commented 3 months ago

> For the issue of running qwen2-7b-instruct-imat-Q6_K.gguf with llama-cli, could it have something to do with the gguf file itself, rather than the 7900 XTX or llama.cpp?

@eliranwong I recompiled llama.cpp in the officially provided Docker image and the problem still exists; this time qwen2 cannot even be loaded directly, and gemma2 inference is still very slow! This is my GPU info while the model is running; VRAM usage is essentially nothing! (Screenshot from 2024-06-29 23-22-59.) Can you share your development environment, for example the Linux version, the ROCm version, and the llama.cpp version? Thanks!

eliranwong commented 3 months ago

@Lookforworld

./llama-cli --version
version: 3265 (72272b83)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

I ran on a machine that has dual RX 7900 XTX cards, with ROCm 6.1.3. Its setup, including environment variables, was recorded at:

https://github.com/eliranwong/MultiAMDGPU_AIDev_Ubuntu

I also run in an Incus container with a similar setup; setup notes are available at:

https://github.com/eliranwong/incus_container_gui_setup/blob/main/tutorials/multiple_gpu.md

If you want to test the latest 6.1.3 in a container, an Incus container is a good choice and more flexible than Docker. I used it for testing before upgrading from 6.0.2 to 6.1.3.

Btw, I think the Docker version is not the latest 6.1.3?

eliranwong commented 3 months ago

@Lookforworld Here is the output of rocm-smi while running an inference with llama.cpp. You can see the GPUs are working with llama.cpp.

Screenshot from 2024-06-29 21-02-21

I think the issue has nothing to do with the card model, as we both use the RX 7900 XTX.

Also, llama.cpp functions as expected here. I think your issue may relate to something else, like how the GPU is set up.
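
A few generic sanity checks for a ROCm + gfx1100 setup (a sketch, not from the original comment; paths assume a standard packaged ROCm install):

rocminfo | grep -i gfx           # the 7900 XTX should show up as gfx1100
groups | grep -E 'render|video'  # the current user needs the render/video groups
cat /opt/rocm/.info/version      # confirm which ROCm version is actually installed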

Lookforworld commented 3 months ago

@eliranwong @tristandruyen I switched to ROCm 6.1.3 and the issue is solved. The results:

llama_print_timings:        load time =    4287.17 ms
llama_print_timings:      sample time =      60.73 ms /   490 runs   (    0.12 ms per token,  8067.97 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =    6134.97 ms /   490 runs   (   12.52 ms per token,    79.87 tokens per second)
llama_print_timings:       total time =    6719.66 ms /   490 tokens