ggerganov / llama.cpp

LLM inference in C/C++
MIT License

[MULTI-GPU][AMD/HIP5.7] Using the -mg param is critical #4763

Closed. stejpet closed this issue 6 months ago.

stejpet commented 9 months ago

Hi. From my quick testing, I have noticed that setting the main GPU is critical for the model to function.

My setup: 4 x AMD WX9100 (Instinct MI25); 2 connected via x16-to-4×x1 PCIe splitters, 2 connected directly to x16 slots on the motherboard. Ubuntu 22.04.3, kernel 6.2.0-39-generic, ROCm 5.7.
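
The device indices I refer to below follow the HIP enumeration order; something like the following should show which index sits on which PCIe bus (assuming a standard ROCm install with rocm-smi available):

rocm-smi --showbus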

Loading a model across all 4 cards requires that I use -mg 2 or -mg 3. These are the two cards that are connected directly to the motherboard's x16 slots.

Trying to load a model with either of the two x1-connected cards as the main GPU results in garbage output.
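
For reference, the kind of invocation I mean looks roughly like this (model path and prompt are just placeholders; -ngl offloads layers and -mg selects the main GPU):

./main -m ./models/openhermes-2.5-mistral-7b.Q8_0.gguf -ngl 99 -mg 2 -p "Hello"

With -mg 0 or -mg 1 (presumably the two x1-connected cards in my enumeration), the same command produces the garbage output.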

Dragomir-Ivanov commented 9 months ago

Hi @stejpet. Can you share a bit more about your experience with llama.cpp and the MI25, apart from the issue you are having? Is the performance good? Which models are you using, and at what size and quantization level? Since the MI25 is the most affordable accelerator one may have (apart from A$ MAX), it would be a great starting point.

ccbadd commented 9 months ago

I'm curious about the performance of an MI25 also. I have a pair of MI100s and a pair of W6800s, and the W6800s are a lot faster than the MI100s. I just don't think llama.cpp/KoboldCpp have the correct tunings for the MI cards.

stejpet commented 9 months ago

@Dragomir-Ivanov @ccbadd

CPU: Ryzen 9 5900X, 32 GB RAM, 500 GB M.2 SSD. GPU: 4 x AMD Instinct MI25, flashed to AMD Pro WX9100 firmware. I think that limits the performance a bit compared to the original firmware because the power limit is a bit lower. The loading time when using 3 or 4 GPUs is horrendous because 2 out of the 4 cards only get PCIe x1. Other than that I'm actually quite happy with the performance... but I don't really know what to expect. It's not the fastest, but for the price the amount of VRAM is not bad.

I haven't tried any bigger models yet because of the issue I was having, but I'm currently downloading a quant of Goliath 120b to test.

openhermes-2.5-mistral-7b Q8_0
1 GPU, 4 threads
llama_print_timings:        load time =    2493,93 ms
llama_print_timings:      sample time =      39,35 ms /   382 runs   (    0,10 ms per token,  9707,26 tokens per second)
llama_print_timings: prompt eval time =     421,79 ms /    13 tokens (   32,45 ms per token,    30,82 tokens per second)
llama_print_timings:        eval time =   12719,23 ms /   381 runs   (   33,38 ms per token,    29,95 tokens per second)
llama_print_timings:       total time =   13247,96 ms
Log end

openhermes-2.5-mistral-7b Q8_0
1 GPU, 8 threads
llama_print_timings:        load time =    2493,50 ms
llama_print_timings:      sample time =      26,35 ms /   252 runs   (    0,10 ms per token,  9565,02 tokens per second)
llama_print_timings: prompt eval time =     421,66 ms /    13 tokens (   32,44 ms per token,    30,83 tokens per second)
llama_print_timings:        eval time =    8252,78 ms /   251 runs   (   32,88 ms per token,    30,41 tokens per second)
llama_print_timings:       total time =    8746,17 ms
Log end

openhermes-2.5-mistral-7b Q8_0
2 GPUs, 8 threads
llama_print_timings:        load time =    2484,57 ms
llama_print_timings:      sample time =      53,00 ms /   469 runs   (    0,11 ms per token,  8848,72 tokens per second)
llama_print_timings: prompt eval time =     344,77 ms /    13 tokens (   26,52 ms per token,    37,71 tokens per second)
llama_print_timings:        eval time =   14292,65 ms /   468 runs   (   30,54 ms per token,    32,74 tokens per second)
llama_print_timings:       total time =   14839,15 ms
Log end

openhermes-2.5-mistral-7b Q8_0
2 GPUs, 4 threads
llama_print_timings:        load time =    2469,39 ms
llama_print_timings:      sample time =      35,05 ms /   317 runs   (    0,11 ms per token,  9045,25 tokens per second)
llama_print_timings: prompt eval time =     344,69 ms /    13 tokens (   26,52 ms per token,    37,71 tokens per second)
llama_print_timings:        eval time =    9451,28 ms /   316 runs   (   29,91 ms per token,    33,43 tokens per second)
llama_print_timings:       total time =    9932,25 ms
Log end

Mixtral-8x7b-instruct-v.01 Q4_K_M
2 GPUs, 8 threads
llama_print_timings:        load time =    8300,24 ms
llama_print_timings:      sample time =      42,52 ms /   392 runs   (    0,11 ms per token,  9219,41 tokens per second)
llama_print_timings: prompt eval time =    1107,41 ms /    13 tokens (   85,19 ms per token,    11,74 tokens per second)
llama_print_timings:        eval time =   21596,15 ms /   391 runs   (   55,23 ms per token,    18,11 tokens per second)
llama_print_timings:       total time =   22817,07 ms
Log end

Mixtral-8x7b-instruct-v.01 Q4_K_M
2 GPUs, 4 threads
llama_print_timings:        load time =    8059,38 ms
llama_print_timings:      sample time =      45,76 ms /   414 runs   (    0,11 ms per token,  9047,60 tokens per second)
llama_print_timings: prompt eval time =    1109,74 ms /    13 tokens (   85,36 ms per token,    11,71 tokens per second)
llama_print_timings:        eval time =   22879,61 ms /   413 runs   (   55,40 ms per token,    18,05 tokens per second)
llama_print_timings:       total time =   24111,31 ms
Log end

Mixtral-8x7b-instruct-v.01 Q4_K_M
3 GPUs, 8 threads
llama_print_timings:        load time =   27308,16 ms
llama_print_timings:      sample time =      60,46 ms /   527 runs   (    0,11 ms per token,  8715,93 tokens per second)
llama_print_timings: prompt eval time =    1192,08 ms /    13 tokens (   91,70 ms per token,    10,91 tokens per second)
llama_print_timings:        eval time =   32924,88 ms /   526 runs   (   62,59 ms per token,    15,98 tokens per second)
llama_print_timings:       total time =   34276,02 ms
Log end

Mixtral-8x7b-instruct-v.01 Q6_K
3 GPUs, auto thread count
llama_print_timings:        load time =  232688,58 ms
llama_print_timings:      sample time =      28,90 ms /   258 runs   (    0,11 ms per token,  8928,26 tokens per second)
llama_print_timings: prompt eval time =    1373,24 ms /    13 tokens (  105,63 ms per token,     9,47 tokens per second)
llama_print_timings:        eval time =   16771,73 ms /   257 runs   (   65,26 ms per token,    15,32 tokens per second)
llama_print_timings:       total time =   18245,46 ms
Log end

Mixtral-8x7b-instruct-v.01 Q6_K
3 GPUs, 8 threads
llama_print_timings:        load time =  237449,99 ms
llama_print_timings:      sample time =      49,08 ms /   445 runs   (    0,11 ms per token,  9066,46 tokens per second)
llama_print_timings: prompt eval time =    1372,15 ms /    13 tokens (  105,55 ms per token,     9,47 tokens per second)
llama_print_timings:        eval time =   29291,68 ms /   444 runs   (   65,97 ms per token,    15,16 tokens per second)
llama_print_timings:       total time =   30816,67 ms
Log end

Mixtral-8x7b-instruct-v.01 Q6_K
4 GPUs, auto thread count
llama_print_timings:        load time =  330726,23 ms
llama_print_timings:      sample time =      36,15 ms /   312 runs   (    0,12 ms per token,  8630,71 tokens per second)
llama_print_timings: prompt eval time =    1382,63 ms /    13 tokens (  106,36 ms per token,     9,40 tokens per second)
llama_print_timings:        eval time =   21865,47 ms /   311 runs   (   70,31 ms per token,    14,22 tokens per second)
llama_print_timings:       total time =   23362,63 ms
Log end

Mixtral-8x7b-instruct-v.01 Q6_K
4 GPUs, 8 threads
llama_print_timings:        load time =  391463,63 ms
llama_print_timings:      sample time =      55,52 ms /   501 runs   (    0,11 ms per token,  9023,29 tokens per second)
llama_print_timings: prompt eval time =    1400,78 ms /    13 tokens (  107,75 ms per token,     9,28 tokens per second)
llama_print_timings:        eval time =   35364,99 ms /   500 runs   (   70,73 ms per token,    14,14 tokens per second)
llama_print_timings:       total time =   36936,06 ms
Log end

Goliath-120b.Q3_K_M
4 GPUs, auto thread count
llama_print_timings:        load time =  881382,35 ms
llama_print_timings:      sample time =      94,79 ms /   820 runs   (    0,12 ms per token,  8651,07 tokens per second)
llama_print_timings: prompt eval time =    4519,14 ms /    13 tokens (  347,63 ms per token,     2,88 tokens per second)
llama_print_timings:        eval time =  282743,99 ms /   819 runs   (  345,23 ms per token,     2,90 tokens per second)
llama_print_timings:       total time =  288301,41 ms
Log end
leucome commented 9 months ago

Just to give an idea of how it compares to a gaming GPU. For Mixtral 8x7B Q4 with 22 layers offloaded, the GPU is almost full, though it is still usable with the rest on the CPU. I don't have any big 120B model.

./llama-bench -m /AI/llama.cpp/llama.cpp/models/AI/dolphin-2.0-mistral-7b.Q8_0.gguf -b 512 -t 6 -ngl 33
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl | test       |              t/s |
| llama 7B Q8_0                  |   7.17 GiB |     7.24 B | ROCm       |  33 | pp 512     |  2402.21 ± 34.47 |
| llama 7B Q8_0                  |   7.17 GiB |     7.24 B | ROCm       |  33 | tg 128     |     67.20 ± 0.02 |
-- More details --
 | llama 7B Q8_0                  |   7.17 GiB |     7.24 B | ROCm       |  33 | pp 512     |  2376.61 ± 80.07 |
 llama_print_timings:        load time =    1193.32 ms
 llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
 llama_print_timings: prompt eval time =    1158.94 ms /  2562 tokens (    0.45 ms per token,  2210.64 tokens per second)
 llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
 llama_print_timings:       total time =    2271.03 ms
 llama_new_context_with_model: n_ctx      = 128
 llama_new_context_with_model: freq_base  = 10000.0
 llama_new_context_with_model: freq_scale = 1
 llama_kv_cache_init: VRAM kv self = 16.00 MB
 llama_new_context_with_model: KV self size  =   16.00 MiB, K (f16):    8.00 MiB, V (f16):    8.00 MiB
 llama_build_graph: non-view tensors processed: 676/676
 llama_new_context_with_model: compute buffer total size = 23.25 MiB
 llama_new_context_with_model: VRAM scratch buffer: 20.06 MiB
 llama_new_context_with_model: total VRAM used: 7241.89 MiB (model: 7205.83 MiB, context: 36.06 MiB)
 | llama 7B Q8_0                  |   7.17 GiB |     7.24 B | ROCm       |  33 | tg 128     |     67.24 ± 0.03 |
 llama_print_timings:        load time =    2292.88 ms
 llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
 llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
 llama_print_timings:        eval time =    9533.84 ms /   641 runs   (   14.87 ms per token,    67.23 tokens per second)
 llama_print_timings:       total time =   11811.03 ms  
 
./llama-bench -m /AI/llama.cpp/llama.cpp/models/AI/Mixtral-4x7B-DPO-RPChat.q5_k_m.gguf -b 512 -t 6 -ngl 33
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl | test       |              t/s |
| llama 7B Q5_K - Medium         |  15.49 GiB |    24.15 B | ROCm       |  33 | pp 512     |    458.22 ± 4.33 |
| llama 7B Q5_K - Medium         |  15.49 GiB |    24.15 B | ROCm       |  33 | tg 128     |     39.18 ± 0.02 |
-- More details --
| llama 7B Q5_K - Medium         |  15.49 GiB |    24.15 B | ROCm       |  33 | pp 512     |    458.13 ± 4.80 |
llama_print_timings:        load time =    2493.31 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =    5813.72 ms /  2562 tokens (    2.27 ms per token,   440.68 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =    8081.19 ms
llama_new_context_with_model: n_ctx      = 128
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 16.00 MB
llama_new_context_with_model: KV self size  =   16.00 MiB, K (f16):    8.00 MiB, V (f16):    8.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 31.82 MiB
llama_new_context_with_model: VRAM scratch buffer: 28.63 MiB
llama_new_context_with_model: total VRAM used: 15821.68 MiB (model: 15777.05 MiB, context: 44.63 MiB)
| llama 7B Q5_K - Medium         |  15.49 GiB |    24.15 B | ROCm       |  33 | tg 128     |     38.96 ± 0.01 |
llama_print_timings:        load time =    8112.36 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   16452.76 ms /   641 runs   (   25.67 ms per token,    38.96 tokens per second)
llama_print_timings:       total time =   24538.19 ms
 
./llama-bench -m /AI/llama.cpp/llama.cpp/models/AI/dolphin-2.6-mixtral-8x7b.Q4_K_M.gguf -b 512 -t 6 -ngl 22
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: no

| model                          |       size |     params | backend    | ngl | test       |              t/s |
| llama 7B Q4_K - Medium         |  24.62 GiB |    46.70 B | ROCm       |  22 | pp 512     |   241.64 ± 4.83  |
| llama 7B Q4_K - Medium         |  24.62 GiB |    46.70 B | ROCm       |  22 | tg 128     |     13.00 ± 0.16 |
-- More details --
| llama 7B Q4_K - Medium         |  24.62 GiB |    46.70 B | ROCm       |  22 | pp 512     |   232.75 ± 21.13 |
llama_print_timings:        load time =    2566.41 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =   11299.75 ms /  2562 tokens (    4.41 ms per token,   226.73 tokens per second)
llama_print_timings:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =   13644.55 ms
llama_new_context_with_model: n_ctx      = 128
llama_new_context_with_model: freq_base  = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 11.00 MB
llama_new_context_with_model: KV self size  =   16.00 MiB, K (f16):    8.00 MiB, V (f16):    8.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 31.82 MiB
llama_new_context_with_model: VRAM scratch buffer: 28.63 MiB
llama_new_context_with_model: total VRAM used: 17256.70 MiB (model: 17217.06 MiB, context: 39.63 MiB)
| llama 7B Q4_K - Medium         |  24.62 GiB |    46.70 B | ROCm       |  22 | tg 128     |     12.97 ± 0.10 |
llama_print_timings:        load time =   13737.09 ms
llama_print_timings:      sample time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_print_timings:        eval time =   49417.55 ms /   641 runs   (   77.09 ms per token,    12.97 tokens per second)
llama_print_timings:       total time =   63067.55 ms
 
github-actions[bot] commented 7 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 6 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.