Closed by stejpet 6 months ago
Hi @stejpet
Can you share a bit more about your experience with llama.cpp and the MI25, apart from the issue you are having?
Is the performance good?
Which models are you using, and at what sizes?
What quantization levels are you using?
Since the MI25 is the most affordable accelerator one can get (apart from A$ MAX), it would be a great starting point.
I'm curious about the performance of the MI25 as well. I have a pair of MI100s and a pair of W6800s, and the W6800s are a lot faster than the MI100s. I just don't think llama.cpp/Koboldcpp have the correct tunings for the MI cards.
@Dragomir-Ivanov @ccbadd
CPU: Ryzen 9 5900X, 32GB RAM, 500GB M.2 SSD. GPU: 4 x AMD Instinct MI25, flashed with AMD Radeon Pro WX 9100 firmware. I think that limits performance a bit compared to the original firmware, because the power limit is slightly lower. The loading time when using 3 or 4 GPUs is horrendous because two of the four cards only get PCIe x1. Other than that I'm actually quite happy with the performance... but I don't really know what to expect. It's not the fastest, but the amount of VRAM for the price is not bad.
I haven't tried any bigger models yet because of the issue I was having, but I'm currently downloading a quant of Goliath 120b to test.
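For reference, the multi-GPU runs below use llama.cpp's standard offload flags. A minimal sketch of such an invocation, just printed rather than executed (the model path is an example; `-ngl`/`--n-gpu-layers`, `-ts`/`--tensor-split`, and `-mg`/`--main-gpu` are real llama.cpp options):

```shell
# Hypothetical multi-GPU run; the model path is an example.
MODEL=models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf
# Offload all layers (-ngl 99), split tensors evenly across 4 GPUs (-ts),
# and use GPU 2 as the main device (-mg):
CMD="./main -m $MODEL -ngl 99 -ts 1,1,1,1 -mg 2"
echo "$CMD"
```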
openhermes-2.5-mistral-7b Q8_0
1 GPU, 4 threads
llama_print_timings: load time = 2493,93 ms
llama_print_timings: sample time = 39,35 ms / 382 runs ( 0,10 ms per token, 9707,26 tokens per second)
llama_print_timings: prompt eval time = 421,79 ms / 13 tokens ( 32,45 ms per token, 30,82 tokens per second)
llama_print_timings: eval time = 12719,23 ms / 381 runs ( 33,38 ms per token, 29,95 tokens per second)
llama_print_timings: total time = 13247,96 ms
Log end
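As a sanity check, the per-token and tokens-per-second figures in these logs are two views of the same measurement. Recomputing them from the eval line above:

```shell
# 12719.23 ms over 381 runs -> ms per token, then tokens per second
awk 'BEGIN {
  ms = 12719.23 / 381
  printf "%.2f ms per token, %.2f tokens per second\n", ms, 1000 / ms
}'
```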
openhermes-2.5-mistral-7b Q8_0
1 GPU, 8 threads
llama_print_timings: load time = 2493,50 ms
llama_print_timings: sample time = 26,35 ms / 252 runs ( 0,10 ms per token, 9565,02 tokens per second)
llama_print_timings: prompt eval time = 421,66 ms / 13 tokens ( 32,44 ms per token, 30,83 tokens per second)
llama_print_timings: eval time = 8252,78 ms / 251 runs ( 32,88 ms per token, 30,41 tokens per second)
llama_print_timings: total time = 8746,17 ms
Log end
openhermes-2.5-mistral-7b Q8_0
2 GPUs, 8 threads
llama_print_timings: load time = 2484,57 ms
llama_print_timings: sample time = 53,00 ms / 469 runs ( 0,11 ms per token, 8848,72 tokens per second)
llama_print_timings: prompt eval time = 344,77 ms / 13 tokens ( 26,52 ms per token, 37,71 tokens per second)
llama_print_timings: eval time = 14292,65 ms / 468 runs ( 30,54 ms per token, 32,74 tokens per second)
llama_print_timings: total time = 14839,15 ms
Log end
openhermes-2.5-mistral-7b Q8_0
2 GPUs, 4 threads
llama_print_timings: load time = 2469,39 ms
llama_print_timings: sample time = 35,05 ms / 317 runs ( 0,11 ms per token, 9045,25 tokens per second)
llama_print_timings: prompt eval time = 344,69 ms / 13 tokens ( 26,52 ms per token, 37,71 tokens per second)
llama_print_timings: eval time = 9451,28 ms / 316 runs ( 29,91 ms per token, 33,43 tokens per second)
llama_print_timings: total time = 9932,25 ms
Log end
Mixtral-8x7b-instruct-v0.1 Q4_K_M
2 GPUs, 8 threads
llama_print_timings: load time = 8300,24 ms
llama_print_timings: sample time = 42,52 ms / 392 runs ( 0,11 ms per token, 9219,41 tokens per second)
llama_print_timings: prompt eval time = 1107,41 ms / 13 tokens ( 85,19 ms per token, 11,74 tokens per second)
llama_print_timings: eval time = 21596,15 ms / 391 runs ( 55,23 ms per token, 18,11 tokens per second)
llama_print_timings: total time = 22817,07 ms
Log end
Mixtral-8x7b-instruct-v0.1 Q4_K_M
2 GPUs, 4 threads
llama_print_timings: load time = 8059,38 ms
llama_print_timings: sample time = 45,76 ms / 414 runs ( 0,11 ms per token, 9047,60 tokens per second)
llama_print_timings: prompt eval time = 1109,74 ms / 13 tokens ( 85,36 ms per token, 11,71 tokens per second)
llama_print_timings: eval time = 22879,61 ms / 413 runs ( 55,40 ms per token, 18,05 tokens per second)
llama_print_timings: total time = 24111,31 ms
Log end
Mixtral-8x7b-instruct-v0.1 Q4_K_M
3 GPUs, 8 threads
llama_print_timings: load time = 27308,16 ms
llama_print_timings: sample time = 60,46 ms / 527 runs ( 0,11 ms per token, 8715,93 tokens per second)
llama_print_timings: prompt eval time = 1192,08 ms / 13 tokens ( 91,70 ms per token, 10,91 tokens per second)
llama_print_timings: eval time = 32924,88 ms / 526 runs ( 62,59 ms per token, 15,98 tokens per second)
llama_print_timings: total time = 34276,02 ms
Log end
Mixtral-8x7b-instruct-v0.1 Q6_K
3 GPUs, Auto N threads
llama_print_timings: load time = 232688,58 ms
llama_print_timings: sample time = 28,90 ms / 258 runs ( 0,11 ms per token, 8928,26 tokens per second)
llama_print_timings: prompt eval time = 1373,24 ms / 13 tokens ( 105,63 ms per token, 9,47 tokens per second)
llama_print_timings: eval time = 16771,73 ms / 257 runs ( 65,26 ms per token, 15,32 tokens per second)
llama_print_timings: total time = 18245,46 ms
Log end
Mixtral-8x7b-instruct-v0.1 Q6_K
3 GPUs, 8 threads
llama_print_timings: load time = 237449,99 ms
llama_print_timings: sample time = 49,08 ms / 445 runs ( 0,11 ms per token, 9066,46 tokens per second)
llama_print_timings: prompt eval time = 1372,15 ms / 13 tokens ( 105,55 ms per token, 9,47 tokens per second)
llama_print_timings: eval time = 29291,68 ms / 444 runs ( 65,97 ms per token, 15,16 tokens per second)
llama_print_timings: total time = 30816,67 ms
Log end
Mixtral-8x7b-instruct-v0.1 Q6_K
4 GPUs, Auto N threads
llama_print_timings: load time = 330726,23 ms
llama_print_timings: sample time = 36,15 ms / 312 runs ( 0,12 ms per token, 8630,71 tokens per second)
llama_print_timings: prompt eval time = 1382,63 ms / 13 tokens ( 106,36 ms per token, 9,40 tokens per second)
llama_print_timings: eval time = 21865,47 ms / 311 runs ( 70,31 ms per token, 14,22 tokens per second)
llama_print_timings: total time = 23362,63 ms
Log end
Mixtral-8x7b-instruct-v0.1 Q6_K
4 GPUs, 8 threads
llama_print_timings: load time = 391463,63 ms
llama_print_timings: sample time = 55,52 ms / 501 runs ( 0,11 ms per token, 9023,29 tokens per second)
llama_print_timings: prompt eval time = 1400,78 ms / 13 tokens ( 107,75 ms per token, 9,28 tokens per second)
llama_print_timings: eval time = 35364,99 ms / 500 runs ( 70,73 ms per token, 14,14 tokens per second)
llama_print_timings: total time = 36936,06 ms
Log end
Goliath-120b.Q3_K_M
4 GPUs, Auto N threads
llama_print_timings: load time = 881382,35 ms
llama_print_timings: sample time = 94,79 ms / 820 runs ( 0,12 ms per token, 8651,07 tokens per second)
llama_print_timings: prompt eval time = 4519,14 ms / 13 tokens ( 347,63 ms per token, 2,88 tokens per second)
llama_print_timings: eval time = 282743,99 ms / 819 runs ( 345,23 ms per token, 2,90 tokens per second)
llama_print_timings: total time = 288301,41 ms
Log end
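One thing worth noting in the Mixtral Q6_K numbers above: going from 3 to 4 GPUs actually reduces eval throughput (15.32 → 14.22 t/s with auto threads), which may reflect the fourth card sitting on a slow PCIe x1 link. The relative slowdown:

```shell
# Figures taken from the Q6_K logs above (3 GPUs vs 4 GPUs, auto threads)
awk 'BEGIN { printf "eval slowdown with 4th GPU: %.1f%%\n", (1 - 14.22 / 15.32) * 100 }'
```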
Just to give an idea of how it compares to a gaming GPU. For Mixtral 8x7B Q4 with 22 layers offloaded the GPU is almost full, though it is still usable with the rest on the CPU. I don't have any big 120b models.
./llama-bench -m /AI/llama.cpp/llama.cpp/models/AI/dolphin-2.0-mistral-7b.Q8_0.gguf -b 512 -t 6 -ngl 33
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: no

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 33 | pp 512 | 2402.21 ± 34.47 |
| llama 7B Q8_0 | 7.17 GiB | 7.24 B | ROCm | 33 | tg 128 | 67.20 ± 0.02 |

Verbose log, pp 512 run (2376.61 ± 80.07 t/s):
llama_print_timings: load time = 1193.32 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 1158.94 ms / 2562 tokens ( 0.45 ms per token, 2210.64 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 2271.03 ms
llama_new_context_with_model: n_ctx = 128
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 16.00 MB
llama_new_context_with_model: KV self size = 16.00 MiB, K (f16): 8.00 MiB, V (f16): 8.00 MiB
llama_build_graph: non-view tensors processed: 676/676
llama_new_context_with_model: compute buffer total size = 23.25 MiB
llama_new_context_with_model: VRAM scratch buffer: 20.06 MiB
llama_new_context_with_model: total VRAM used: 7241.89 MiB (model: 7205.83 MiB, context: 36.06 MiB)

Verbose log, tg 128 run (67.24 ± 0.03 t/s):
llama_print_timings: load time = 2292.88 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 9533.84 ms / 641 runs ( 14.87 ms per token, 67.23 tokens per second)
llama_print_timings: total time = 11811.03 ms
./llama-bench -m /AI/llama.cpp/llama.cpp/models/AI/Mixtral-4x7B-DPO-RPChat.q5_k_m.gguf -b 512 -t 6 -ngl 33
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: no

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q5_K - Medium | 15.49 GiB | 24.15 B | ROCm | 33 | pp 512 | 458.22 ± 4.33 |
| llama 7B Q5_K - Medium | 15.49 GiB | 24.15 B | ROCm | 33 | tg 128 | 39.18 ± 0.02 |

Verbose log, pp 512 run (458.13 ± 4.80 t/s):
llama_print_timings: load time = 2493.31 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 5813.72 ms / 2562 tokens ( 2.27 ms per token, 440.68 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 8081.19 ms
llama_new_context_with_model: n_ctx = 128
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 16.00 MB
llama_new_context_with_model: KV self size = 16.00 MiB, K (f16): 8.00 MiB, V (f16): 8.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 31.82 MiB
llama_new_context_with_model: VRAM scratch buffer: 28.63 MiB
llama_new_context_with_model: total VRAM used: 15821.68 MiB (model: 15777.05 MiB, context: 44.63 MiB)

Verbose log, tg 128 run (38.96 ± 0.01 t/s):
llama_print_timings: load time = 8112.36 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 16452.76 ms / 641 runs ( 25.67 ms per token, 38.96 tokens per second)
llama_print_timings: total time = 24538.19 ms
./llama-bench -m /AI/llama.cpp/llama.cpp/models/AI/dolphin-2.6-mixtral-8x7b.Q4_K_M.gguf -b 512 -t 6 -ngl 22
Device 0: AMD Radeon RX 7900 XT, compute capability 11.0, VMM: no

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K - Medium | 24.62 GiB | 46.70 B | ROCm | 22 | pp 512 | 241.64 ± 4.83 |
| llama 7B Q4_K - Medium | 24.62 GiB | 46.70 B | ROCm | 22 | tg 128 | 13.00 ± 0.16 |

Verbose log, pp 512 run (232.75 ± 21.13 t/s):
llama_print_timings: load time = 2566.41 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 11299.75 ms / 2562 tokens ( 4.41 ms per token, 226.73 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 13644.55 ms
llama_new_context_with_model: n_ctx = 128
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: VRAM kv self = 11.00 MB
llama_new_context_with_model: KV self size = 16.00 MiB, K (f16): 8.00 MiB, V (f16): 8.00 MiB
llama_build_graph: non-view tensors processed: 1124/1124
llama_new_context_with_model: compute buffer total size = 31.82 MiB
llama_new_context_with_model: VRAM scratch buffer: 28.63 MiB
llama_new_context_with_model: total VRAM used: 17256.70 MiB (model: 17217.06 MiB, context: 39.63 MiB)

Verbose log, tg 128 run (12.97 ± 0.10 t/s):
llama_print_timings: load time = 13737.09 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: prompt eval time = 0.00 ms / 1 tokens ( 0.00 ms per token, inf tokens per second)
llama_print_timings: eval time = 49417.55 ms / 641 runs ( 77.09 ms per token, 12.97 tokens per second)
llama_print_timings: total time = 63067.55 ms
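When comparing many runs like these, it helps to pull the throughput figures out of the logs mechanically. A minimal sketch against the llama_print_timings line format used throughout this thread (the sample line is copied from the tg 128 run above):

```shell
# Extract the tokens-per-second figure from an eval timing line.
line='llama_print_timings: eval time = 49417.55 ms / 641 runs ( 77.09 ms per token, 12.97 tokens per second)'
echo "$line" | grep -oE '[0-9.]+ tokens per second'
```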
This issue is stale because it has been open for 30 days with no activity.
This issue was closed because it has been inactive for 14 days since being marked as stale.
Hi. I have noticed from my quick testing that setting the main GPU is critical for the model to function.
My setup: 4 x AMD WX9100 (Instinct MI25), 2 connected via x16 -> 4x x1 PCIe splitters, 2 connected at x16 to the motherboard. Ubuntu 22.04.3, kernel 6.2.0-39-generic, ROCm 5.7.
Loading a model across all four cards requires that I use -mg 2 or -mg 3; these are the two cards connected directly to the motherboard's x16 slots.
Trying to load a model with either of the two x1-connected cards as the main GPU results in garbage output.
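A quick way to find which device index works as the main GPU is to try each in turn. A sketch that just prints the candidate commands rather than running them (`-mg` is llama.cpp's `--main-gpu` flag; the model path is an example):

```shell
# Try each of the four cards as main GPU; on the setup above, only the
# x16-connected devices (2 and 3) produce valid output.
for mg in 0 1 2 3; do
  echo "./main -m models/model.gguf -ngl 99 -mg $mg -n 16 -p 'test'"
done
```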