I have a machine with a lot of old parts in it, including 8 P40s and 2 Xeon E5-2667v2 CPUs.
I build llama.cpp using:

```shell
cmake -DLLAMA_AVX2=off -DLLAMA_F16C=off -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on
```
Using a llama2-70b-Q8_0 model, I see good results with release b1842 and earlier. With b1843 and newer (from January 12, the first release to include #4766), I see a ~62% drop in generation speed:
```shell
bin/main -m ../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf -ngl 99 -p "Why is the sky blue?" -n 128
```
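For more controlled comparisons across builds, llama.cpp's bundled `llama-bench` tool could be used instead of timing `main` by hand (a sketch, not verified on this setup; the model path is the one from above, and `-n 128` matches the generation length used in the tests below):

```shell
# Benchmark generation speed with all layers offloaded, as in the main test
bin/llama-bench -m ../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf -ngl 99 -n 128
```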
Repeating the test with other models, the discrepancy is much smaller for smaller models, to the point that the 8B model is actually considerably faster with the latest release:
| Model | b1842 | b1843 | b2709 |
|---|---|---|---|
| Synthia-70b-v1.2.Q8_0 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| phind-codellama-34b-v2.Q8_0 | 16.99 t/s | 7.54 t/s | 7.78 t/s |
| llama-2-13b-Q8_0 | 21.10 t/s | 17.67 t/s | 18.63 t/s |
| Meta-Llama-3-8B-Instruct.Q8_0 | 25.66 t/s | 33.27 t/s | 31.83 t/s |
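The ~62% figure can be checked from the 70b numbers (a quick arithmetic sketch using the b1842 and b1843 rates):

```shell
# Percentage drop from b1842 (9.76 t/s) to b1843 (3.73 t/s) for the 70b model
awk 'BEGIN { printf "%.1f%% drop\n", (1 - 3.73 / 9.76) * 100 }'
# prints "61.8% drop"
```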
Using fewer GPUs for this test (with the 70b model) makes b1842 a bit slower, but otherwise doesn't seem to change the result much:
| GPUs | b1842 | b1843 | b2709 |
|---|---|---|---|
| 8 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| 4 | 9.61 t/s | 3.77 t/s | 3.89 t/s |
| 3 | 8.32 t/s | 3.77 t/s | 3.91 t/s |
Changing the CPU thread count (with the 70b model) yields only marginal improvements within each build and does not close the larger gap:
| Threads | b1842 | b2709 |
|---|---|---|
| -t 1 | 10.05 t/s | 3.90 t/s |
| -t 4 | 10.06 t/s | 3.90 t/s |
| -t 8 | 10.09 t/s | 3.90 t/s |
The system is similar in topology to a Supermicro SYS-4028GR-TR2. The GPUs are all attached at PCIe 3.0 x16 behind PLX switches and have relatively good CPU and P2P bandwidth over PCIe: 11-13 GB/s between any pair.
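For reference, the interconnect layout and P2P bandwidth can be inspected with standard NVIDIA tooling (a sketch; `p2pBandwidthLatencyTest` comes from NVIDIA's separate cuda-samples repository, not from llama.cpp):

```shell
# Show the PCIe/PLX topology matrix between GPUs and CPU sockets
nvidia-smi topo -m

# Measure pairwise P2P bandwidth (built from NVIDIA's cuda-samples repo)
./p2pBandwidthLatencyTest
```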
Per-build results for the 70b model:

| Build | Speed |
|---|---|
| b1691 | 10.76 t/s |
| b1767 | 9.75 t/s |
| b1808 | 9.76 t/s |
| b1832 | 9.77 t/s |
| b1842 | 9.76 t/s |
| b1843 | 3.73 t/s |
| b2400 | 3.83 t/s |
| b2709 | 3.84 t/s |
Any ideas?