ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Performance degradation with P40 on larger models #6814

Closed (by samr7, 3 months ago)

samr7 commented 3 months ago

I have a machine with a lot of old parts in it, including 8 P40s and 2 Xeon E5-2667v2 CPUs.

I build llama.cpp using:

```shell
cmake -DLLAMA_AVX2=off -DLLAMA_F16C=off -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on
```

Using a llama2-70b-Q8_0 model, I see good results with release b1842 and earlier. With b1843 (January 12, which included #4766) and newer, I see a ~62% drop:

```shell
bin/main -m ../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf -ngl 99 -p "Why is the sky blue?" -n 128
```

| Build | Speed |
|-------|-----------|
| b1691 | 10.76 t/s |
| b1767 | 9.75 t/s |
| b1808 | 9.76 t/s |
| b1832 | 9.77 t/s |
| b1842 | 9.76 t/s |
| b1843 | 3.73 t/s |
| b2400 | 3.83 t/s |
| b2709 | 3.84 t/s |

Trying the test with some other models, the discrepancy is much smaller for smaller models, to the point that the 8B model is considerably faster with the latest release:

| Model | b1842 | b1843 | b2709 |
|-------|-------|-------|-------|
| Synthia-70b-v1.2.Q8_0 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| phind-codellama-34b-v2.Q8_0 | 16.99 t/s | 7.54 t/s | 7.78 t/s |
| llama-2-13b-Q8_0 | 21.10 t/s | 17.67 t/s | 18.63 t/s |
| Meta-Llama-3-8B-Instruct.Q8_0 | 25.66 t/s | 33.27 t/s | 31.83 t/s |

Using fewer GPUs for this test (with the 70b model) makes b1842 a bit slower, but otherwise doesn't seem to change the result much:

| GPUs | b1842 | b1843 | b2709 |
|------|-------|-------|-------|
| 8 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| 4 | 9.61 t/s | 3.77 t/s | 3.89 t/s |
| 3 | 8.32 t/s | 3.77 t/s | 3.91 t/s |

Changing the CPU thread count (with the 70b model) yields only marginal improvements within each build and does not close the larger gap:

| Threads | b1842 | b2709 |
|---------|-------|-------|
| -t 1 | 10.05 t/s | 3.90 t/s |
| -t 4 | 10.06 t/s | 3.90 t/s |
| -t 8 | 10.09 t/s | 3.90 t/s |

The system is similar in topology to a Supermicro SYS-4028GR-TR2. The GPUs are all attached via PCIe 3.0 x16 to PLX switches and have relatively good CPU and P2P bandwidth over PCIe: 11-13 GB/s between any pair.

Any ideas?

slaren commented 3 months ago

Try `-sm row`.
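For context: `-sm` selects llama.cpp's multi-GPU split mode. The default, `layer`, assigns whole layers to each GPU; `row` instead shards individual weight matrices across the GPUs. A minimal sketch of the reporter's invocation with the suggested flag added (same model path and flags as used earlier in this thread; not independently benchmarked here):

```shell
# Split tensors by rows across the available GPUs instead of
# assigning whole layers per GPU (-sm layer is the default).
bin/main -m ../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf \
    -ngl 99 -sm row -p "Why is the sky blue?" -n 128
```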

samr7 commented 3 months ago

`-sm row` seems to improve things a lot! Thanks.