I have a machine with a lot of old parts in it, including 8 P40s and 2 Xeon E5-2667v2 CPUs.
I build llama.cpp using:

```shell
cmake -DLLAMA_AVX2=off -DLLAMA_F16C=off -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on
```
Using a llama2-70b-Q8_0 model, I see good results with release b1842 and earlier. With b1843 and newer (from January 12, the first release to include #4766), I see a ~62% drop in generation speed:
```shell
bin/main -m ../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf -ngl 99 -p "Why is the sky blue?" -n 128
```
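For more controlled comparisons across builds, llama.cpp's bundled `llama-bench` tool could be used instead of timing `main` by hand (a sketch, not verified on this setup; the model path is the one from above, and `-n 128` matches the generation length used in the tests below):

```shell
# Benchmark generation speed with all layers offloaded, as in the main test
bin/llama-bench -m ../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf -ngl 99 -n 128
```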
Repeating the test with other models, the discrepancy is much smaller for smaller models, to the point that the 8B model is actually considerably faster with the latest release:
| Model | b1842 | b1843 | b2709 |
|---|---|---|---|
| Synthia-70b-v1.2.Q8_0 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| phind-codellama-34b-v2.Q8_0 | 16.99 t/s | 7.54 t/s | 7.78 t/s |
| llama-2-13b-Q8_0 | 21.10 t/s | 17.67 t/s | 18.63 t/s |
| Meta-Llama-3-8B-Instruct.Q8_0 | 25.66 t/s | 33.27 t/s | 31.83 t/s |
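The ~62% figure can be checked from the 70b numbers (a quick arithmetic sketch using the b1842 and b1843 rates):

```shell
# Percentage drop from b1842 (9.76 t/s) to b1843 (3.73 t/s) for the 70b model
awk 'BEGIN { printf "%.1f%% drop\n", (1 - 3.73 / 9.76) * 100 }'
# prints "61.8% drop"
```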
Using fewer GPUs for this test (with the 70b model) makes b1842 a bit slower, but otherwise doesn't seem to change the result much:
| GPUs | b1842 | b1843 | b2709 |
|---|---|---|---|
| 8 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| 4 | 9.61 t/s | 3.77 t/s | 3.89 t/s |
| 3 | 8.32 t/s | 3.77 t/s | 3.91 t/s |
Changing the CPU thread count (with the 70b model) yields only marginal improvements within each build and does not close the larger gap:
| Threads | b1842 | b2709 |
|---|---|---|
| -t 1 | 10.05 t/s | 3.90 t/s |
| -t 4 | 10.06 t/s | 3.90 t/s |
| -t 8 | 10.09 t/s | 3.90 t/s |
The system is similar in topology to a Supermicro SYS-4028GR-TR2. The GPUs are all attached at PCIe 3.0 x16 behind PLX switches and have relatively good CPU and P2P bandwidth over PCIe: 11-13 GB/s between any pair.
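For reference, the interconnect layout and P2P bandwidth can be inspected with standard NVIDIA tooling (a sketch; `p2pBandwidthLatencyTest` comes from NVIDIA's separate cuda-samples repository, not from llama.cpp):

```shell
# Show the PCIe/PLX topology matrix between GPUs and CPU sockets
nvidia-smi topo -m

# Measure pairwise P2P bandwidth (built from NVIDIA's cuda-samples repo)
./p2pBandwidthLatencyTest
```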
Per-build results for the 70b model:

| Build | Speed |
|---|---|
| b1691 | 10.76 t/s |
| b1767 | 9.75 t/s |
| b1808 | 9.76 t/s |
| b1832 | 9.77 t/s |
| b1842 | 9.76 t/s |
| b1843 | 3.73 t/s |
| b2400 | 3.83 t/s |
| b2709 | 3.84 t/s |
Any ideas?