abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Mixtral A100/H100 performance #1017

Open generalsvr opened 9 months ago

generalsvr commented 9 months ago

I tried playing with the Mixtral Q5_K_M quant on both llama.cpp and llama-cpp-python. Both servers use CUDA 12 (results are the same on 11.6). Here are some results:

llama cpp, A100

llama_print_timings:        load time =    5438.24 ms
llama_print_timings:      sample time =      92.38 ms /   243 runs   (    0.38 ms per token,  2630.35 tokens per second)
llama_print_timings: prompt eval time =     105.61 ms /     9 tokens (   11.73 ms per token,    85.22 tokens per second)
llama_print_timings:        eval time =    5031.24 ms /   242 runs   (   20.79 ms per token,    48.10 tokens per second)
llama_print_timings:       total time =    5324.31 ms

llama cpp, H100

llama_print_timings:        load time =    4422.18 ms
llama_print_timings:      sample time =      28.22 ms /   190 runs   (    0.15 ms per token,  6732.81 tokens per second)
llama_print_timings: prompt eval time =      77.82 ms /     9 tokens (    8.65 ms per token,   115.65 tokens per second)
llama_print_timings:        eval time =    3114.78 ms /   189 runs   (   16.48 ms per token,    60.68 tokens per second)
llama_print_timings:       total time =    3272.63 ms

llama-cpp-python, A100

llama_print_timings:        load time =    3054.10 ms
llama_print_timings:      sample time =     113.27 ms /   269 runs   (    0.42 ms per token,  2374.92 tokens per second)
llama_print_timings: prompt eval time =    3053.75 ms /   243 tokens (   12.57 ms per token,    79.57 tokens per second)
llama_print_timings:        eval time =   11053.97 ms /   268 runs   (   41.25 ms per token,    24.24 tokens per second)
llama_print_timings:       total time =   14967.72 ms

llama-cpp-python, H100

llama_print_timings:        load time =    2572.72 ms
llama_print_timings:      sample time =      44.71 ms /   269 runs   (    0.17 ms per token,  6016.15 tokens per second)
llama_print_timings: prompt eval time =    2572.38 ms /   243 tokens (   10.59 ms per token,    94.47 tokens per second)
llama_print_timings:        eval time =    7424.27 ms /   268 runs   (   27.70 ms per token,    36.10 tokens per second)
llama_print_timings:       total time =   10591.14 ms

So with the same settings I consistently see a 10-50% performance drop in both prompt eval time and generation speed. Is this behavior expected?
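
For reference, a minimal sketch of how the llama-cpp-python side of this comparison could be reproduced; the model path, prompt, and max_tokens below are placeholders rather than the exact settings behind the numbers above:

```python
from llama_cpp import Llama

# Load the Mixtral Q5_K_M GGUF with all layers offloaded to the GPU.
# verbose=True makes llama.cpp print the llama_print_timings block shown above.
llm = Llama(
    model_path="./mixtral-8x7b-v0.1.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,   # offload every layer
    n_ctx=4096,
    verbose=True,
)

# One completion; the timings are printed to stderr once it finishes.
out = llm("Explain mixture-of-experts models in one paragraph.", max_tokens=256)
print(out["choices"][0]["text"])
```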

m-from-space commented 9 months ago

On the other hand, why is the load time so much better with llama-cpp-python than with llama.cpp on your end? o.O

abetlen commented 9 months ago

@generalsvr this could be due to the KQV cache not being offloaded by default; try setting offload_kqv=True on the class or passing it as an environment variable.
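
A minimal sketch of that setting on the Llama class (the model path here is a placeholder):

```python
from llama_cpp import Llama

# Keep the KQV (attention) cache on the GPU alongside the offloaded layers.
llm = Llama(
    model_path="./mixtral-8x7b-v0.1.Q5_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,    # offload all layers to the GPU
    offload_kqv=True,   # offload the KQV cache as well
)
```

For the bundled OpenAI-compatible server, the same field should be settable as a CLI flag or environment variable (e.g. something like --offload_kqv True), since the server options mirror the Llama constructor arguments.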

Kaotic3 commented 8 months ago

This is correct, @abetlen - but this piece of information could use wider distribution.

I have been struggling with this for a while - and found this post from 3 weeks ago today:

https://github.com/abetlen/llama-cpp-python/issues/1054

However, I don't see this nugget of information anywhere in the documentation as a requirement for GPU usage - I mean, this is as important as cuBLAS, isn't it?