generalsvr opened this issue 11 months ago
On the other hand, why is the load time so much better with llama-cpp-python than with llama.cpp on your end? o.O
@generalsvr this could be due to the KQV cache not being offloaded; try setting offload_kqv=True on the Llama class or passing it as an environment variable.
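For anyone landing here later, here is a minimal sketch of what that looks like when constructing the model directly (the model path is a placeholder; offload_kqv is the relevant flag):

```python
from llama_cpp import Llama

# Placeholder path; point this at your own GGUF file.
llm = Llama(
    model_path="./models/mixtral-8x7b-instruct.Q5_K_M.gguf",
    n_gpu_layers=-1,      # offload all layers to the GPU
    n_ctx=4096,
    offload_kqv=True,     # keep the KQV cache on the GPU as well
    verbose=True,         # prints llama.cpp-style timing info
)

out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])
```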
This is correct, @abetlen - but this information deserves a much wider distribution. I have been struggling with this for a while and only found this post from three weeks ago today:
https://github.com/abetlen/llama-cpp-python/issues/1054
However, I don't see this nugget anywhere in the documentation as a requirement for GPU usage - it is arguably as important as CUBLAS, isn't it?
I tried the Mixtral Q5_K_M quant on both the llama.cpp server and the llama-cpp-python server. Both servers use CUDA 12, and the same happens on 11.6. Here are some results:
- llama.cpp, A100 (benchmark screenshot)
- llama.cpp, H100 (benchmark screenshot)
- llama-cpp-python, A100 (benchmark screenshot)
- llama-cpp-python, H100 (benchmark screenshot)
So every time I see a 10-50% performance drop with the same settings, in both prompt eval time and generation speed. Is this behavior expected?
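For reference, a quick way to check whether the KQV offload setting explains a gap like this is to load the same GGUF twice and toggle only that flag; a rough sketch (model path and prompt are placeholders, absolute numbers will differ per GPU):

```python
import time
from llama_cpp import Llama

PROMPT = "Explain the difference between a process and a thread."

def bench(offload_kqv: bool) -> float:
    """Return generation throughput in tokens/s for one setting of offload_kqv."""
    llm = Llama(
        model_path="./models/mixtral-8x7b-instruct.Q5_K_M.gguf",  # placeholder path
        n_gpu_layers=-1,          # offload all layers to the GPU in both runs
        n_ctx=4096,
        offload_kqv=offload_kqv,  # the only setting that changes between runs
        verbose=False,
    )
    start = time.time()
    out = llm(PROMPT, max_tokens=128)
    elapsed = time.time() - start
    return out["usage"]["completion_tokens"] / elapsed

for flag in (True, False):
    print(f"offload_kqv={flag}: {bench(flag):.1f} tok/s")
```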