abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

Llama.generate: prefix-match hit is very slow. #1437

Open ndy200 opened 6 months ago

ndy200 commented 6 months ago

I upgraded from an older version and am experiencing a disturbingly long delay before the response starts. The load on my machine is about the same (a bit higher with Python, but that's understandable). I tried to use the same settings as the llama.cpp binary, on a machine with an NVIDIA card but with n_gpu_layers=0. With the Python binding it can take several seconds before the response starts. Token generation itself runs at a similar speed, but with llama.cpp the response begins immediately, while with the Python binding it takes seconds. Am I the only one experiencing this? I am using a Llama 3 model.

So, the original binary's timings are:

```
llama_print_timings: sample time = 92.31 ms / 1160 runs ( 0.08 ms per token, 12565.67 tokens per second)
```

and llama-cpp-python's are:

```
llama_print_timings: sample time = 99.82 ms / 144 runs ( 0.69 ms per token, 1442.57 tokens per second)
```

This seems like a big difference.
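A minimal sketch of the kind of CPU-only comparison described above (the model path and prompt are placeholders, not taken from the issue); with verbose=True the binding prints the same llama_print_timings lines as the binary:

```python
from llama_cpp import Llama

# Placeholder path; point this at the same GGUF file used with the llama.cpp binary.
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_gpu_layers=0,   # CPU only, matching the comparison above
    n_ctx=2048,
    verbose=True,     # prints llama_print_timings after each call
)

out = llm("Explain what a prefix-match hit is in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```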

woheller69 commented 6 months ago

https://github.com/Maximilian-Winter/llama-cpp-agent/issues/54 This is probably related to my finding that llama-cpp-python (used via llama-cpp-agent) is slower than gpt4all on follow-up prompts; the first prompt is fast.
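A rough way to check whether the slowdown only appears on follow-up prompts (hypothetical prompts and model path; plain wall-clock timing rather than llama_print_timings): reusing the same Llama object, a second prompt that shares a prefix with the first takes the cached prefix-match path in Llama.generate.

```python
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=0,
    verbose=False,
)

prompts = [
    "Q: What is the capital of France?\nA:",                  # first prompt
    "Q: What is the capital of France? And of Spain?\nA:",    # follow-up sharing a prefix
]

for p in prompts:
    start = time.perf_counter()
    llm(p, max_tokens=32)
    print(f"{time.perf_counter() - start:.2f} s for a {len(p)}-char prompt")
```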

woheller69 commented 6 months ago

Related to #1369 ?

aoom commented 2 months ago

I encountered a similar problem: the model loads abnormally slowly on GPU (unified memory on an ARM platform), and on CPU only a single core/thread is used. This problem only exists in the last few releases; it was working fine before.
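One thing worth checking, assuming the single-thread behavior comes from the thread-count defaults rather than a deeper regression (the path and values below are illustrative), is passing the thread counts explicitly when constructing the model:

```python
import multiprocessing
from llama_cpp import Llama

n_cores = multiprocessing.cpu_count()

llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,            # offload all layers if a GPU backend is available
    n_threads=n_cores,          # generation threads
    n_threads_batch=n_cores,    # prompt-processing threads (recent versions)
    verbose=True,
)
```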