abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

llama cpp python server for llava slow token per second #1354

Open Kev1ntan opened 4 months ago

Kev1ntan commented 4 months ago

Darwin Feedloops-Mac-Studio-2.local 23.3.0 Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:00 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6020 arm64

command: python -m llama_cpp.server --model ./llava-v1.6-mistral-7b.Q8_0.gguf --port 9007 --host localhost --n_gpu_layers 33 --chat_format chatml --clip_model_path ./mmproj-mistral7b-f16.gguf
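For comparison, roughly the same setup can be loaded in-process through the Python API instead of the OpenAI-compatible server; a minimal sketch, assuming the Llava15ChatHandler multimodal path and the model files from the command above (the chat handler choice, n_ctx, and logits_all are assumptions, not part of the original command):

# Minimal sketch: load the same GGUF + CLIP projector in-process with
# llama-cpp-python instead of the server. Chat handler, n_ctx and logits_all
# are assumptions here, not taken from the original report.
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

chat_handler = Llava15ChatHandler(clip_model_path="./mmproj-mistral7b-f16.gguf")
llm = Llama(
    model_path="./llava-v1.6-mistral-7b.Q8_0.gguf",
    chat_handler=chat_handler,
    n_gpu_layers=33,   # same offload setting as the server command
    n_ctx=2048,        # assumed context size
    logits_all=True,   # the llava chat handler needs per-token logits
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": [{"type": "text", "text": "hello"}]}],
    max_tokens=1000,
    temperature=0,
)
print(out["choices"][0]["message"]["content"])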

curl --location 'http://localhost:9007/v1/chat/completions' \
  --header 'Authorization: Bearer 1n66q24dexb1cc8abc62b185dee0dd802pn92' \
  --header 'Content-Type: application/json' \
  --data '{
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "hello" }
        ]
      }
    ],
    "max_tokens": 1000,
    "temperature": 0
  }'
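The same request can also be sent from Python with the OpenAI client pointed at the local server; a small sketch, assuming openai>=1.0 (the model name is a placeholder, since the server serves whichever model it loaded):

# Sketch: same chat-completion request via the OpenAI Python client,
# pointed at the local llama-cpp-python server. The model name is a
# placeholder and is not taken from the original report.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:9007/v1",
    api_key="1n66q24dexb1cc8abc62b185dee0dd802pn92",  # token from the curl example
)

resp = client.chat.completions.create(
    model="llava-v1.6-mistral-7b",  # placeholder model name
    messages=[{"role": "user", "content": [{"type": "text", "text": "hello"}]}],
    max_tokens=1000,
    temperature=0,
)
print(resp.choices[0].message.content)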

INFO:     Started server process [71075]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:9007 (Press CTRL+C to quit)

llama_print_timings:        load time =    1491.98 ms
llama_print_timings:      sample time =       2.17 ms /    26 runs   (    0.08 ms per token, 12009.24 tokens per second)
llama_print_timings: prompt eval time =    1491.90 ms /    37 tokens (   40.32 ms per token,    24.80 tokens per second)
llama_print_timings:        eval time =   66226.55 ms /    25 runs   ( 2649.06 ms per token,     0.38 tokens per second)
llama_print_timings:       total time =   67791.77 ms /    62 tokens
INFO:     ::1:55485 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Can someone help? Thanks.

kinchahoy commented 4 months ago

I ran into this also. I think I know what the problem is.

The CMake config in llama.cpp is currently not optimizing for native architectures, as a workaround for an MoE issue (https://github.com/ggerganov/llama.cpp/pull/6716). You'll get much slower performance from a CMake build of llama.cpp right now. It took me a while to realize that llama-cpp-python uses llama.cpp's CMake build path, not Make.

An easy fix is to install llama-cpp-python from the repo, then edit extern/llama.cpp (where llama-cpp-python pulls in llama.cpp as a submodule) and modify its CMakeLists.txt to set(LLAMA_LLAMAFILE_DEFAULT ON).

I gained a ton of performance that way.

shelbywhite commented 4 months ago

An easy fix is to install llama-cpp-python from the repo, then edit extern/llama.cpp (where llama-cpp-python pulls in llama.cpp as a submodule) and modify its CMakeLists.txt to set(LLAMA_LLAMAFILE_DEFAULT ON).

@kinchahoy do you by chance know the tokens/sec performance before applying your fix and then after?

kinchahoy commented 4 months ago

I'm doing fairly complex things, but it brought something that was taking ~24 s to encode down to 11-12 s.