LostRuins / koboldcpp

Run GGUF models easily with a KoboldAI UI. One File. Zero Install.
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Substantially slower than llama.cpp on Mac m1 ultra #695

Closed: davideuler closed this issue 6 months ago

davideuler commented 6 months ago

Running with koboldcpp

When I run koboldcpp with the miqu-70b GGUF model, it is super slow, about 1 token/s. The command I use is:

python3.10 ~/workspace/koboldcpp/koboldcpp.py miqu-1-70b.q5_K_M.gguf 8501

Even if I specify the GPU layers parameter, it is the same:

python3.10 ~/workspace/koboldcpp/koboldcpp.py --gpulayers 35 --usemlock --threads 16 miqu-1-70b.q5_K_M.gguf 8501

I checked the GPU usage with sudo asitop and found that GPU power usage is almost zero: koboldcpp is running inference on the CPU of my Mac M1 Ultra, with a very long wait time (about 30 seconds) and slow inference speed (about 1 token/s).


Running with llama.cpp

When I run the same model with llama.cpp, it is fast, with only about 2-3 seconds of wait time. The inference speed is near 5 tokens/s, and most of the power usage goes to the GPU.

~/workspace/llama.cpp/main -t 10 -np 2 -ngl 140 -m ./miqu-1-70b.q5_K_M.gguf --color -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1 -p "hello world in cpp, keep it simple"

The output:

llama_print_timings:        load time =    2612.47 ms
llama_print_timings:      sample time =       3.85 ms /    40 runs   (    0.10 ms per token, 10381.52 tokens per second)
llama_print_timings: prompt eval time =    1187.30 ms /    10 tokens (  118.73 ms per token,     8.42 tokens per second)
llama_print_timings:        eval time =    7726.59 ms /    39 runs   (  198.12 ms per token,     5.05 tokens per second)
llama_print_timings:       total time =    8928.49 ms /    49 tokens
ggml_metal_free: deallocating
Log end
LostRuins commented 6 months ago

You need to build it with LLAMA_METAL=1, and you also need to specify GPU layers to offload. Did you do that?
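For reference, a rebuild along these lines should enable the Metal backend on Apple Silicon (a sketch, assuming the koboldcpp Makefile accepts the same LLAMA_METAL flag as llama.cpp):

cd ~/workspace/koboldcpp
make clean && make LLAMA_METAL=1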

davideuler commented 6 months ago

> You need to build it with LLAMA_METAL=1, and you also need to specify GPU layers to offload. Did you do that?

Thanks, I rebuilt koboldcpp with LLAMA_METAL=1 and started the service with GPU layers. Now it runs on the Metal GPU and is super fast.

python3.10 /Users/david/workspace/koboldcpp/koboldcpp.py --gpulayers 80 --threads 8 miqu-1-70b.q5_K_M.gguf 8501