You haven't mentioned it, but I understand you built koboldcpp with LLAMA_METAL=1 (to enable Metal support). I think that for koboldcpp on Metal you need to use the --gpulayers parameter. Try with --gpulayers 99.
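For reference, a minimal sketch of the build-and-run sequence being suggested, assuming the stock koboldcpp Makefile and the model file named later in this thread:

```sh
# Rebuild from the koboldcpp checkout with Metal support enabled.
make clean
make LLAMA_METAL=1

# 99 is simply "more layers than the model has", so every layer
# gets offloaded to the GPU.
python3.11 koboldcpp.py Mistral-7B-Instruct-v0.3-Q8_0.gguf --gpulayers 99
```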
Also, a warning on llama-bench: token generation is measured with a context of 100 tokens and no initial prompt (it measures prompt processing and token generation separately), so it will always be faster than koboldcpp's benchmark. I prefer koboldcpp's numbers as more realistic, since generation there occurs after prompt processing, on a large context.
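For a number closer to koboldcpp's (generation after a filled context), llama-bench also has a combined prompt-plus-generation test; a minimal sketch, assuming a llama.cpp build recent enough to have the -pg option:

```sh
# Default run: prompt processing and generation are measured separately,
# so tg starts from an empty context and looks faster than in practice.
./llama-bench -m Mistral-7B-Instruct-v0.3-Q8_0.gguf

# -pg measures tg *after* processing a prompt: here 1948 prompt tokens
# plus 100 generated, mirroring koboldcpp's 2048-context benchmark.
./llama-bench -m Mistral-7B-Instruct-v0.3-Q8_0.gguf -pg 1948,100
```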
I tried building with LLAMA_METAL=1, with --gpulayers 99, and the results are exactly the same.
That caveat only affects tg, not pp, and as the numbers show, pp is more than 6 times faster on llama.cpp with fa.
Also, with a tiny 2k context, tg should not be that much slower than at zero context, yet it is still more than 2 times faster on llama.cpp.
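To reproduce that comparison in one run, llama-bench can sweep the flash-attention setting itself; a minimal sketch, assuming a llama.cpp build that has the -fa flag (the model path is the one used in this thread):

```sh
# Run the same pp/tg measurements with flash attention off and on;
# llama-bench treats comma-separated values as a parameter sweep.
./llama-bench -m Mistral-7B-Instruct-v0.3-Q8_0.gguf -fa 0,1 -p 2048 -n 100
```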
OK, I cannot help debug as I lack an Apple device. What you can do is share the full output of a benchmark test
python3.11 koboldcpp.py Mistral-7B-Instruct-v0.3-Q8_0.gguf --nommap --gpulayers 99 --contextsize 2048 --benchmark --flashattention
so that others can compare it with their macOS output.
My mistake, I think something went wrong during the original build. After rebuilding, I managed to get speeds comparable to llama.cpp.
Flash attention makes pp and tg slower on koboldcpp, unlike llama.cpp, where flash attention is faster. Speeds are also many times slower than llama.cpp, with or without flash attention.
Benchmark Completed - v1.67 Results:
Backend: koboldcpp_default.so Layers: 0 Model: Mistral-7B-Instruct-v0.3-Q8_0 MaxCtx: 2048 GenAmount: 100
ProcessingTime: 20.77s ProcessingSpeed: 93.80T/s GenerationTime: 8.02s GenerationSpeed: 12.47T/s TotalTime: 28.79s Output: 11111
Benchmark Completed - v1.67 Results:
Backend: koboldcpp_default.so Layers: 0 Model: Mistral-7B-Instruct-v0.3-Q8_0 MaxCtx: 2048 GenAmount: 100
ProcessingTime: 29.63s ProcessingSpeed: 65.74T/s GenerationTime: 8.80s GenerationSpeed: 11.36T/s TotalTime: 38.44s Output: 11111
llama.cpp speeds for comparison
System: macOS 14.5, M3 Max, low-power mode (to eliminate throttling fluctuations)