LostRuins / koboldcpp

A simple one-file way to run various GGML and GGUF models with KoboldAI's UI
https://github.com/lostruins/koboldcpp
GNU Affero General Public License v3.0

Flash attention slower #900

Closed: Azirine closed this issue 3 weeks ago

Azirine commented 3 weeks ago

Flash attention makes both prompt processing (pp) and token generation (tg) slower on koboldcpp, whereas on llama.cpp flash attention is faster. koboldcpp is also several times slower than llama.cpp overall, with or without flash attention.

python3.11 koboldcpp.py Mistral-7B-Instruct-v0.3-Q8_0.gguf --nommap --contextsize 2048 --benchmark

Benchmark Completed - v1.67 Results:

Backend: koboldcpp_default.so Layers: 0 Model: Mistral-7B-Instruct-v0.3-Q8_0 MaxCtx: 2048 GenAmount: 100

ProcessingTime: 20.77s ProcessingSpeed: 93.80T/s GenerationTime: 8.02s GenerationSpeed: 12.47T/s TotalTime: 28.79s Output: 11111

python3.11 koboldcpp.py Mistral-7B-Instruct-v0.3-Q8_0.gguf --nommap --contextsize 2048 --benchmark --flashattention

Benchmark Completed - v1.67 Results:

Backend: koboldcpp_default.so Layers: 0 Model: Mistral-7B-Instruct-v0.3-Q8_0 MaxCtx: 2048 GenAmount: 100

ProcessingTime: 29.63s ProcessingSpeed: 65.74T/s GenerationTime: 8.80s GenerationSpeed: 11.36T/s TotalTime: 38.44s Output: 11111

./llama-bench -m Mistral-7B-Instruct-v0.3-Q8_0.gguf -mmp 0 -p 2048 -n 100 -fa 0,1

llama.cpp speeds for comparison

| model         |     size | params | backend | ngl | fa | mmap |   test |           t/s |
| ------------- | -------: | -----: | ------- | --: | -: | ---: | -----: | ------------: |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | Metal   |  99 |  0 |    0 | pp2048 | 405.40 ± 0.60 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | Metal   |  99 |  0 |    0 |  tg100 |  23.23 ± 0.04 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | Metal   |  99 |  1 |    0 | pp2048 | 418.66 ± 1.85 |
| llama 7B Q8_0 | 7.17 GiB | 7.25 B | Metal   |  99 |  1 |    0 |  tg100 |  23.30 ± 0.20 |

System: macOS 14.5, M3 Max, low-power mode (to eliminate throttling fluctuations)

abc-nix commented 3 weeks ago

You haven't mentioned it, but I assume you built koboldcpp with LLAMA_METAL=1 (to enable Metal support).

I think that for koboldcpp on Metal you also need the --gpulayers parameter to offload layers to the GPU. Try with --gpulayers 99.
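
For reference, a rough sketch of that build-and-launch sequence on Apple Silicon (assuming the standard koboldcpp Makefile and the v1.67-era flag names; check the README for your version):

# build with Metal support enabled
make clean && make LLAMA_METAL=1

# run the benchmark with all layers offloaded to the GPU
python3.11 koboldcpp.py Mistral-7B-Instruct-v0.3-Q8_0.gguf --nommap --gpulayers 99 --contextsize 2048 --benchmark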

Also, a caveat on llama-bench: its tg100 test generates 100 tokens starting from an empty context, because it measures prompt processing and token generation separately. Its generation numbers will therefore always look faster than koboldcpp's benchmark. I prefer koboldcpp's numbers as more realistic, since there generation happens only after the full prompt has been processed, i.e. with a large context already in place.
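
If you want llama-bench to measure generation after a filled context, the way koboldcpp's benchmark does, recent llama.cpp builds also have a combined prompt-plus-generation test via -pg (I'm assuming your build includes that option; check ./llama-bench --help):

./llama-bench -m Mistral-7B-Instruct-v0.3-Q8_0.gguf -mmp 0 -pg 2048,100 -fa 0,1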

Azirine commented 3 weeks ago

> You haven't mentioned it, but I assume you built koboldcpp with LLAMA_METAL=1 (to enable Metal support).
>
> I think that for koboldcpp on Metal you also need the --gpulayers parameter to offload layers to the GPU. Try with --gpulayers 99.

I tried building with LLAMA_METAL=1 and running with --gpulayers 99, and the results are exactly the same.

> Also, a caveat on llama-bench: its tg100 test generates 100 tokens starting from an empty context, because it measures prompt processing and token generation separately. Its generation numbers will therefore always look faster than koboldcpp's benchmark. I prefer koboldcpp's numbers as more realistic, since there generation happens only after the full prompt has been processed, i.e. with a large context already in place.

That only affects tg, not pp. As the numbers show, pp with flash attention is more than 6 times faster on llama.cpp (418.66 vs 65.74 T/s).

Also, with a tiny 2K context, tg should not be that much slower than generating from an empty context, yet llama.cpp is again more than twice as fast (23.30 vs 11.36 T/s with flash attention).

abc-nix commented 3 weeks ago

OK. I can't help debug this, since I don't have an Apple device. What you can do is share the full output of a benchmark run,

python3.11 koboldcpp.py Mistral-7B-Instruct-v0.3-Q8_0.gguf --nommap --gpulayers 99 --contextsize 2048 --benchmark --flashattention

so that others on macOS can compare it with their own output.

Azirine commented 3 weeks ago

My mistake, I think something went wrong during my earlier build. After rebuilding, I am getting speeds comparable to llama.cpp.