ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Quantized kv cache caused performance drop on Apple silicon #8918

Closed njsyw1997 closed 1 month ago

njsyw1997 commented 2 months ago

What happened?

Observed a performance drop for the quantized KV cache with flash attention: both pp and tg are roughly 1/3 slower when flash attention is enabled and either K or V is quantized. Here are some benchmark results on iPhone 14 (A15) with the Metal backend. I observed similar results on my M1 Mac with llama-bench as well (a repro command sketch follows the tables).

FP16 KV cache:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 114.81 ± 12.26 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.72 ± 0.04 |

FP16 KV cache with flash attention:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 126.93 ± 7.30 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.95 ± 0.04 |

q8_0 K, FP16 V:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 117.71 ± 10.46 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.81 ± 0.04 |

q8_0 K, FP16 V with flash attention:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 85.58 ± 1.93 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.77 ± 0.01 |

FP16 K, q8_0 V with flash attention:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 77.47 ± 3.22 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.69 ± 0.00 |

q8_0 K, q8_0 V with flash attention:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 80.99 ± 1.72 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.78 ± 0.10 |
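For reference, a llama-bench invocation along these lines should reproduce the q8_0 K / q8_0 V + flash-attention configuration on an M-series Mac. The model path is a placeholder, and the flag names are as I understand them from llama-bench's help output, so please double-check against your build with `./llama-bench --help`:

```sh
# Hypothetical model path; -fa enables flash attention, -ctk/-ctv set the
# K/V cache types, and -p/-n match the pp 512 / tg 128 tests above.
./llama-bench -m ./phi-3-mini-4k-instruct-q4_k_m.gguf \
    -fa 1 -ctk q8_0 -ctv q8_0 -p 512 -n 128
```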

Name and Version

./llama.swiftui, version 3488 (75af08c47)

What operating system are you seeing the problem on?

Mac

Relevant log output

No response

ggerganov commented 2 months ago

Metal does not implement non-F16 flash-attention kernels yet, so the operation will be executed on the CPU as a fallback:

https://github.com/ggerganov/llama.cpp/blob/ebd541a5705b6f7a4ce67824d1c2d4fc790f1770/ggml/src/ggml-metal.m#L791-L801
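For readers following along, here is a rough sketch of what that support check amounts to. This is a paraphrase, not the code at the link; the enum and function names are made up for illustration:

```c
// Paraphrased sketch: the Metal backend's supports-op decision for
// GGML_OP_FLASH_ATTN_EXT reports anything other than an all-F16 KV cache
// as unsupported, which triggers the CPU fallback path for that op.
#include <stdbool.h>
#include <stdio.h>

typedef enum { CACHE_F16, CACHE_Q8_0, CACHE_Q4_0 } cache_type; // illustrative subset

static bool metal_supports_flash_attn(cache_type k, cache_type v) {
    return k == CACHE_F16 && v == CACHE_F16;
}

int main(void) {
    printf("f16 K / f16 V  -> %d (Metal kernel)\n", metal_supports_flash_attn(CACHE_F16,  CACHE_F16));
    printf("q8_0 K / f16 V -> %d (CPU fallback)\n", metal_supports_flash_attn(CACHE_Q8_0, CACHE_F16));
    return 0;
}
```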

njsyw1997 commented 2 months ago

Can we simply dequantize the KV cache and send it to the FP16 kernel? That might be faster than falling back to the CPU. Or do we need a new flash_attn_ext_q8_t kernel to get high performance?

ggerganov commented 2 months ago

Dequantizing + F16 kernel won't be performant enough. Need to add dedicated kernels that work with the quantized data directly.
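To make "work with quantized data directly" concrete, below is a minimal standalone sketch of the per-block work such a kernel would fold into its attention loop, assuming ggml's Q8_0 layout (blocks of 32 int8 values plus one scale). The struct and function names are illustrative, not ggml's:

```c
// Minimal sketch of on-the-fly Q8_0 dequantization, the per-block step a
// quantization-aware attention kernel would perform while streaming K/V,
// instead of reading from a pre-dequantized F16 copy of the cache.
#include <stdint.h>
#include <stdio.h>

#define QK8_0 32  // elements per Q8_0 block (as in ggml)

typedef struct {
    float  d;          // per-block scale (f16 in ggml, f32 here for simplicity)
    int8_t qs[QK8_0];  // quantized values
} block_q8_0_sketch;

// Dequantize one block: y[i] = d * qs[i].
static void dequant_block_q8_0(const block_q8_0_sketch *b, float *y) {
    for (int i = 0; i < QK8_0; ++i) {
        y[i] = b->d * (float) b->qs[i];
    }
}

int main(void) {
    block_q8_0_sketch b = { .d = 0.05f };
    for (int i = 0; i < QK8_0; ++i) b.qs[i] = (int8_t)(i - 16);

    float y[QK8_0];
    dequant_block_q8_0(&b, y);
    printf("y[0] = %.3f, y[31] = %.3f\n", y[0], y[31]);  // -0.800, 0.750
    return 0;
}
```

A dedicated Metal kernel would presumably do this in registers or threadgroup memory while streaming K/V blocks, avoiding both the CPU fallback and a full F16 materialization of the cache.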

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.