ggerganov / llama.cpp

LLM inference in C/C++
MIT License

Bug: Quantized kv cache caused performance drop on Apple silicon #8918

Closed njsyw1997 closed 1 month ago

njsyw1997 commented 2 months ago

What happened?

Observed a performance drop for the quantized KV cache with flash attention: both pp and tg are roughly 1/3 slower when flash attention is enabled and either K or V is quantized. Here are some benchmark results on iPhone 14 (A15) with the Metal backend. I observed similar results on my M1 Mac with llama-bench as well (a repro command sketch follows the tables).

FP16 KV cache:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 114.81 ± 12.26 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.72 ± 0.04 |

FP16 KV cache with flash attention:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 126.93 ± 7.30 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.95 ± 0.04 |

q8_0 K, FP16 V:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 117.71 ± 10.46 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 9.81 ± 0.04 |

q8_0 K, FP16 V with flash attention:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 85.58 ± 1.93 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.77 ± 0.01 |

FP16 K, q8_0 V with flash attention:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 77.47 ± 3.22 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.69 ± 0.00 |

q8_0 K, q8_0 V with flash attention:

| model | size | params | backend | test | t/s |
| --- | --- | --- | --- | --- | --- |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | pp 512 | 80.99 ± 1.72 |
| phi3 3B Q4_K - Medium | 2.23 GiB | 3.82 B | Metal | tg 128 | 6.78 ± 0.10 |
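For reference, a llama-bench invocation along these lines should reproduce the q8_0 K / q8_0 V + flash-attention configuration on an M-series Mac. The model path is a placeholder, and the flag names are as I understand them from llama-bench's help output, so please double-check against your build with `./llama-bench --help`:

```sh
# Hypothetical model path; -fa enables flash attention, -ctk/-ctv set the
# K/V cache types, and -p/-n match the pp 512 / tg 128 tests above.
./llama-bench -m ./phi-3-mini-4k-instruct-q4_k_m.gguf \
    -fa 1 -ctk q8_0 -ctv q8_0 -p 512 -n 128
```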

Name and Version

./llama.swiftui, version 3488 (75af08c47)

What operating system are you seeing the problem on?

Mac

Relevant log output

No response

ggerganov commented 2 months ago

Metal does not implement non-F16 flash-attention kernels yet, so the operation will be executed on the CPU as a fallback:

https://github.com/ggerganov/llama.cpp/blob/ebd541a5705b6f7a4ce67824d1c2d4fc790f1770/ggml/src/ggml-metal.m#L791-L801
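For readers following along, here is a rough sketch of what that support check amounts to. This is a paraphrase, not the code at the link; the enum and function names are made up for illustration:

```c
// Paraphrased sketch: the Metal backend's supports-op decision for
// GGML_OP_FLASH_ATTN_EXT reports anything other than an all-F16 KV cache
// as unsupported, which triggers the CPU fallback path for that op.
#include <stdbool.h>
#include <stdio.h>

typedef enum { CACHE_F16, CACHE_Q8_0, CACHE_Q4_0 } cache_type; // illustrative subset

static bool metal_supports_flash_attn(cache_type k, cache_type v) {
    return k == CACHE_F16 && v == CACHE_F16;
}

int main(void) {
    printf("f16 K / f16 V  -> %d (Metal kernel)\n", metal_supports_flash_attn(CACHE_F16,  CACHE_F16));
    printf("q8_0 K / f16 V -> %d (CPU fallback)\n", metal_supports_flash_attn(CACHE_Q8_0, CACHE_F16));
    return 0;
}
```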

njsyw1997 commented 2 months ago

Can we simply dequantize the KV cache and send it to the FP16 kernel? That might be faster than falling back to the CPU. Or do we need a new flash_attn_ext_q8_t kernel to get high performance?

ggerganov commented 2 months ago

Dequantizing + F16 kernel won't be performant enough. Need to add dedicated kernels that work with the quantized data directly.
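To make "work with quantized data directly" concrete, below is a minimal standalone sketch of the per-block work such a kernel would fold into its attention loop, assuming ggml's Q8_0 layout (blocks of 32 int8 values plus one scale). The struct and function names are illustrative, not ggml's:

```c
// Minimal sketch of on-the-fly Q8_0 dequantization, the per-block step a
// quantization-aware attention kernel would perform while streaming K/V,
// instead of reading from a pre-dequantized F16 copy of the cache.
#include <stdint.h>
#include <stdio.h>

#define QK8_0 32  // elements per Q8_0 block (as in ggml)

typedef struct {
    float  d;          // per-block scale (f16 in ggml, f32 here for simplicity)
    int8_t qs[QK8_0];  // quantized values
} block_q8_0_sketch;

// Dequantize one block: y[i] = d * qs[i].
static void dequant_block_q8_0(const block_q8_0_sketch *b, float *y) {
    for (int i = 0; i < QK8_0; ++i) {
        y[i] = b->d * (float) b->qs[i];
    }
}

int main(void) {
    block_q8_0_sketch b = { .d = 0.05f };
    for (int i = 0; i < QK8_0; ++i) b.qs[i] = (int8_t)(i - 16);

    float y[QK8_0];
    dequant_block_q8_0(&b, y);
    printf("y[0] = %.3f, y[31] = %.3f\n", y[0], y[31]);  // -0.800, 0.750
    return 0;
}
```

A dedicated Metal kernel would presumably do this in registers or threadgroup memory while streaming K/V blocks, avoiding both the CPU fallback and a full F16 materialization of the cache.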

github-actions[bot] commented 1 month ago

This issue was closed because it has been inactive for 14 days since being marked as stale.