Closed. njsyw1997 closed this issue 1 month ago.
Metal does not implement non-F16 flash-attention kernels yet, so the operation will be executed on the CPU as a fallback.
Can we simply dequantize the KV cache and send it to the FP16 kernel? That might be faster than falling back to the CPU. Or do we need a new flash_attn_ext_q8_t kernel to get high performance?
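For concreteness, here is a minimal sketch (plain C, illustrative only, not llama.cpp code) of what such an upfront dequantization pass over a Q8_0 KV cache would look like before handing the data to the F16 path. The block layout follows ggml's Q8_0 format (32 int8 values per block plus one per-block scale; ggml stores the scale as F16, shown as float here for simplicity):

```c
#include <stddef.h>
#include <stdint.h>

#define QK8_0 32

// Q8_0 block: 32 quantized values sharing one scale.
typedef struct {
    float  d;            // block scale (illustrative; ggml stores this as F16)
    int8_t qs[QK8_0];    // quantized values
} block_q8_0;

// Dequantize an entire Q8_0 buffer into a dense float buffer.
// A fallback like the one proposed above would then copy/convert this
// into the F16 tensors the flash-attention kernel expects -- an extra
// full pass over the whole KV cache plus a temporary buffer.
static void dequantize_q8_0(const block_q8_0 *x, float *y, size_t n_blocks) {
    for (size_t i = 0; i < n_blocks; ++i) {
        for (int j = 0; j < QK8_0; ++j) {
            y[i * QK8_0 + j] = x[i].qs[j] * x[i].d;
        }
    }
}
```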
Dequantizing + the F16 kernel won't be performant enough. We need to add dedicated kernels that work with the quantized data directly.
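To illustrate the difference, here is a hedged sketch of the fused approach a dedicated kernel would take: read the Q8_0 blocks directly and fold the per-block scale into the attention dot product, so no dequantized copy of the KV cache is ever materialized. This is a plain C stand-in for what would actually be a Metal kernel; the function name and types are illustrative:

```c
#include <stddef.h>
#include <stdint.h>

#define QK8_0 32

typedef struct {
    float  d;            // block scale (ggml stores this as F16)
    int8_t qs[QK8_0];    // quantized values
} block_q8_0;

// Dot product between a float query row and a Q8_0-quantized key row.
// The int8 values are accumulated per block and the scale is applied
// once per block -- the access pattern a dedicated flash-attention
// kernel for quantized KV data would implement on the GPU.
static float vec_dot_q8_0(const float *q, const block_q8_0 *k, size_t n_blocks) {
    float sum = 0.0f;
    for (size_t i = 0; i < n_blocks; ++i) {
        float block_sum = 0.0f;
        for (int j = 0; j < QK8_0; ++j) {
            block_sum += q[i * QK8_0 + j] * (float) k[i].qs[j];
        }
        sum += block_sum * k[i].d;   // apply the block scale once
    }
    return sum;
}
```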
This issue was closed because it has been inactive for 14 days since being marked as stale.
What happened?
Name and Version
./llama.swiftui, version 3488 (75af08c47)
What operating system are you seeing the problem on?
Mac
Relevant log output
No response