Closed ibsidorenko closed 3 weeks ago
Thanks for your suggestions. Sure, I think we can definitely support it, and using fp8 for kv-cache and fp16 for q sounds reasonable to me.
I'll split DTypeIn into DTypeQ and DTypeKV in the kernel implementations; the Python APIs don't have to change.
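For illustration only, the split of DTypeIn into DTypeQ and DTypeKV mentioned above might look roughly like the host-side C++ sketch below. This is not FlashInfer's kernel code: `Narrow8` and `qk_dot` are hypothetical names, and `Narrow8` is a simple fixed-point stand-in rather than a real fp8 format. It only shows the query and KV element types becoming independent template parameters, with accumulation in float.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 8-bit storage type standing in for an fp8 KV-cache element.
// It is a fixed-point value in steps of 1/16, NOT a real fp8 format; it only
// illustrates that the KV type can differ from the query type.
struct Narrow8 {
  std::int8_t raw;
  explicit operator float() const { return static_cast<float>(raw) / 16.0f; }
};

// Query and KV element types are independent template parameters
// (a single DTypeIn would previously have covered both).
template <typename DTypeQ, typename DTypeKV>
float qk_dot(const std::vector<DTypeQ>& q, const std::vector<DTypeKV>& k) {
  float acc = 0.0f;  // accumulate in float regardless of the input types
  for (std::size_t i = 0; i < q.size() && i < k.size(); ++i) {
    acc += static_cast<float>(q[i]) * static_cast<float>(k[i]);
  }
  return acc;
}
```

Because the caller-facing function deduces both types from its arguments, callers do not need to change when the KV storage type does, which is the same reason the Python APIs can stay as they are.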
Seconding this - I was actually thinking of submitting a PR myself. @yzh119 let me know if you need any help on this (from what I can tell, it should be quite straightforward).
Semi-related, can we expect fp8 support for prefill any time soon? How complicated would it be to add that?
> let me know if you need any help on this (from what I can tell, it should be quite straightforward).
Sounds good, I would really appreciate your help!
> can we expect fp8 support for prefill any time soon?
Yes, we are at the last step of handling transposed ldmatrix for fp8 (for the V matrix). It should be available soon :)
Ok, let me see if I can get a PR going this week!
Hi, All! This is just a question of whether there are such plans or not...
Right now, the FlashInfer lib requires Q (query) and KV (kv-cache) to have the same dtype. Just an example from the code: `q` and `paged_kv` have the same `DTypeIn`.
Are there any plans to support different dtypes for KV-cache and Q (query)? My personal interest is `fp8` for kv-cache and `fp16` for query.
Thank you in advance! cc @yzh119