flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

Support MLA (Multi-Head Latent Attention) in DeepSeek-v2 #237

Open yzh119 opened 6 months ago

yzh119 commented 6 months ago

MLA (Multi-Head Latency Attention) was proposed in DeepSeek-v2 for efficient inference.

jon-chuang commented 3 months ago

Hello @yzh119, I'm familiar with this: it involves merging the latent proj_in and proj_out into the Q/KV proj_in/out (see the sketch below).

Maybe I will try to take a look this weekend.
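
For context, the merging mentioned above is the "matrix absorption" trick from the DeepSeek-V2 paper: the per-head up-projection of the latent KV cache can be folded into the query (and, symmetrically, the value up-projection into the output projection), so attention runs directly on the compressed latent cache. Below is a minimal PyTorch sketch of the idea with illustrative shapes and hypothetical weight names; it is not FlashInfer's API.

```python
# Minimal sketch of MLA "matrix absorption" at decode time.
# Shapes loosely follow DeepSeek-V2 (128 no-RoPE dims + 64 RoPE dims per head,
# 512-dim latent KV cache); all names here are illustrative.
import torch

num_heads, d_nope, d_pe, d_latent = 8, 128, 64, 512

# Per-head up-projection weights for the latent KV cache.
W_uk = torch.randn(num_heads, d_nope, d_latent)   # k_nope = W_uk @ c_kv
W_uv = torch.randn(num_heads, d_nope, d_latent)   # v      = W_uv @ c_kv

# Decode-time query for one new token.
q_nope = torch.randn(num_heads, d_nope)
q_pe   = torch.randn(num_heads, d_pe)

# Cached latent KV for seq_len previous tokens (shared across heads) plus the RoPE part.
seq_len = 16
c_kv = torch.randn(seq_len, d_latent)
k_pe = torch.randn(seq_len, d_pe)

# Absorb W_uk into the query so scores are computed directly against the latent cache:
#   q_nope^T (W_uk c_kv) == (W_uk^T q_nope)^T c_kv
q_absorbed = torch.einsum("hnd,hn->hd", W_uk, q_nope)    # [heads, d_latent]
scores = torch.einsum("hd,sd->hs", q_absorbed, c_kv)     # no-RoPE part of the score
scores += torch.einsum("hp,sp->hs", q_pe, k_pe)          # RoPE part of the score
probs = torch.softmax(scores / (d_nope + d_pe) ** 0.5, dim=-1)

# Attend over the latent cache first, then apply W_uv
# (equivalently, W_uv can be folded into the output projection).
o_latent = torch.einsum("hs,sd->hd", probs, c_kv)        # [heads, d_latent]
o = torch.einsum("hnd,hd->hn", W_uv, o_latent)           # [heads, d_nope]
```

In this form the decode kernel only ever reads the shared latent cache plus the small RoPE slice per token, which is what makes MLA decode memory-efficient.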

jon-chuang commented 3 months ago

> Multi-Head Latency Attention

Btw, it is Multi-Head Latent Attention.

yzh119 commented 3 months ago

Thank you @jon-chuang !

halexan commented 2 months ago

Any updates on DeepSeek-v2 support?

jason-huang03 commented 1 month ago

Is MLA supported now? If it is supported, could you point out how to use it?

tsu-bin commented 3 weeks ago

> Is MLA supported now? If it is supported, could you point out how to use it?

Hi, #551 is the first step toward supporting MLA. MLA prefill will still need some time.

liangzelang commented 3 days ago

@tsu-bin As you said, #551 adds MLA decode support, but MLA prefill is not yet supported. Is there a way to use MHA prefill combined with MLA decode to run DeepSeek-v2? Or is there any method that enables the current MLA decode to be used with DeepSeek-v2?

tsu-bin commented 3 days ago

Hi @liangzelang, I'm afraid you still can't use the MHA prefill kernel to cover MLA prefill, even if you manually do the projection and concatenation from compressed_kv and k_pe to produce the KV data, because there is still one slight difference: RoPE is applied only to the 64-dim portion of the whole 192-dim head. As discussed with @yzh119 in the conversation on #551, I think the roadmap may be: @yzh119 first uses CuTe to refactor the common prefill kernel, then we can implement the MLA prefill kernel; the current MHA decode kernel also needs to be updated to use tensor cores.
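
To make the mismatch concrete, here is a minimal sketch assuming a generic MHA prefill kernel that applies RoPE over the full head dimension; the rope helper and shapes below are illustrative, not FlashInfer's API.

```python
# Why plain MHA prefill doesn't match MLA: RoPE touches only the 64-dim "k_pe"
# slice of each 192-dim head, while a standard MHA kernel with fused RoPE would
# rotate the full head dimension. Shapes and helpers are illustrative.
import torch

d_nope, d_rope = 128, 64          # 192-dim head = 128 no-RoPE dims + 64 RoPE dims
seq_len = 16
positions = torch.arange(seq_len)

def rope(x, pos):
    """Standard rotary embedding over the last dimension of x: [seq, dim]."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    angles = pos[:, None].float() * inv_freq[None, :]        # [seq, dim/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

k = torch.randn(seq_len, d_nope + d_rope)                     # one head, 192 dims

# What MLA actually needs: rotate only the trailing 64 dims (the k_pe part).
k_mla = torch.cat([k[:, :d_nope], rope(k[:, d_nope:], positions)], dim=-1)

# What a generic MHA prefill kernel with fused RoPE would do: rotate all 192 dims.
k_mha = rope(k, positions)

print(torch.allclose(k_mla, k_mha))   # False: the first 128 dims differ
```

The first 128 dims must stay unrotated, so a prefill kernel that treats the whole head uniformly produces different keys; MLA prefill needs a kernel that handles the no-RoPE and RoPE slices separately.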