Multi-Head Latency Attention
Open yzh119 opened 6 months ago
MLA (Multi-Head Latency Attention) was proposed in DeepSeek-v2 for efficient inference.
Hello @yzh119, I know about this; it involves merging the latent proj_in and proj_out into the Q/KV proj_in/proj_out.
Maybe I will take a look this weekend.
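For anyone curious what that merge looks like, here is a minimal PyTorch sketch of the weight-absorption idea (dimensions are small illustrative values, and all names are hypothetical, not flashinfer's API). Since q^T k = (W_UQ h)^T (W_UK c) = h^T (W_UQ^T W_UK) c, the key up-projection can be folded into the query projection, so attention can run directly on the cached latent c; symmetrically, the value up-projection can be folded into the output projection.

```python
import torch

# Small illustrative dimensions (hypothetical, not DeepSeek-v2's real sizes).
d_model, n_heads, d_latent, d_head = 256, 4, 64, 32

W_UQ = torch.randn(n_heads, d_head, d_model)   # per-head query up-projection
W_UK = torch.randn(n_heads, d_head, d_latent)  # per-head key up-projection (the latent proj_out)

# Absorb W_UK into the query path: q^T k = h^T (W_UQ^T W_UK) c.
W_UQ_absorbed = torch.einsum('hdm,hdl->hlm', W_UQ, W_UK)

h = torch.randn(d_model)    # hidden state of the current token
c = torch.randn(d_latent)   # one cached latent KV vector

# Naive path: materialize per-head q and k, then take the dot product.
q = torch.einsum('hdm,m->hd', W_UQ, h)
k = torch.einsum('hdl,l->hd', W_UK, c)
score_naive = torch.einsum('hd,hd->h', q, k)

# Absorbed path: project h into the latent space once; the "key" is just c.
q_lat = torch.einsum('hlm,m->hl', W_UQ_absorbed, h)
score_absorbed = torch.einsum('hl,l->h', q_lat, c)

assert torch.allclose(score_naive, score_absorbed, rtol=1e-3, atol=1e-3)
```

The payoff is that the KV cache only needs to hold the latent vectors, and no per-head K/V tensors are ever materialized at decode time.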
Btw, it is Multi-Head Latent Attention.
Thank you @jon-chuang!
Any updates for DeepSeek-v2?
Is MLA supported now? If it is supported, could you point out how to use it?
Hi, #551 is the first step to support MLA. MLA prefill will still need some time.
@tsu-bin As you say, #551 adds MLA decode support, but MLA prefill is not yet supported. Is there a way to combine MHA prefill with MLA decode to run DeepSeek-v2? Or is there any method that enables the current MLA decode to be used for DeepSeek-v2?
Hi @liangzelang, I'm afraid you still can't use the MHA prefill kernel for MLA prefill, even if you manually do the projection and concatenation of compressed_kv and k_pe to produce the KV data, because there is still one slight difference: RoPE is applied to only the 64-dim portion of the whole 192-dim head dimension. As discussed with @yzh119 in the conversation on #551, the roadmap may be: @yzh119 first uses CuTe to refactor the common prefill kernel, then we can implement the MLA prefill kernel; the current MHA decode kernel also needs to be updated to use tensorcore.
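To make that difference concrete, here is a rough sketch of how an MLA key head is assembled, with position encoding touching only the decoupled 64-dim slice. It assumes a standard interleaved RoPE; the k_nope/k_pe names and the 128+64 split follow the DeepSeek-v2 MLA layout, while everything else (helper names, values) is illustrative.

```python
import torch

def apply_rope(x, pos, theta=10000.0):
    # Standard interleaved rotary embedding over the last dim (assumed even).
    d = x.shape[-1]
    freqs = theta ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * freqs
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Illustrative MLA head layout: 128 "nope" dims (no position encoding)
# plus a decoupled 64-dim RoPE portion, 192 dims in total.
d_nope, d_rope = 128, 64
k_nope = torch.randn(d_nope)   # up-projected from compressed_kv
k_pe   = torch.randn(d_rope)   # the shared positional key
pos = 7                        # token position

# RoPE rotates only the 64-dim slice, not the full 192-dim head, which is
# why a stock MHA prefill kernel (RoPE over the whole head dim, or no RoPE
# at all) cannot be reused as-is even after manual projection/concatenation.
k = torch.cat([k_nope, apply_rope(k_pe, pos)])
assert k.shape == (d_nope + d_rope,)
```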