
FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

Any plans to optimize the decode kernel for NV-Hopper? #576

Open JamesLim-sy opened 3 weeks ago

JamesLim-sy commented 3 weeks ago

I noticed that the Hopper thread-block cluster feature might offer an opportunity to optimize the performance of batch_decode by fusing VariableLengthMergeStates with BatchDecodeWithPagedKVCacheKernel. Is there any plan to use SM90 features for this?
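For context, here is a minimal sketch of the SM90 feature in question: a thread-block cluster in which one block produces a partial result in its own shared memory and a peer block in the same cluster reads it through distributed shared memory. This is an illustrative toy kernel assuming CUDA 12+ and sm_90; the kernel name, sizes, and layout are made up and are not FlashInfer's actual decode kernel.

```cuda
// Hypothetical sketch (not FlashInfer's kernel): block 0 acts as a "decode"
// block and block 1 as a "merge" block within one 2-block cluster.
#include <cooperative_groups.h>
#include <cuda_runtime.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) fused_decode_merge(float* out) {
  __shared__ float partial_state[128];          // per-block partial attention state
  cg::cluster_group cluster = cg::this_cluster();
  unsigned int rank = cluster.block_rank();

  if (rank == 0) {
    // "decode" block: write a partial result into its own shared memory
    partial_state[threadIdx.x] = static_cast<float>(threadIdx.x);
  }
  cluster.sync();                               // make block 0's smem visible to block 1

  if (rank == 1) {
    // "merge" block: read block 0's shared memory via distributed shared memory
    float* peer = cluster.map_shared_rank(partial_state, 0);
    out[threadIdx.x] = peer[threadIdx.x];
  }
  cluster.sync();                               // keep peer smem alive until reads finish
}

int main() {
  float* out;
  cudaMalloc(&out, 128 * sizeof(float));
  fused_decode_merge<<<2, 128>>>(out);          // grid size must be a multiple of the cluster size
  cudaDeviceSynchronize();
  cudaFree(out);
  return 0;
}
```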

zhyncs commented 3 weeks ago

> Is there any plan to use SM90 features for this?

ref https://github.com/flashinfer-ai/flashinfer/pull/507#issue-2547125600

yzh119 commented 3 weeks ago

Hi @JamesLim-sy, if I understand correctly, you mean using some SMs in a cluster for decode and other SMs in the same cluster for merging states, so that distributed shared memory can be used, is that right? I think it's doable, but after the new scheduler lands, the number of states to be merged will be further reduced, so I'm not sure how much benefit this optimization would bring.
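For reference, the per-head merge that VariableLengthMergeStates performs is the standard log-sum-exp renormalization of partial attention outputs. Below is a minimal scalar sketch of that math; the function name, scalar loop, and layout are illustrative assumptions, not FlashInfer's implementation.

```cuda
// Merge two partial attention outputs o1, o2 (with log-sum-exp lse1, lse2)
// into one output and one merged lse; head_dim is the per-head vector length.
#include <cmath>
#include <cstdio>

void merge_state(const float* o1, float lse1, const float* o2, float lse2,
                 float* o_out, float* lse_out, int head_dim) {
  float lse_max = std::fmax(lse1, lse2);
  float w1 = std::exp(lse1 - lse_max);          // renormalized weight of state 1
  float w2 = std::exp(lse2 - lse_max);          // renormalized weight of state 2
  float denom = w1 + w2;
  for (int i = 0; i < head_dim; ++i) {
    o_out[i] = (w1 * o1[i] + w2 * o2[i]) / denom;
  }
  *lse_out = lse_max + std::log(denom);         // merged log-sum-exp
}

int main() {
  float o1[4] = {1, 2, 3, 4}, o2[4] = {4, 3, 2, 1}, o[4], lse;
  merge_state(o1, 0.5f, o2, 1.5f, o, &lse, 4);
  std::printf("merged lse = %f, o[0] = %f\n", lse, o[0]);
  return 0;
}
```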

JamesLim-sy commented 2 weeks ago

> Hi @JamesLim-sy, if I understand correctly, you mean using some SMs in a cluster for decode and other SMs in the same cluster for merging states, so that distributed shared memory can be used, is that right? I think it's doable, but after the new scheduler lands, the number of states to be merged will be further reduced, so I'm not sure how much benefit this optimization would bring.

@yzh119 Yes. In my profiling, the merge_states kernel accounted for around 10% of the total time of the decode attention operation (batch_decode plus merge_states). Also, I think introducing clusters may shrink the memory allocation for lse in some cases.
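As a rough illustration of why fewer materialized partial states matters: under split-KV decode, one partial output and one lse value are produced per (request, kv-partition, head) for the merge kernel to consume. The sizes and layout below are back-of-the-envelope assumptions, not FlashInfer's actual workspace layout.

```cuda
// Illustrative estimate of the global-memory footprint of partial states that
// a cluster-fused decode + merge could avoid materializing (assumed sizes).
#include <cstdio>

int main() {
  int batch_size = 64, num_qo_heads = 32, head_dim = 128;
  int num_partitions = 8;  // kv-chunks per request produced by the scheduler (assumption)
  size_t partial_o   = (size_t)batch_size * num_partitions * num_qo_heads * head_dim * sizeof(float);
  size_t partial_lse = (size_t)batch_size * num_partitions * num_qo_heads * sizeof(float);
  std::printf("partial O: %.1f MiB, partial LSE: %.1f KiB\n",
              partial_o / (1024.0 * 1024.0), partial_lse / 1024.0);
  return 0;
}
```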