JamesLim-sy opened 3 weeks ago
Is there any plan to use SM90 features for it?
ref https://github.com/flashinfer-ai/flashinfer/pull/507#issue-2547125600
Hi @JamesLim-sy , if I understand correctly, you mean using some SMs for decode and other SMs within the same cluster for merging states, so that they can use distributed shared memory, is that right? I think it's doable, but after the new scheduler lands, the number of states to be merged can be further reduced, so I'm not sure how much advantage this optimization would bring.
@yzh119 Yes. In my profiling, the `merge_stats` kernel accounts for around 10% of the total time of the decode_attention operation (i.e., `batch_decode` plus `merge_stats`). Also, I think introducing clusters may shrink the memory allocation for `lse` in some cases.
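For context on why the merge step exists at all: each partial decode kernel produces an attention output together with its log-sum-exp (`lse`), and the merge kernel combines those partial states into the result of attention over the full KV cache. A minimal NumPy sketch of that math (the helper names here are hypothetical, not flashinfer's actual API):

```python
import numpy as np

def attention_state(q, k, v):
    """Single-query attention over one KV chunk.
    q: (d,), k/v: (n, d). Returns (output, log-sum-exp).
    Hypothetical helper for illustration, not flashinfer's API."""
    s = k @ q                       # raw attention scores, shape (n,)
    lse = np.log(np.exp(s).sum())   # log-sum-exp of this chunk's scores
    o = np.exp(s - lse) @ v         # softmax-weighted sum of values
    return o, lse

def merge_state(o1, lse1, o2, lse2):
    """Merge two partial attention states into one, rescaling each
    partial output by its share of the combined softmax mass."""
    lse = np.logaddexp(lse1, lse2)  # combined log-sum-exp
    o = np.exp(lse1 - lse) * o1 + np.exp(lse2 - lse) * o2
    return o, lse
```

Because the merge is just this rescale-and-add, attention over a KV cache split into chunks and then merged is numerically identical to attention over the whole cache, which is what makes split-KV decode plus a separate merge kernel correct. It also shows why reducing the number of partial states (as the new scheduler does) directly shrinks the merge cost and the `lse` storage.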
I noticed that Hopper's cluster feature may offer a chance to optimize the performance of batch_decode by merging `VariableLengthMergeStates` with `BatchDecodeWithPagedKVCacheKernel`. Is there any plan to use SM90 features for it?