flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

[FEAT REQ][CUDA GRAPH] Allow explicit control flag to force enable/disable split KV #397

Open · AgrawalAmey opened this issue 1 month ago

AgrawalAmey commented 1 month ago

Hello @yzh119,

Currently, we use two independent API calls for prefill and decode in a mixed-batch setting, which makes defining a CUDA graph layout considerably harder. Ideally, if we could do both the prefill and decode attention computation in the prefill kernel, it would considerably simplify the CUDA graph layout. The main barrier to doing this right now is that we have no explicit control over when split-KV is used. For mixed batches, split-KV appears to be beneficial in most cases, but it gets disabled for certain batch compositions, which significantly hurts latency. Would it be possible to add an optional override knob for this? Thanks!
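
For concreteness, a minimal sketch of the two-call pattern described above, using FlashInfer's Python batch wrappers (the wrapper names exist in the library, though exact signatures may differ across versions). The `force_split_kv` parameter in the last comment is the hypothetical override knob being requested, not an existing API:

```python
import torch
import flashinfer

# Shared workspace buffer for the attention wrappers.
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda:0")

# Today, a mixed batch requires two independent wrappers and two kernel
# launches, both of which must be captured in the CUDA graph:
prefill_wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")
decode_wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# (Planning calls with the batch's indptr/indices metadata omitted.)
# prefill_out = prefill_wrapper.forward(q_prefill, paged_kv_cache)
# decode_out = decode_wrapper.forward(q_decode, paged_kv_cache)

# The request: run decode tokens through the prefill kernel as well, so a
# single call covers the whole mixed batch, with an explicit knob such as
# (hypothetical parameter, not in the current API):
# prefill_wrapper.begin_forward(..., force_split_kv=True)
```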

yzh119 commented 1 month ago

Hi @AgrawalAmey, I actually found that our scheduler could be further optimized so that there is no wave quantization, and I'm working on a refactor for that. After this change, I suppose split-KV will always be enabled.
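
As a side note on the wave-quantization point: the concern is that when the number of thread blocks a kernel launches is not a multiple of what the GPU can run concurrently, the final "wave" of blocks runs underutilized. A back-of-the-envelope illustration (the numbers are illustrative, not FlashInfer's actual scheduler):

```python
import math

def wave_utilization(num_blocks: int, blocks_per_wave: int) -> float:
    """Fraction of launched GPU capacity doing useful work across all waves."""
    num_waves = math.ceil(num_blocks / blocks_per_wave)
    return num_blocks / (num_waves * blocks_per_wave)

# e.g. a GPU that can run 132 thread blocks concurrently:
print(wave_utilization(132, 132))  # 1.00 -> perfectly packed
print(wave_utilization(133, 132))  # ~0.50 -> one extra block costs a whole wave
```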

AgrawalAmey commented 1 month ago

Oh that is great! Looking forward to it, thank you @yzh119!