flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

CUDA Graph support for prefill kernels with varying `qo_indptr` #626


nandor commented 2 days ago

Presently, the sequence lengths derived from `qo_indptr` determine kernel parameters that are supposed to be frozen for CUDA graph capture; this even includes `split_kv`.

In particular, `total_num_tiles_q` depends on the contents of `qo_indptr`, not only on its shape: https://github.com/flashinfer-ai/flashinfer/blob/9cba9fbd571f217161e8fb0f6b3aa2feabbb8b12/include/flashinfer/attention/scheduler.cuh#L475.
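
To illustrate the dependency, here is a minimal sketch (not the library's code), assuming each request contributes roughly `ceil(qo_len / cta_tile_q)` tiles:

```python
import math

def total_num_tiles_q(qo_indptr, cta_tile_q):
    # Each request contributes ceil(qo_len / cta_tile_q) tiles, so the total
    # depends on the lengths stored in qo_indptr, not just its size.
    return sum(
        math.ceil((qo_indptr[i + 1] - qo_indptr[i]) / cta_tile_q)
        for i in range(len(qo_indptr) - 1)
    )

# Same qo_indptr shape, same total token count, different tile counts:
print(total_num_tiles_q([0, 1024, 2048], cta_tile_q=128))  # 8 + 8  = 16
print(total_num_tiles_q([0, 1, 2048], cta_tile_q=128))     # 1 + 16 = 17
```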

The desire is to fix the batch size (and implicitly the number of elements in `qo_indptr`) while varying the actual sequence lengths within a fixed total token budget. For example, the same CUDA graph should be able to process two prefill requests whose lengths sum to 2048, and then another set of prefill requests whose lengths sum to anything less than 2048.
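
As a sketch of the desired usage (the `run_prefill` call below is a placeholder, not the actual FlashInfer API): capture the graph once with `qo_indptr` filled to the upper bound, then overwrite the same buffer in place and replay with shorter sequences.

```python
import torch

num_heads, head_dim = 32, 128
batch_size, max_total_tokens = 2, 2048

# Fixed-size buffers shared between capture and replay.
qo_indptr = torch.zeros(batch_size + 1, dtype=torch.int32, device="cuda")
q = torch.randn(max_total_tokens, num_heads, head_dim,
                dtype=torch.float16, device="cuda")
out = torch.empty_like(q)

def run_prefill(q, qo_indptr, out):
    # Placeholder for the batch prefill attention call; the real call would go
    # through the FlashInfer prefill wrapper.
    pass

# Capture with qo_indptr at its upper bound: two requests summing to 2048 tokens.
qo_indptr.copy_(torch.tensor([0, 1024, 2048], dtype=torch.int32))
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    run_prefill(q, qo_indptr, out)

# Replay with shorter requests: only the *contents* of qo_indptr change,
# not its shape.
qo_indptr.copy_(torch.tensor([0, 300, 700], dtype=torch.int32))
graph.replay()
```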

When CUDA graphs are enabled, these parameters should be planned against an upper bound, and the actual values should be passed in dynamically at replay time.
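
One possible shape for this, sketched in Python rather than in the terms of `scheduler.cuh` (the array names and the sentinel convention below are illustrative assumptions, not the existing data structures): size the per-tile work arrays for the worst case at plan time, refill them from the actual `qo_indptr` before each replay, and pad unused slots with a sentinel the captured kernel can skip.

```python
import math
import torch

CTA_TILE_Q = 128
MAX_BATCH_SIZE = 2
MAX_TOTAL_TOKENS = 2048
# Worst-case tile count: each request can add at most one partial tile.
MAX_NUM_TILES_Q = MAX_TOTAL_TOKENS // CTA_TILE_Q + MAX_BATCH_SIZE

# Fixed-size work arrays captured into the graph; the kernel would always be
# launched with MAX_NUM_TILES_Q blocks.
request_indices = torch.full((MAX_NUM_TILES_Q,), -1, dtype=torch.int32, device="cuda")
qo_tile_indices = torch.full((MAX_NUM_TILES_Q,), -1, dtype=torch.int32, device="cuda")

def refill_work_arrays(qo_indptr):
    """Recompute the per-tile work assignment for the actual sequence lengths;
    slots beyond the actual tile count keep the -1 sentinel so the captured
    kernel can exit early for them."""
    req_idx, tile_idx = [], []
    for req in range(len(qo_indptr) - 1):
        qo_len = qo_indptr[req + 1] - qo_indptr[req]
        for t in range(math.ceil(qo_len / CTA_TILE_Q)):
            req_idx.append(req)
            tile_idx.append(t)
    request_indices.fill_(-1)
    qo_tile_indices.fill_(-1)
    n = len(req_idx)
    request_indices[:n] = torch.tensor(req_idx, dtype=torch.int32)
    qo_tile_indices[:n] = torch.tensor(tile_idx, dtype=torch.int32)

refill_work_arrays([0, 300, 700])  # called before each graph replay
```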