Presently, the sequence lengths encoded in `qo_indptr` determine kernel parameters that are supposed to be frozen for CUDA graph capture. This even includes `split_kv`. In particular, `total_num_tiles_q` depends on the contents of `qo_indptr`, not only on its shape: https://github.com/flashinfer-ai/flashinfer/blob/9cba9fbd571f217161e8fb0f6b3aa2feabbb8b12/include/flashinfer/attention/scheduler.cuh#L475
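To make the dependency concrete, here is a minimal sketch (the tile size of 128 and the helper name are illustrative assumptions, not the scheduler's actual constants): two `qo_indptr` arrays with identical shape yield different tile counts, which is exactly what breaks the frozen-parameter requirement.

```python
import math

# Illustrative tile size; the real scheduler picks CTA tile sizes per
# architecture, so treat this constant as an assumption.
CTA_TILE_Q = 128

def total_num_tiles_q(qo_indptr):
    """Tile count the scheduler derives from per-request query lengths."""
    tiles = 0
    for i in range(len(qo_indptr) - 1):
        qo_len = qo_indptr[i + 1] - qo_indptr[i]  # request i's query length
        tiles += math.ceil(qo_len / CTA_TILE_Q)
    return tiles

# Same shape (batch size 2 either way), different contents:
print(total_num_tiles_q([0, 1024, 2048]))  # 16 tiles
print(total_num_tiles_q([0, 1, 2]))        # 2 tiles
```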
The desire is to fix the batch size (and, implicitly, the number of elements in `qo_indptr`) while varying the actual sequence lengths within a fixed total number of tokens that `qo_indptr` points to. For example, the same CUDA graph should be able to process two prefill requests whose lengths sum to 2048, and then another set of prefill requests whose lengths sum to anything less than 2048.
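A hypothetical sketch of that usage pattern from the PyTorch side (`attention_kernel` is a stand-in for the real batch-prefill call, not FlashInfer API): every device buffer keeps a fixed shape sized for the 2048-token bound, and each replay only overwrites buffer contents.

```python
import torch

MAX_TOTAL_TOKENS = 2048  # upper bound baked into the capture
NUM_HEADS, HEAD_DIM = 32, 128

# Fixed-shape device buffers; only their *contents* change between replays.
qo_indptr = torch.tensor([0, 1024, 2048], dtype=torch.int32, device="cuda")
q = torch.randn(MAX_TOTAL_TOKENS, NUM_HEADS, HEAD_DIM,
                dtype=torch.float16, device="cuda")
out = torch.empty_like(q)

def attention_kernel(q, qo_indptr, out):
    out.copy_(q)  # placeholder for the real batch-prefill kernel

# Capture once, sized for the worst case.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    attention_kernel(q, qo_indptr, out)

# Replay with two shorter requests (300 + 700 tokens <= 2048): overwrite the
# captured buffers in place, never reallocate them, then replay the graph.
qo_indptr.copy_(torch.tensor([0, 300, 1000], dtype=torch.int32, device="cuda"))
g.replay()
```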
When CUDA graphs are enabled, these parameters should be tied to an upper bound, and the actual values should be passed in dynamically.
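One way to derive such an upper bound (a sketch under the same assumed tile size of 128; not the scheduler's actual formula): with at most `S` total tokens split across a fixed batch of `b` requests, each request contributes at most one partial tile, so the tile count is bounded regardless of how the tokens are split.

```python
import math

def max_total_num_tiles_q(batch_size, max_total_tokens, tile=128):
    # sum(ceil(len_i / tile)) < sum(len_i) / tile + batch_size, so an integer
    # tile count never exceeds ceil(max_total_tokens / tile) + batch_size - 1.
    return math.ceil(max_total_tokens / tile) + batch_size - 1

# Capture the graph with a grid of this size; at replay time the kernel reads
# the real qo_indptr from device memory and surplus tiles exit early.
print(max_total_num_tiles_q(2, 2048))  # 17 tiles cover every admissible split
```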