Allow the cascade kernels to be executed using varying sequence lengths

flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving

https://flashinfer.ai

Apache License 2.0


Status: closed by nandor 4 days ago

nandor commented 6 days ago

With this change, the cascade kernels accept a dynamic sequence length, so the number of tokens can vary between executions under CUDA graphs.
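Why a dynamic length matters: replaying a captured CUDA graph relaunches its kernels with the grid and arguments recorded at capture time, so a token count passed by value would be frozen. The sketch below assumes the kernel instead sizes its grid for a maximum token budget and reads the live count from a buffer on every execution. It is written in plain Python rather than CUDA so it runs anywhere. `merge_states_kernel`, `capture`, `MAX_TOKENS` and `seq_len_buf` are hypothetical names and are not FlashInfer's API. The "cascade" step is assumed here to be the standard log-sum-exp merge of two partial attention states.

```python
import math

# Hypothetical sketch, not FlashInfer's API: merge two partial attention states
# over a buffer sized for MAX_TOKENS rows. The launch shape is fixed, as inside a
# captured CUDA graph; the live token count is read from a one-element buffer at
# execution time, standing in for a device pointer.
MAX_TOKENS = 8


def merge_states_kernel(v_a, s_a, v_b, s_b, v_out, s_out, seq_len_buf):
    seq_len = seq_len_buf[0]  # read on every execution, not baked in at capture
    for row in range(MAX_TOKENS):  # one "block" per row; the grid never changes
        if row >= seq_len:
            continue  # padding rows beyond the live length do no work
        s_max = max(s_a[row], s_b[row])
        w_a = math.exp(s_a[row] - s_max)
        w_b = math.exp(s_b[row] - s_max)
        # Merged log-sum-exp and the softmax-weighted mix of the two partial outputs.
        s_out[row] = s_max + math.log(w_a + w_b)
        v_out[row] = [(w_a * x + w_b * y) / (w_a + w_b)
                      for x, y in zip(v_a[row], v_b[row])]


def capture(kernel, *buffers):
    """Mimic graph capture: the launch is frozen with fixed buffer references."""
    return lambda: kernel(*buffers)


# Usage: capture once, then replay with different live lengths.
head_dim = 2
v_a = [[1.0, 2.0] for _ in range(MAX_TOKENS)]
v_b = [[3.0, 4.0] for _ in range(MAX_TOKENS)]
s_a = [0.0] * MAX_TOKENS
s_b = [0.0] * MAX_TOKENS
v_out = [[0.0] * head_dim for _ in range(MAX_TOKENS)]
s_out = [0.0] * MAX_TOKENS
seq_len_buf = [3]

graph = capture(merge_states_kernel, v_a, s_a, v_b, s_b, v_out, s_out, seq_len_buf)
graph()             # merges rows 0..2 only
seq_len_buf[0] = 5  # update the length in place; no re-capture needed
graph()             # now merges rows 0..4
print(v_out[4], v_out[5])  # [2.0, 3.0] [0.0, 0.0]
```

In this sketch, rows past the live length are left untouched, so a caller should only read the first `seq_len` rows of the output buffers.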

This is the first step towards implementing CUDA graph support for arbitrary qo_indptr contents, as tracked by #626.