flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

bugfix: fix cudagraph-compatible prefill/decode apis #281

Closed · yzh119 closed this 1 month ago

yzh119 commented 1 month ago

In CUDA graph mode, the length of the indptr array should be an upper bound on batch_size + 1 (i.e., sized for the maximum batch size), not the exact batch_size + 1 of the current batch.
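The reason is that CUDA graphs capture fixed tensor shapes and addresses, so buffers passed to the captured kernels must keep the same size and location across replays; the actual indptr values for a smaller batch are written into a prefix of the pre-allocated buffer. Below is a minimal illustrative sketch (plain PyTorch, no FlashInfer calls) of this buffer-sizing pattern; the buffer names and the `build_indptr` helper are hypothetical and only demonstrate the sizing rule described above.

```python
import torch

MAX_BATCH_SIZE = 64  # upper bound chosen at graph-capture time (assumption)

# Pre-allocate the indptr buffer once, sized for the *maximum* batch size.
# Its length is MAX_BATCH_SIZE + 1, an upper bound on batch_size + 1.
paged_kv_indptr_buffer = torch.zeros(
    MAX_BATCH_SIZE + 1, dtype=torch.int32, device="cuda"
)

def build_indptr(seq_page_counts: torch.Tensor) -> None:
    """Hypothetical helper: write indptr for the current batch into the
    fixed-size buffer, leaving the tail untouched so the buffer's shape
    and address stay stable for CUDA graph replay."""
    batch_size = seq_page_counts.numel()
    assert batch_size <= MAX_BATCH_SIZE, "batch exceeds the captured upper bound"
    # indptr[0] = 0, indptr[i] = cumulative number of pages of requests 0..i-1
    paged_kv_indptr_buffer[0] = 0
    paged_kv_indptr_buffer[1 : batch_size + 1] = torch.cumsum(
        seq_page_counts.to(torch.int32), dim=0
    )

# Example: a batch of 3 requests occupying 2, 5, and 1 KV-cache pages.
build_indptr(torch.tensor([2, 5, 1], device="cuda"))
print(paged_kv_indptr_buffer[:4])  # tensor([0, 2, 7, 8], ...)
```

With this layout, only the first batch_size + 1 entries are meaningful on any given step, while the buffer itself never changes size, which is what makes the prefill/decode kernels replayable under a captured CUDA graph.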