flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

perf: split kv-cache for prefill/append kernels #310

Closed yzh119 closed 2 weeks ago

yzh119 commented 2 weeks ago

Duplicate of #75, but re-based on the main branch.

Note that to support CUDAGraph, we cannot make kv_chunk_size a function argument: arguments are passed by value and cannot change once captured by CUDAGraph. Instead, we pass kv_chunk_size through kv_chunk_size_ptr, a pointer to a global memory address that stores kv_chunk_size; its value can be set in the BeginForward functions.
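To illustrate the by-value vs. by-pointer difference, here is a minimal host-only C++ sketch. It is an analogy, not FlashInfer or CUDA code: a "graph" is modeled as a list of recorded closures, and `Graph`, `num_chunks`, `capture_by_value`, `capture_by_pointer`, and `replay` are hypothetical names made up for this example. Only `kv_chunk_size` and `kv_chunk_size_ptr` come from the PR.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for a captured graph: a list of recorded launches.
using Graph = std::vector<std::function<void()>>;

// Number of chunks a KV sequence of length kv_len is split into (ceil division).
inline int num_chunks(int kv_len, int kv_chunk_size) {
  return (kv_len + kv_chunk_size - 1) / kv_chunk_size;
}

// By value: kv_chunk_size is copied into the recorded launch at capture time,
// so later changes to the caller's variable are invisible on replay.
inline void capture_by_value(Graph& g, int kv_len, int kv_chunk_size, int* out) {
  g.push_back([=] { *out = num_chunks(kv_len, kv_chunk_size); });
}

// By pointer: only the address is recorded; the value is read at replay time,
// so it can be updated between replays without re-capturing.
inline void capture_by_pointer(Graph& g, int kv_len,
                               const int* kv_chunk_size_ptr, int* out) {
  g.push_back([=] { *out = num_chunks(kv_len, *kv_chunk_size_ptr); });
}

// Replay every recorded launch in order.
inline void replay(const Graph& g) {
  for (const auto& launch : g) launch();
}
```

In this sketch, changing the chunk size after capture only affects the launch recorded through the pointer. That mirrors why the value is stored in global memory and written in the BeginForward functions rather than passed as a kernel argument. Note that `kv_len` is also baked in by value here; that is a simplification of the sketch.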