perf: split kv-cache for prefill/append kernels

Duplicate of #75, but re-based on the main branch.

Note that to support CUDAGraph, kv_chunk_size cannot be a kernel function argument: arguments are passed by value, so the value is frozen once the launch is captured by CUDAGraph. Instead, we pass kv_chunk_size_ptr, a pointer to a global memory address that stores kv_chunk_size. The value at that address can be set in the BeginForward functions.
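To illustrate why a pointer helps, here is a minimal host-only C++ sketch. It is an analogy, not the code in this PR: `ToyGraph`, `num_chunks`, `capture_by_value`, and `capture_by_pointer` are hypothetical names, and the "graph" is just a list of closures standing in for captured kernel launches. It assumes only what is stated above: a by-value argument is fixed at capture, while a pointer argument fixes the address and leaves the stored value free to change.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a kernel: reports how many KV chunks it would
// process for a given sequence length and chunk size (ceiling division).
inline uint32_t num_chunks(uint32_t kv_len, uint32_t kv_chunk_size) {
  assert(kv_chunk_size > 0);
  return (kv_len + kv_chunk_size - 1) / kv_chunk_size;
}

// Toy "graph": records launches once and replays them unchanged, mimicking
// how a captured graph replays nodes with the arguments recorded at capture.
struct ToyGraph {
  std::vector<std::function<uint32_t()>> nodes;

  std::vector<uint32_t> replay() const {
    std::vector<uint32_t> results;
    results.reserve(nodes.size());
    for (const auto& node : nodes) results.push_back(node());
    return results;
  }
};

// By value: kv_chunk_size is copied into the node at capture time, so later
// changes to the caller's variable are invisible on replay.
inline void capture_by_value(ToyGraph& g, uint32_t kv_len,
                             uint32_t kv_chunk_size) {
  g.nodes.push_back([=] { return num_chunks(kv_len, kv_chunk_size); });
}

// By pointer: only the address is copied into the node. The value behind it
// is read on every replay, so it can be updated between replays.
inline void capture_by_pointer(ToyGraph& g, uint32_t kv_len,
                               const uint32_t* kv_chunk_size_ptr) {
  assert(kv_chunk_size_ptr != nullptr);
  g.nodes.push_back([=] { return num_chunks(kv_len, *kv_chunk_size_ptr); });
}
```

If both nodes are captured with a chunk size of 256 for a length of 1024 and the stored value is then changed to 128, replaying the by-value node still yields 4 chunks, while the by-pointer node yields 8. In this sketch the caller must keep the pointed-to storage alive for as long as the graph is replayed.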

flashinfer-ai / flashinfer

perf: split kv-cache for prefill/append kernels #310