yzh119 opened this issue 12 months ago
Hi @yzh119, I'm wondering if the flashinfer kernels can be implemented over vLLM's paged KV cache. Does the roadmap item "Support general page table layout" address this?
The vLLM paged cache is described in https://github.com/mlc-ai/mlc-llm/blob/main/mlc_llm/relax_model/llama_batched_vllm.py#L307-L322. It's much simpler than the Relax one.
This would be helpful for comparing the flashinfer and vLLM decode attention kernels in an apples-to-apples manner. I'm also interested in batched prefill with paged KV cache support.
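For readers unfamiliar with the layout, here is a minimal sketch of the vLLM-style paged mapping in plain PyTorch. The shapes and names are illustrative, not vLLM's actual internals: each sequence owns a row in a dense, padded block table, and token position `t` lives in block `block_tables[seq][t // block_size]` at offset `t % block_size`.

```python
import torch

num_blocks, block_size, num_heads, head_dim = 128, 16, 8, 64

# One pool of fixed-size blocks shared by all sequences.
k_cache = torch.zeros(num_blocks, block_size, num_heads, head_dim)
v_cache = torch.zeros(num_blocks, block_size, num_heads, head_dim)

# Dense per-sequence block table, padded to the longest sequence.
# block_tables[i][j] = physical block holding that sequence's tokens
# j*block_size .. j*block_size + block_size - 1.
block_tables = torch.tensor([[3, 7, 1], [5, 2, 0]])  # [batch, max_blocks_per_seq]
context_lens = torch.tensor([40, 19])                # tokens filled per sequence

def kv_at(seq: int, pos: int):
    """Fetch the cached K/V vectors for token `pos` of sequence `seq`."""
    block = block_tables[seq, pos // block_size]
    off = pos % block_size
    return k_cache[block, off], v_cache[block, off]
```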
Hi @masahi, thanks for bringing this up. Yes, we will have a unified interface that's compatible with both the vLLM layout and our current page table design.
Batch prefill with paged kv is already supported: https://github.com/flashinfer-ai/flashinfer/blob/11364ca4c3ce651dd544efff3225906fe15c5b8a/include/flashinfer/prefill.cuh#L944-L1107
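One way to see why a unified interface is feasible: FlashInfer's page table is a CSR-style (indptr, indices, last_page_len) triple, so a padded vLLM block table flattens into it in a few lines. A sketch under that assumption (the helper below is mine, not a FlashInfer API):

```python
import torch

def block_tables_to_csr(block_tables: torch.Tensor,
                        context_lens: torch.Tensor,
                        block_size: int):
    """Flatten a padded vLLM-style block table into CSR-style
    (kv_indptr, kv_indices, kv_last_page_len) arrays."""
    indptr = [0]
    indices = []
    last_page_len = []
    for table, ctx_len in zip(block_tables.tolist(), context_lens.tolist()):
        num_pages = (ctx_len + block_size - 1) // block_size
        indices.extend(table[:num_pages])       # drop the padding entries
        indptr.append(indptr[-1] + num_pages)
        # Number of valid slots in the sequence's final page.
        last_page_len.append(ctx_len - (num_pages - 1) * block_size)
    return (torch.tensor(indptr, dtype=torch.int32),
            torch.tensor(indices, dtype=torch.int32),
            torch.tensor(last_page_len, dtype=torch.int32))
```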
> Yes, we will have a unified interface that's compatible with both the vLLM layout and our current page table design.
Great! cc @vinx13 @sunggg
> Batch prefill with paged kv is already supported
Yes, I was aware of that, and it's what interests me the most right now. I need an attention kernel that does both of the following: process multiple query tokens per sequence, and read from a paged KV cache. FlashAttention and vLLM each support only one of these.
Is BatchPrefillWithPagedKVCacheKernel supposed to be useful for speculative decoding?
I also have another use case for such a kernel: parallel sampling (generating multiple sequences for one prompt). More context in https://github.com/vllm-project/vllm/pull/12#issuecomment-1841952270
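To pin down the semantics being asked for, here is a naive reference in plain PyTorch of what such a kernel computes: several query tokens per sequence (prefill or speculative drafts) attending causally over a paged KV cache. This is a semantic sketch with made-up argument names, not FlashInfer's implementation:

```python
import math
import torch

def ref_batch_prefill_paged(q, qo_indptr, k_cache, v_cache,
                            kv_indptr, kv_indices, kv_last_page_len):
    # q: [total_q_tokens, num_heads, head_dim], packed over all sequences.
    # k_cache / v_cache: [num_pages, page_size, num_heads, head_dim].
    # qo_indptr / kv_indptr: CSR offsets per sequence; kv_indices: page ids.
    page_size = k_cache.shape[1]
    out = torch.empty_like(q)
    for i in range(len(qo_indptr) - 1):
        # Gather this sequence's pages into one contiguous [kv_len, H, D] run.
        pages = kv_indices[int(kv_indptr[i]):int(kv_indptr[i + 1])]
        kv_len = (len(pages) - 1) * page_size + int(kv_last_page_len[i])
        k = k_cache[pages].flatten(0, 1)[:kv_len]
        v = v_cache[pages].flatten(0, 1)[:kv_len]
        qi = q[int(qo_indptr[i]):int(qo_indptr[i + 1])]  # [q_len, H, D]
        q_len = qi.shape[0]
        scores = torch.einsum("qhd,khd->hqk", qi, k) / math.sqrt(q.shape[-1])
        # Causal mask: the j-th query token sits at absolute position
        # kv_len - q_len + j and may attend to everything at or before it.
        qpos = kv_len - q_len + torch.arange(q_len)
        kpos = torch.arange(kv_len)
        scores.masked_fill_(kpos[None, None, :] > qpos[None, :, None],
                            float("-inf"))
        out[int(qo_indptr[i]):int(qo_indptr[i + 1])] = torch.einsum(
            "hqk,khd->qhd", torch.softmax(scores, dim=-1), v)
    return out
```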
One of the use cases would be speculative decoding (maybe we need another mask input).
And yes, there are other interesting use cases; I'll showcase one of them in the next few days.
Hey @yzh119, great work! We are interested in using FlashInfer for tree decoding. Do you have a plan to support custom attention masks?
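For context on why tree decoding needs more than a causal mask: each draft token should attend only to itself and its ancestors in the token tree. A small sketch of building such a mask from parent pointers (the interface is hypothetical; the thread had not settled on a mask input at this point):

```python
import torch

def tree_attention_mask(parents):
    """parents[i] = index of token i's parent in the draft tree (-1 for root).
    Returns a boolean [n, n] mask: True where token i may attend to token j,
    i.e. j is i itself or one of its ancestors."""
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:
            mask[i, j] = True
            j = parents[j]
    return mask

# Example: a root with two branches, one branch two tokens deep.
print(tree_attention_mask([-1, 0, 0, 1]).int())
```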
Expected release date: Mar 15th, 2024
General
MLC-Serving
Atom
Required operators for the paper "Atom: Low-bit Quantization for Efficient and Accurate LLM Serving":
Punica
Required operators for the paper "Punica: Multi-Tenant LoRA Serving":
Quest
Required operators for Quest:
Other hardware backends