flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

Chunked prefill support #392

Open · Juelianqvq opened 1 month ago

Juelianqvq commented 1 month ago

Any plan on this?

yzh119 commented 1 month ago

Hi, I don't see why we would need special support for chunked prefill. The paper that proposes chunked prefill (Sarathi-Serve) uses flashinfer, so I suppose we already support this feature?
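
For reference, a chunked-prefill step maps onto the batch prefill wrapper roughly like this (a sketch against the Python API around the time of this thread; argument names such as `begin_forward`/`forward` may differ between releases): the query indptr covers only the new chunk, while the paged-KV metadata describes the full prefix already written to the cache, and `causal=True` shifts the mask so the last chunk token attends to every cached token.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 32, 128, 16

# One request: 512 tokens already prefilled, now processing a 256-token chunk.
prefix_len, chunk_len = 512, 256
kv_len = prefix_len + chunk_len                      # cache holds prefix + new chunk
num_pages = (kv_len + page_size - 1) // page_size

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace, "NHD")

# Query indptr spans only the new chunk; paged-KV indptr spans all kv_len tokens.
qo_indptr = torch.tensor([0, chunk_len], dtype=torch.int32, device="cuda")
kv_indptr = torch.tensor([0, num_pages], dtype=torch.int32, device="cuda")
kv_indices = torch.arange(num_pages, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.tensor([kv_len - (num_pages - 1) * page_size],
                                dtype=torch.int32, device="cuda")

wrapper.begin_forward(qo_indptr, kv_indptr, kv_indices, kv_last_page_len,
                      num_qo_heads, num_kv_heads, head_dim, page_size)

q = torch.randn(chunk_len, num_qo_heads, head_dim,
                dtype=torch.float16, device="cuda")
# Paged KV cache in NHD layout: (num_pages, 2, page_size, num_kv_heads, head_dim).
kv_cache = torch.randn(num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")

# With causal=True, query position i in the chunk attends to the first
# prefix_len + i + 1 KV tokens, which is exactly chunked-prefill semantics.
out = wrapper.forward(q, kv_cache, causal=True)
wrapper.end_forward()
```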

Juelianqvq commented 1 month ago

> Hi, I don't see why we would need special support for chunked prefill. The paper that proposes chunked prefill (Sarathi-Serve) already uses flashinfer, so I suppose we already support this feature?

Thanks for your kind reply. Unfortunately, I've seen badly misaligned results when using chunked prefill + flashinfer in vLLM. Further investigation is needed.

yzh119 commented 1 month ago

Thanks for letting me know. There might be some misconfiguration on the vLLM side for chunked prefill, and I'd love to help fix the issue. Can you point me to the implementation of chunked prefill in vLLM?

cc @liuxiaoxuanPKU as Lily might provide some useful insights.

Juelianqvq commented 1 month ago

> Thanks for letting me know. There might be some misconfiguration on the vLLM side for chunked prefill, and I'd love to help fix the issue. Can you point me to the implementation of chunked prefill in vLLM?
>
> cc @LiuXiaoxuanPKU as Lily might provide some useful insights.

https://github.com/vllm-project/vllm/blob/main/vllm/attention/backends/flashinfer.py#L197 and https://github.com/vllm-project/vllm/blob/main/vllm/worker/model_runner.py (search for `if self.backend_name == "flashinfer":`).
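
For comparison, this is roughly what one head of a chunked-prefill step should compute (a plain PyTorch sketch with illustrative names, not code from either repo); diffing the backend output against something like this should show whether the mismatch comes from the causal-mask offset or from the KV-cache contents.

```python
import torch

def ref_chunked_prefill_attention(q_chunk, k_all, v_all):
    """Single-head reference: q_chunk holds the new chunk's queries
    (chunk_len, head_dim); k_all/v_all hold prefix + chunk keys/values
    (kv_len, head_dim). Query i may attend to the first
    (kv_len - chunk_len) + i + 1 KV positions."""
    chunk_len, head_dim = q_chunk.shape
    kv_len = k_all.shape[0]
    scores = (q_chunk @ k_all.T) / head_dim ** 0.5           # (chunk_len, kv_len)
    # Causal mask shifted by the number of already-cached prefix tokens.
    q_pos = torch.arange(chunk_len).unsqueeze(1) + (kv_len - chunk_len)
    kv_pos = torch.arange(kv_len).unsqueeze(0)
    scores = scores.masked_fill(kv_pos > q_pos, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v_all             # (chunk_len, head_dim)
```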

jon-chuang commented 3 weeks ago

Hello @Juelianqvq, there is an ongoing effort to unify vLLM's use of flash attention, as it currently calls the prefill and decode kernels separately. I suspect a similar situation applies to FlashInfer; I will investigate.

https://github.com/vllm-project/vllm/pull/6052