flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

Can BatchDecodeWithPaddedKVCache be used in cascade inference? #250

Open joey12300 opened 1 month ago

joey12300 commented 1 month ago

Currently, cascade inference is implemented as a two-level scheme: SinglePrefillWithKVCache (or SingleDecodeWithKVCache) handles attention over the shared prefix, and BatchPrefillWithPagedKVCacheWrapper (or BatchDecodeWithPagedKVCacheWrapper) handles attention over each request's unique KV. Can the batch prefill or decode attention kernels that do not use a paged KV cache also be used in cascade inference? (See the sketch below for why the two levels compose.)
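For context, the two levels compose because each kernel returns not only the attention output but also the log-sum-exp (LSE) of the attention scores, and two partial attention states over disjoint KV segments can be merged exactly. Here is a minimal sketch of that merge in plain PyTorch; the function name and shapes are illustrative, not flashinfer's API (flashinfer provides this as `merge_state`):

```python
import torch

def merge_attention_states(v_a, s_a, v_b, s_b):
    """Merge two partial attention states (v, lse) computed over
    disjoint KV segments into the state over their union.

    v_a, v_b: [batch, num_heads, head_dim] partial attention outputs
    s_a, s_b: [batch, num_heads] log-sum-exp of the attention scores
    """
    s_max = torch.maximum(s_a, s_b)
    w_a = torch.exp(s_a - s_max)  # un-normalized weight of segment a
    w_b = torch.exp(s_b - s_max)  # un-normalized weight of segment b
    # Weighted average of the two partial outputs.
    v = (v_a * w_a.unsqueeze(-1) + v_b * w_b.unsqueeze(-1)) / (w_a + w_b).unsqueeze(-1)
    # LSE over the union of the two segments.
    s = s_max + torch.log(w_a + w_b)
    return v, s
```

Since the merge only needs (output, LSE) pairs, any attention kernel that returns an LSE, paged or not, can in principle serve as either level.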

yzh119 commented 1 month ago

Yes, I think so. There are test cases covering this: https://github.com/flashinfer-ai/flashinfer/blob/7e9cc7ff42ca283c317061a877305d09a395fad2/python/tests/test_shared_prefix_kernels.py#L33-L53
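Adapted from those tests, the composition with a padded (non-paged) batch decode kernel would look roughly like the sketch below. It assumes the older flashinfer Python API (`single_prefill_with_kv_cache_return_lse`, `batch_decode_with_padded_kv_cache_return_lse`, `merge_state`); exact function names, signatures, and the LSE base may differ across flashinfer versions, so treat this as an illustration rather than the library's canonical usage:

```python
import torch
import flashinfer

num_heads, head_dim = 32, 128
batch_size, shared_len, unique_len = 8, 1024, 256
device, dtype = "cuda", torch.float16

# One decode query per request.
q = torch.randn(batch_size, num_heads, head_dim, device=device, dtype=dtype)

# KV cache of the shared prefix (identical for all requests).
k_shared = torch.randn(shared_len, num_heads, head_dim, device=device, dtype=dtype)
v_shared = torch.randn_like(k_shared)

# Per-request unique KV, stored as a padded (non-paged) tensor; here all
# requests happen to have the same unique length, so no masking is needed.
k_unique = torch.randn(batch_size, unique_len, num_heads, head_dim, device=device, dtype=dtype)
v_unique = torch.randn_like(k_unique)

# Level 1: treat the batch of decode queries as a length-batch_size
# "prefill" against the shared prefix (causal=False), keeping the LSE.
v_s, s_s = flashinfer.single_prefill_with_kv_cache_return_lse(
    q, k_shared, v_shared, causal=False
)

# Level 2: each query attends to its own padded KV cache, keeping the LSE.
v_u, s_u = flashinfer.batch_decode_with_padded_kv_cache_return_lse(
    q, k_unique, v_unique, kv_layout="NHD"
)

# Merge the two partial states into attention over shared + unique KV.
o, _ = flashinfer.merge_state(v_s, s_s, v_u, s_u)
```

The only structural difference from the paged version is level 2: as long as the padded batch kernel returns an LSE alongside its output, `merge_state` can combine it with the shared-prefix state exactly as the linked tests do.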