perf: slight optimization on merge states

flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving

https://flashinfer.ai

Apache License 2.0


perf: slight optimization on merge states #313

Open yzh119 opened 2 weeks ago

yzh119 commented 2 weeks ago

When CUDA graph is enabled, we still launch the merge states kernels even for short sequence lengths, which incurs unnecessary overhead.

This PR accelerates the merge states kernel in the case where there is nothing to merge (num_index_sets=1).
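For readers unfamiliar with the operation, here is a rough pure-Python reference of what merging attention states means and where a `num_index_sets == 1` shortcut could sit. It is only a sketch under assumptions: the function name `merge_states_reference`, the list-based layout, and the natural-log convention for the log-sum-exp values are all hypothetical and are not taken from the actual kernel in this PR.

```python
import math


def merge_states_reference(v, s):
    """Merge partial attention states (hypothetical reference, not the real kernel).

    v: list of num_index_sets vectors, each a list of floats (partial outputs).
    s: list of num_index_sets floats, the log-sum-exp of each partial state
       (natural log is assumed here for simplicity).
    Returns (merged_v, merged_s).
    """
    num_index_sets = len(v)

    # Fast path: with a single index set there is nothing to merge,
    # so the state is copied through without any exp/log work.
    if num_index_sets == 1:
        return list(v[0]), s[0]

    # General path: numerically stable softmax-style weighting of the states.
    s_max = max(s)
    weights = [math.exp(si - s_max) for si in s]
    denom = sum(weights)

    head_dim = len(v[0])
    merged_v = [
        sum(w * vi[d] for w, vi in zip(weights, v)) / denom
        for d in range(head_dim)
    ]
    merged_s = s_max + math.log(denom)
    return merged_v, merged_s
```

In the single-set case the result is identical to the input, which is why skipping the reduction is safe there.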

We could actually write through to the target buffer directly for small sequence lengths, but I'm always lazily evaluated, so I'll leave that for a future PR (if necessary).

zhyncs commented 2 weeks ago

The commit msg is interesting :P

yzh119 commented 2 weeks ago

never mind :)