When CUDA graph is enabled, we still launch the merge states kernel even for short sequence lengths, which incurs unnecessary overhead.
This PR accelerates the merge states kernel for the case where there is nothing to merge (`num_index_sets=1`).
We could also write through to the target buffer directly for short sequence lengths, but I'm always lazily evaluated, so I'll leave that for a future PR (if necessary).
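
To illustrate why `num_index_sets=1` can be special-cased, here is a minimal pure-Python reference sketch of a log-sum-exp state merge with an early-out. It is not the kernel code from this PR: the function name `merge_states_reference`, the list-based layout, and the use of natural-log scores are all assumptions made for illustration.

```python
import math


def merge_states_reference(v, s):
    """Hypothetical reference merge (not the actual kernel).

    v: list of num_index_sets vectors, each a list of head_dim floats
    s: list of num_index_sets log-sum-exp scores (natural log assumed)
    Returns (v_merged, s_merged).
    """
    num_index_sets = len(s)
    if num_index_sets == 1:
        # Fast path: nothing to merge, so copy the single state through
        # without computing any exponentials or a normalizer.
        return list(v[0]), s[0]

    # General path: numerically stable weighted average of the states.
    s_max = max(s)
    weights = [math.exp(s_i - s_max) for s_i in s]
    denom = sum(weights)
    head_dim = len(v[0])
    v_merged = [
        sum(w * v_i[d] for w, v_i in zip(weights, v)) / denom
        for d in range(head_dim)
    ]
    s_merged = s_max + math.log(denom)
    return v_merged, s_merged
```

With a single index set the general path reduces to a weight of 1 and a normalizer of 1, so the early-out returns the same result while skipping the reduction work.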