Closed by AgrawalAmey 1 week ago
Prefill kernels use tensor cores while decode kernels use CUDA cores; that's the only difference.
The prefill kernels use more registers and shared memory than the decode kernels, so they can afford fewer pipeline stages. There is also some extra per-iteration overhead in the prefill kernels from loading the query from shared memory into registers (pinning the query in registers for small query lengths is an optimization I should do but unfortunately haven't).
But tensor cores have higher throughput, and GQA has higher operational intensity than MHA, so using tensor cores might be beneficial in some cases, despite the overheads I mentioned above. So it's case by case.
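To make the operational-intensity point concrete, here is a rough back-of-the-envelope sketch (not from the flashinfer codebase; the function name, FLOP counting, and default `head_dim` are my own assumptions). In decode attention, each KV token is loaded once but reused by every query head in its GQA group, so arithmetic intensity scales roughly with the group size:

```python
def decode_attention_intensity(group_size, head_dim=128, dtype_bytes=2):
    """Approximate FLOPs per byte for single-token decode attention.

    group_size: number of query heads sharing one KV head
                (1 for MHA, >1 for GQA).
    """
    # FLOPs per KV token: each query head in the group does a QK dot
    # product and a PV accumulation, ~2 * head_dim MACs = 4 * head_dim FLOPs.
    flops = group_size * 4 * head_dim
    # Bytes per KV token: one K and one V vector, loaded once and
    # shared by the whole group.
    bytes_moved = 2 * head_dim * dtype_bytes
    return flops / bytes_moved

print(decode_attention_intensity(1))   # MHA:  1.0 FLOP/byte
print(decode_attention_intensity(8))   # GQA8: 8.0 FLOP/byte
```

Under this rough model, MHA decode is firmly memory-bound (so CUDA cores are enough to saturate bandwidth), while a GQA group of 8 has about 8x the arithmetic per byte of KV traffic, which is where tensor cores can start to pay off.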
I see, thanks a lot for the detailed comment.
A gentle reminder: v0.0.5 has some silly bugs in split-k and may result in unstable performance measurements; please use v0.0.6 instead: https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.0.6
With the latest change to support kv-parallelism in the prefill kernel, is there still a need for separate prefill and decode kernels? I have been running some tests, and it looks like the prefill kernel is almost always faster at decode than the decode kernel.