flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

support versatile gqa size for batch prefill #223

Closed xuzhenqi closed 2 months ago

xuzhenqi commented 2 months ago

This merge request adds support for versatile GQA group sizes in the batch prefill kernels. Group sizes 5, 6, and 7 are padded to group size 8 when loading q from global memory into shared memory, and the padded groups are discarded when writing o back to global memory.
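
Below is a minimal CPU-side sketch of the pad-then-discard idea (illustrative only, not the actual kernel code; buffer names and tile sizes are made up):

```cpp
#include <cstdio>
#include <vector>

int main() {
  const int group_size = 7;         // e.g. 56 query heads / 8 KV heads
  const int padded_group_size = 8;  // next tile size the kernel supports
  const int head_dim = 4;           // tiny head_dim just for illustration

  // "Global" q holds group_size rows; the staging tile holds padded rows.
  std::vector<float> q(group_size * head_dim, 1.0f);
  std::vector<float> q_smem(padded_group_size * head_dim, 0.0f);

  // Load: rows beyond group_size stay zero-filled; that is the padding.
  for (int g = 0; g < padded_group_size; ++g)
    for (int d = 0; d < head_dim; ++d)
      q_smem[g * head_dim + d] = (g < group_size) ? q[g * head_dim + d] : 0.0f;

  // ... attention would run over all padded_group_size rows here ...
  std::vector<float>& o_smem = q_smem;  // pretend o lives in the same tile

  // Store: only the first group_size rows are written back to global o;
  // the padded rows are discarded.
  std::vector<float> o(group_size * head_dim);
  for (int g = 0; g < group_size; ++g)
    for (int d = 0; d < head_dim; ++d)
      o[g * head_dim + d] = o_smem[g * head_dim + d];

  printf("wrote %d of %d rows back\n", group_size, padded_group_size);
  return 0;
}
```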

yzh119 commented 2 months ago

Hi @xuzhenqi, thanks so much for doing this. I'm refactoring the code to make the group size a regular function argument instead of a template parameter so that we can reduce the binary size. I'll notify you when the PR is ready and list you as a co-author of that PR :)
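
For anyone curious why moving the group size out of the template reduces binary size, here is a rough sketch of the tradeoff (illustrative only, not FlashInfer's actual dispatch code):

```cpp
#include <cstdio>
#include <initializer_list>

// Template parameter: the compiler emits a separate kernel body for every
// GROUP_SIZE that gets instantiated, multiplying binary size.
template <int GROUP_SIZE>
void prefill_templated() { printf("templated kernel, group_size=%d\n", GROUP_SIZE); }

// Regular argument: a single body handles every group size at runtime.
void prefill_runtime(int group_size) { printf("runtime kernel, group_size=%d\n", group_size); }

int main() {
  for (int group_size : {4, 7}) {
    // Static dispatch needs a switch over pre-instantiated kernels, and a
    // value outside the list (e.g. 7) simply has no kernel to call.
    switch (group_size) {
      case 1: prefill_templated<1>(); break;
      case 4: prefill_templated<4>(); break;
      case 8: prefill_templated<8>(); break;
      default: printf("no templated kernel for group_size=%d\n", group_size);
    }
    // Runtime dispatch covers any value with one instantiation.
    prefill_runtime(group_size);
  }
  return 0;
}
```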

Qubitium commented 2 months ago

@yzh119 Do you have an ETA on the dynamic group-size support? It may still be good to merge this PR if there are no performance regressions, and then switch to the new solution when it is ready. Yi-1.5 34B has hit the pipelines, and I believe a lot of users will want to use this model, even more so than Yi 1.0 34B. This PR covers that model and many others that do not fall into the statically compiled group_size slots.
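
For reference, the GQA group size is just num_qo_heads / num_kv_heads; assuming Yi-34B's published config of 56 query heads and 8 KV heads (worth double-checking), its group size is 7, which falls outside typical pre-compiled sizes like 1/2/4/8:

```cpp
#include <cstdio>

int main() {
  // Assumed Yi-34B attention config: 56 query heads, 8 KV heads.
  const int num_qo_heads = 56, num_kv_heads = 8;
  printf("gqa group size = %d\n", num_qo_heads / num_kv_heads);  // prints 7
  return 0;
}
```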

yzh119 commented 1 month ago

@xuzhenqi @Qubitium, following up on this.

We have merged #301, and the prefill kernels now support any GQA group size.