flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

perf: use cub's native BlockLoad/BlockStore for sampling kernels #309

Open yzh119 opened 2 weeks ago

yzh119 commented 2 weeks ago

This change is faster when the hidden dimension is odd, but slower when the hidden dimension is divisible by 4.

Maybe we should use a mixture of BlockLoad/BlockStore and the current solution.
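
One way to read the "mixture" idea is a host-side dispatch on the hidden dimension: keep the current solution where it was measured faster (hidden dimension divisible by 4) and use BlockLoad/BlockStore otherwise. The sketch below only illustrates that selection logic. `LoadPath` and `choose_load_path` are hypothetical names, not part of FlashInfer or CUB, and the threshold of 4 is taken from the comment above rather than from any benchmark table.

```cpp
#include <cstddef>

// Hypothetical enum: the two code paths discussed in this PR.
enum class LoadPath {
  kCurrent,    // the existing load/store path already in the sampling kernels
  kBlockLoad   // cub::BlockLoad / cub::BlockStore, as proposed here
};

// Hypothetical helper: choose a path from the hidden dimension `d`.
// Assumption: divisibility by 4 is the only criterion, mirroring the
// observation that the current solution wins in that case and that
// BlockLoad/BlockStore wins for odd dimensions. Dimensions that are even
// but not divisible by 4 were not reported, so sending them to the
// BlockLoad path is a guess.
constexpr LoadPath choose_load_path(std::size_t d) {
  return (d % 4 == 0) ? LoadPath::kCurrent : LoadPath::kBlockLoad;
}
```

A launcher could then branch on `choose_load_path(d)` and instantiate the matching kernel variant. Whether the extra template instantiations are worth the compile time and binary size would need to be measured.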