flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

refactor: reduce the binary size of batch decode kernels #343

Closed yzh119 closed 5 days ago

yzh119 commented 5 days ago

This PR refactors the batch decode related kernels and makes the following breaking changes:

  1. Remove the batch_decode_with_padded_kv_cache operator; users are encouraged to use BatchDecodeWithPagedKVCacheWrapper instead.
  2. Delete redundant DTypeQ * DTypeKV combinations; only the following cases are now supported:
    1. DTypeQ == DTypeKV
    2. DTypeQ is a float16 type and DTypeKV is a float8 type

The output data type follows the query data type.
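The dtype rules above can be sketched as a small validation helper. This is an illustrative sketch only, not FlashInfer code: the function names and string dtype labels are hypothetical, and the exact set of float16/float8 variants kept by the PR is an assumption.

```python
# Hypothetical helpers illustrating the dtype rules in this PR.
# Dtype names are plain strings for illustration; the real library
# works with torch dtypes.
HALF_TYPES = {"float16", "bfloat16"}          # assumed "float16" family
FP8_TYPES = {"float8_e4m3fn", "float8_e5m2"}  # assumed "float8" family


def is_supported(dtype_q: str, dtype_kv: str) -> bool:
    """Return True if the (DTypeQ, DTypeKV) pair survives this refactor:
    either the dtypes match, or the query is a float16 type while the
    KV cache is a float8 type."""
    if dtype_q == dtype_kv:
        return True
    return dtype_q in HALF_TYPES and dtype_kv in FP8_TYPES


def output_dtype(dtype_q: str, dtype_kv: str) -> str:
    """The output data type follows the query data type."""
    if not is_supported(dtype_q, dtype_kv):
        raise ValueError(f"unsupported combination: {dtype_q} * {dtype_kv}")
    return dtype_q
```

For example, a float16 query against a float8 KV cache is kept and produces float16 output, while a float8 query against a float16 KV cache is rejected.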