flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

refactor: reduce the binary size of batch decode kernels #343

Closed yzh119 closed 5 days ago

yzh119 commented 5 days ago

This PR refactors the batch decode related kernels and makes the following breaking changes:

  1. Remove the batch_decode_with_padded_kv_cache operator; users are encouraged to use BatchDecodeWithPagedKVCacheWrapper instead.
  2. Delete redundant DTypeQ * DTypeKV combinations; only the following cases are now supported:
    1. DTypeQ == DTypeKV
    2. DTypeQ is a float16 type and DTypeKV is a float8 type

The output data type follows the query data type.
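The dtype rules above can be sketched as a small validation helper. This is an illustrative sketch only, not FlashInfer code: the function names and string dtype labels are hypothetical, and the exact set of float16/float8 variants kept by the PR is an assumption.

```python
# Hypothetical helpers illustrating the dtype rules in this PR.
# Dtype names are plain strings for illustration; the real library
# works with torch dtypes.
HALF_TYPES = {"float16", "bfloat16"}          # assumed "float16" family
FP8_TYPES = {"float8_e4m3fn", "float8_e5m2"}  # assumed "float8" family


def is_supported(dtype_q: str, dtype_kv: str) -> bool:
    """Return True if the (DTypeQ, DTypeKV) pair survives this refactor:
    either the dtypes match, or the query is a float16 type while the
    KV cache is a float8 type."""
    if dtype_q == dtype_kv:
        return True
    return dtype_q in HALF_TYPES and dtype_kv in FP8_TYPES


def output_dtype(dtype_q: str, dtype_kv: str) -> str:
    """The output data type follows the query data type."""
    if not is_supported(dtype_q, dtype_kv):
        raise ValueError(f"unsupported combination: {dtype_q} * {dtype_kv}")
    return dtype_q
```

For example, a float16 query against a float8 KV cache is kept and produces float16 output, while a float8 query against a float16 KV cache is rejected.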