flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

feat: support custom attention mask in prefill/append attention kernels #266

Closed · yzh119 closed 1 month ago

yzh119 commented 1 month ago

Some speculative decoding algorithms require tree attention, which can be supported via the prefill/append attention kernels with a custom attention mask.

This PR supports this feature.

Related issues: #152
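For tree attention, the custom mask encodes which draft tokens may attend to which KV positions. Below is a minimal sketch (not part of this PR) of how such a mask could be constructed for a tree of draft tokens, where each token attends to the full prompt plus its ancestors; the helper name and mask layout are illustrative assumptions.

```python
import torch

def build_tree_mask(parent, prompt_len: int) -> torch.Tensor:
    """Boolean mask of shape (num_draft, prompt_len + num_draft).

    parent[i] is the index of draft token i's parent, or -1 for the root.
    Entry (i, j) is True if draft token i may attend to KV position j.
    """
    num_draft = len(parent)
    mask = torch.zeros(num_draft, prompt_len + num_draft, dtype=torch.bool)
    mask[:, :prompt_len] = True  # every draft token sees the whole prompt
    for i in range(num_draft):
        node = i
        while node != -1:  # enable self and all ancestors in the tree
            mask[i, prompt_len + node] = True
            node = parent[node]
    return mask

# Example tree: token 0 is the root, tokens 1 and 2 are its children,
# and token 3 is a child of token 2.
mask = build_tree_mask(parent=[-1, 0, 0, 2], prompt_len=4)
```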

API Breaking Changes

The begin_forward function in BatchPrefillWithPagedKVCacheWrapper now takes an additional page_size argument to accommodate this new feature.
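For illustration, a sketch of how a begin_forward call might look after this change; page_size is the new argument, while the surrounding setup and argument order follow the wrapper's existing Python API and should be treated as assumptions rather than the exact interface of this PR.

```python
import torch
import flashinfer

# Workspace buffer and wrapper setup, following the wrapper's documented usage.
workspace_buffer = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchPrefillWithPagedKVCacheWrapper(workspace_buffer, "NHD")

page_size = 16  # page size of the paged KV cache, now passed to begin_forward
wrapper.begin_forward(
    torch.tensor([0, 4], dtype=torch.int32, device="cuda"),  # qo_indptr
    torch.tensor([0, 2], dtype=torch.int32, device="cuda"),  # paged_kv_indptr
    torch.tensor([0, 1], dtype=torch.int32, device="cuda"),  # paged_kv_indices
    torch.tensor([4], dtype=torch.int32, device="cuda"),     # paged_kv_last_page_len
    32,          # num_qo_heads
    32,          # num_kv_heads
    128,         # head_dim
    page_size,   # new argument introduced by this PR
)
```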