flashinfer-ai / flashinfer

FlashInfer: Kernel Library for LLM Serving
https://flashinfer.ai
Apache License 2.0

perf: use packed bit array for attention mask #308

Closed yzh119 closed 2 weeks ago

yzh119 commented 2 weeks ago

A float attention mask consumes too much GPU memory and slows down the attention kernel. This PR switches to a 0/1 attention mask stored in a bit-packed array (1 bit per element; 8 elements packed into a single uint8) to save GPU memory.
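
For illustration, here is a minimal sketch of the bit-packing scheme described above, written with plain PyTorch ops. The function name `pack_bits` and the MSB-first bit ordering are assumptions for this example, not necessarily what the PR's kernels use:

```python
import torch
import torch.nn.functional as F

def pack_bits(mask: torch.Tensor) -> torch.Tensor:
    """Pack a 0/1 attention mask into a bit array, 8 mask entries per uint8.

    Illustrative sketch only; the actual layout in this PR may differ.
    """
    # Flatten and zero-pad so the length is a multiple of 8.
    flat = mask.flatten().to(torch.uint8)
    pad = (-flat.numel()) % 8
    if pad:
        flat = F.pad(flat, (0, pad))
    # Group 8 mask bits per byte and pack MSB-first:
    # the first mask element becomes bit 7 of its byte.
    chunks = flat.view(-1, 8)
    shifts = torch.arange(7, -1, -1, dtype=torch.uint8, device=flat.device)
    return (chunks << shifts).sum(dim=1, dtype=torch.uint8)

# A boolean mask of 4 x 1024 entries packs into 512 bytes,
# versus 16 KiB for the same mask stored as float32.
mask = torch.randint(0, 2, (4, 1024), dtype=torch.bool)
packed = pack_bits(mask)
```

Relative to a float32 mask, this is a 32x reduction in memory traffic for the mask, which is where both the memory savings and the kernel speedup come from.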