Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

Support for FLASH: Gated Attention Unit #98

Closed: mistycube closed this issue 1 year ago

mistycube commented 1 year ago

Is it possible for the FlashAttention interface to handle gated single-head attention? Maybe the speedup could be even higher.

Paper: https://arxiv.org/pdf/2202.10447.pdf
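
For context, the paper's Gated Attention Unit is single-head attention whose output is elementwise-gated before the output projection, with squared-ReLU scores instead of softmax. A rough PyTorch sketch of that structure (simplified: no chunking, no RoPE, no relative-position bias; module and parameter names here are mine, not from the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    """Minimal single-head Gated Attention Unit sketch (FLASH, arXiv:2202.10447)."""
    def __init__(self, dim, expansion=2, s=128):
        super().__init__()
        e = dim * expansion
        self.to_u = nn.Linear(dim, e)   # gate branch
        self.to_v = nn.Linear(dim, e)   # value branch
        self.to_z = nn.Linear(dim, s)   # shared low-dim representation for q/k
        # per-dimension scale/offset that turn z into queries and keys
        self.gamma = nn.Parameter(torch.ones(2, s))
        self.beta = nn.Parameter(torch.zeros(2, s))
        self.to_out = nn.Linear(e, dim)

    def forward(self, x):               # x: (batch, seqlen, dim)
        n = x.shape[1]
        u = F.silu(self.to_u(x))        # (b, n, e) gate
        v = F.silu(self.to_v(x))        # (b, n, e) values
        z = F.silu(self.to_z(x))        # (b, n, s)
        q = z * self.gamma[0] + self.beta[0]
        k = z * self.gamma[1] + self.beta[1]
        # squared-ReLU attention scores (single head), not softmax
        a = F.relu(torch.einsum('bns,bms->bnm', q, k) / n) ** 2
        o = u * torch.einsum('bnm,bme->bne', a, v)   # elementwise gating
        return self.to_out(o)
```

The attention core uses squared-ReLU scoring rather than softmax, so supporting it efficiently would mean changing the inner kernel, not just the module around it.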

tridao commented 1 year ago

We don't have that out of the box. Feel free to play with the Triton implementation (it's a self-contained Python file).
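
For anyone who wants to start from that file, here is a hedged usage sketch. I'm assuming the Triton implementation (flash_attn/flash_attn_triton.py) exposes a flash_attn_func(q, k, v, bias=None, causal=False, softmax_scale=None) entry point over (batch, seqlen, nheads, headdim) fp16/bf16 CUDA tensors; check the file for the exact name, signature, and shape/dtype requirements before relying on this.

```python
# Assumed import path and signature -- verify against flash_attn_triton.py.
import torch
from flash_attn.flash_attn_triton import flash_attn_func

b, n, h, d = 2, 1024, 1, 64   # single head, as in the GAU setting
q = torch.randn(b, n, h, d, device='cuda', dtype=torch.float16)
k = torch.randn(b, n, h, d, device='cuda', dtype=torch.float16)
v = torch.randn(b, n, h, d, device='cuda', dtype=torch.float16)

out = flash_attn_func(q, k, v, bias=None, causal=False)  # (b, n, h, d), softmax attention
```

Note this computes standard softmax attention; reproducing the GAU's squared-ReLU scoring or fusing the gating into the kernel would mean editing the Triton code itself, which is what the self-contained file makes practical.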

mistycube commented 1 year ago

Sure. Thanks for your prompt reply.