Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

Support for FLASH: Gated Attention Unit #98

Closed: mistycube closed this issue 1 year ago

mistycube commented 1 year ago

Is it possible for the FlashAttention interface to handle gated single-head attention? Maybe the speedup could be even higher.

Paper: https://arxiv.org/pdf/2202.10447.pdf
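
For context, the paper's Gated Attention Unit is single-head attention whose output is elementwise-gated before the output projection, with squared-ReLU scores instead of softmax. A rough PyTorch sketch of that structure (simplified: no chunking, no RoPE, no relative-position bias; module and parameter names here are mine, not from the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAU(nn.Module):
    """Minimal single-head Gated Attention Unit sketch (FLASH, arXiv:2202.10447)."""
    def __init__(self, dim, expansion=2, s=128):
        super().__init__()
        e = dim * expansion
        self.to_u = nn.Linear(dim, e)   # gate branch
        self.to_v = nn.Linear(dim, e)   # value branch
        self.to_z = nn.Linear(dim, s)   # shared low-dim representation for q/k
        # per-dimension scale/offset that turn z into queries and keys
        self.gamma = nn.Parameter(torch.ones(2, s))
        self.beta = nn.Parameter(torch.zeros(2, s))
        self.to_out = nn.Linear(e, dim)

    def forward(self, x):               # x: (batch, seqlen, dim)
        n = x.shape[1]
        u = F.silu(self.to_u(x))        # (b, n, e) gate
        v = F.silu(self.to_v(x))        # (b, n, e) values
        z = F.silu(self.to_z(x))        # (b, n, s)
        q = z * self.gamma[0] + self.beta[0]
        k = z * self.gamma[1] + self.beta[1]
        # squared-ReLU attention scores (single head), not softmax
        a = F.relu(torch.einsum('bns,bms->bnm', q, k) / n) ** 2
        o = u * torch.einsum('bnm,bme->bne', a, v)   # elementwise gating
        return self.to_out(o)
```

The attention core uses squared-ReLU scoring rather than softmax, so supporting it efficiently would mean changing the inner kernel, not just the module around it.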

tridao commented 1 year ago

We don't have that out of the box. Feel free to play with the Triton implementation (it's a self-contained Python file).
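
For anyone who wants to start from that file, here is a hedged usage sketch. I'm assuming the Triton implementation (flash_attn/flash_attn_triton.py) exposes a flash_attn_func(q, k, v, bias=None, causal=False, softmax_scale=None) entry point over (batch, seqlen, nheads, headdim) fp16/bf16 CUDA tensors; check the file for the exact name, signature, and shape/dtype requirements before relying on this.

```python
# Assumed import path and signature -- verify against flash_attn_triton.py.
import torch
from flash_attn.flash_attn_triton import flash_attn_func

b, n, h, d = 2, 1024, 1, 64   # single head, as in the GAU setting
q = torch.randn(b, n, h, d, device='cuda', dtype=torch.float16)
k = torch.randn(b, n, h, d, device='cuda', dtype=torch.float16)
v = torch.randn(b, n, h, d, device='cuda', dtype=torch.float16)

out = flash_attn_func(q, k, v, bias=None, causal=False)  # (b, n, h, d), softmax attention
```

Note this computes standard softmax attention; reproducing the GAU's squared-ReLU scoring or fusing the gating into the kernel would mean editing the Triton code itself, which is what the self-contained file makes practical.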

mistycube commented 1 year ago

Sure. Thanks for your prompt reply.