Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

Will FlashAttention embed self-extend? #868

Open wangyuxin87 opened 4 months ago

wangyuxin87 commented 4 months ago

https://github.com/datamllab/LongLM

Mooler0410 commented 3 months ago

Hi! We just implemented a FlashAttention path for self-extend using the windowed attention supported by flash_attn. In short, we merge two FlashAttention calls to obtain the self-extend attention. Check https://github.com/datamllab/LongLM/pull/28 for more details! At the cost of a slight increase in memory usage and runtime, this implementation can extend the context window up to 10x for Llama, Mistral, Gemma, and Qwen1.5 in a fine-tuning-free way.
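
For reference, here is a minimal sketch of the merging idea: two `flash_attn_func` calls (one restricted to a local window, one over grouped positions) are combined through their log-sum-exp (LSE) values. It assumes flash-attn >= 2.3 (for `window_size`) and that `return_attn_probs=True` returns `(out, softmax_lse, _)`. The tensor construction, the `merge_attention` helper, and the omitted overlap masking are illustrative assumptions, not the code from the PR; see the PR for the actual implementation.

```python
# Sketch of merging two FlashAttention results via their log-sum-exp values.
import torch
from flash_attn import flash_attn_func


def merge_attention(out_a, lse_a, out_b, lse_b):
    """Combine two partial softmax-attention outputs over disjoint key sets.

    out_*: (batch, seqlen_q, nheads, headdim)
    lse_*: (batch, nheads, seqlen_q), log-sum-exp of the attention scores.
    """
    lse_a = lse_a.transpose(1, 2).unsqueeze(-1)  # -> (batch, seqlen_q, nheads, 1)
    lse_b = lse_b.transpose(1, 2).unsqueeze(-1)
    lse = torch.logaddexp(lse_a, lse_b)          # combined normalizer
    merged = out_a * (lse_a - lse).exp() + out_b * (lse_b - lse).exp()
    return merged.to(out_a.dtype)


batch, seqlen, nheads, headdim, window = 1, 4096, 8, 64, 512
q, k, v, q_g, k_g = (
    torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
    for _ in range(5)
)

# Neighbor part: ordinary positions, attention restricted to a local window.
out_n, lse_n, _ = flash_attn_func(
    q, k, v, causal=True, window_size=(window, 0), return_attn_probs=True
)
# Grouped part: q/k would carry grouped (floor-divided) positions, full range.
out_g, lse_g, _ = flash_attn_func(
    q_g, k_g, v, causal=True, return_attn_probs=True
)
# The real implementation also removes the keys both calls cover before merging;
# that masking step is omitted here.
out = merge_attention(out_n, lse_n, out_g, lse_g)
```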

But we are still looking forward to an official implementation of such a two-part FlashAttention with a window!