Dao-AILab / flash-attention

Fast and memory-efficient exact attention
BSD 3-Clause "New" or "Revised" License

Will FlashAttention embed self-extend? #868

Open wangyuxin87 opened 4 months ago

wangyuxin87 commented 4 months ago

https://github.com/datamllab/LongLM

Mooler0410 commented 3 months ago

Hi! We just implemented a FlashAttention path for self-extend using the windowed attention supported by flash_attn. In short, we merge two FlashAttention calls to obtain the self-extend attention. Check https://github.com/datamllab/LongLM/pull/28 for more details! At the cost of a slight increase in memory usage and runtime, this implementation can extend the context window up to 10x for Llama, Mistral, Gemma, and Qwen1.5 in a fine-tuning-free way.
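
For reference, here is a minimal sketch of the merging idea: two `flash_attn_func` calls (one restricted to a local window, one over grouped positions) are combined through their log-sum-exp (LSE) values. It assumes flash-attn >= 2.3 (for `window_size`) and that `return_attn_probs=True` returns `(out, softmax_lse, _)`. The tensor construction, the `merge_attention` helper, and the omitted overlap masking are illustrative assumptions, not the code from the PR; see the PR for the actual implementation.

```python
# Sketch of merging two FlashAttention results via their log-sum-exp values.
import torch
from flash_attn import flash_attn_func


def merge_attention(out_a, lse_a, out_b, lse_b):
    """Combine two partial softmax-attention outputs over disjoint key sets.

    out_*: (batch, seqlen_q, nheads, headdim)
    lse_*: (batch, nheads, seqlen_q), log-sum-exp of the attention scores.
    """
    lse_a = lse_a.transpose(1, 2).unsqueeze(-1)  # -> (batch, seqlen_q, nheads, 1)
    lse_b = lse_b.transpose(1, 2).unsqueeze(-1)
    lse = torch.logaddexp(lse_a, lse_b)          # combined normalizer
    merged = out_a * (lse_a - lse).exp() + out_b * (lse_b - lse).exp()
    return merged.to(out_a.dtype)


batch, seqlen, nheads, headdim, window = 1, 4096, 8, 64, 512
q, k, v, q_g, k_g = (
    torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
    for _ in range(5)
)

# Neighbor part: ordinary positions, attention restricted to a local window.
out_n, lse_n, _ = flash_attn_func(
    q, k, v, causal=True, window_size=(window, 0), return_attn_probs=True
)
# Grouped part: q/k would carry grouped (floor-divided) positions, full range.
out_g, lse_g, _ = flash_attn_func(
    q_g, k_g, v, causal=True, return_attn_probs=True
)
# The real implementation also removes the keys both calls cover before merging;
# that masking step is omitted here.
out = merge_attention(out_n, lse_n, out_g, lse_g)
```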

But we are still looking forward to an official implementation of such a two-part FlashAttention with a window!