wangyuxin87 opened 4 months ago
Hi! We just implemented FlashAttention for SelfExtend, using the windowed FA already supported by flash_attn. In short, we merge two FlashAttention passes to compute SelfExtend's attention. Check https://github.com/datamllab/LongLM/pull/28 for more details! At the cost of slightly increased memory usage and runtime, this implementation can extend the context window of Llama, Mistral, Gemma, and Qwen1.5 up to 10x, in a fine-tuning-free way.
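To illustrate the "merge two attention passes" idea, here is a minimal numpy sketch of the SelfExtend merge rule: tokens within a neighbor window attend with their exact relative positions, while more distant tokens attend with grouped (floored) positions, and a single softmax is taken over the merged scores. This is a toy: a simple distance penalty stands in for RoPE, and the function name, `window`, `group`, and `pos_penalty` parameters are illustrative, not the actual LongLM or flash_attn API.

```python
import numpy as np

def self_extend_attn_sketch(q, k, v, window=4, group=2, pos_penalty=0.1):
    """Toy SelfExtend-style attention (single head, no batching).

    Inside the neighbor window, the exact relative distance is used;
    beyond it, distances are remapped via floor division by `group`,
    so far-away positions stay within the range seen during training.
    A linear distance penalty stands in for RoPE here.
    """
    seq, dim = q.shape
    scores = (q @ k.T) / np.sqrt(dim)
    i = np.arange(seq)[:, None]
    j = np.arange(seq)[None, :]
    dist = i - j
    # Merge the two "passes": exact distance inside the window,
    # grouped (compressed) distance outside it.
    eff = np.where(dist < window, dist, window + (dist - window) // group)
    scores = scores - pos_penalty * eff
    scores = np.where(dist >= 0, scores, -np.inf)  # causal mask
    # One softmax over the merged score matrix.
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

The real implementation instead runs flash_attn twice (a windowed pass with normal position ids and a full pass with floored position ids) and merges the results, which is why it costs somewhat more memory and time than a single FA call.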
But we're still looking forward to an official implementation of this kind of two-part windowed FlashAttention!
https://github.com/datamllab/LongLM