We already have this implementation. In fact, the new results we released on X (formerly Twitter) with Google's Gemma are based on it (otherwise we could not run on sequences longer than 30k tokens). However, with the current implementation we cannot reproduce the results on LongBench that we reported in the paper, which were obtained with the non-flash-attention version. There is a minor performance gap between the two versions.
We are still trying to figure out the reason.
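One way to sanity-check whether a gap like this comes from kernel-level numerics (rather than a logic bug) is to compare a fused attention path against a float32 reference on the same inputs. Below is a minimal sketch using PyTorch's `scaled_dot_product_attention`; the shapes, dtype, and the use of SDPA here are illustrative assumptions, not the actual code from this repo.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Random Q/K/V in half precision, mimicking a long-context setting.
# Shapes and dtype are illustrative, not taken from this repo.
q = torch.randn(1, 8, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Reference: naive attention computed in float32.
scale = q.shape[-1] ** -0.5
attn = torch.softmax((q.float() @ k.float().transpose(-2, -1)) * scale, dim=-1)
ref = (attn @ v.float()).to(q.dtype)

# Fused path: PyTorch's scaled_dot_product_attention, which may dispatch
# to a FlashAttention kernel on supported GPUs.
fused = F.scaled_dot_product_attention(q, k, v)

# A small max difference is expected fp16 kernel noise; a large one would
# point at a genuine implementation discrepancy worth bisecting.
print((ref - fused).abs().max().item())
```

If the per-token differences stay within ordinary fp16 noise, the LongBench gap is more likely to come from something else (masking, KV handling, or evaluation setup) than from the flash attention kernel itself.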