We already have this implementation. In fact, the new results we released on X (formerly Twitter) with Google's Gemma are based on it (otherwise we could not run on sequences longer than 30k tokens). However, with the current implementation we cannot reproduce the results on LongBench that we reported in the paper, which were obtained with the non-flash-attention version. There is a minor performance gap between the two versions.
We are still trying to figure out the reason.
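One way to sanity-check whether a gap like this comes from kernel-level numerics (rather than a logic bug) is to compare a fused attention path against a float32 reference on the same inputs. Below is a minimal sketch using PyTorch's `scaled_dot_product_attention`; the shapes, dtype, and the use of SDPA here are illustrative assumptions, not the actual code from this repo.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Random Q/K/V in half precision, mimicking a long-context setting.
# Shapes and dtype are illustrative, not taken from this repo.
q = torch.randn(1, 8, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

# Reference: naive attention computed in float32.
scale = q.shape[-1] ** -0.5
attn = torch.softmax((q.float() @ k.float().transpose(-2, -1)) * scale, dim=-1)
ref = (attn @ v.float()).to(q.dtype)

# Fused path: PyTorch's scaled_dot_product_attention, which may dispatch
# to a FlashAttention kernel on supported GPUs.
fused = F.scaled_dot_product_attention(q, k, v)

# A small max difference is expected fp16 kernel noise; a large one would
# point at a genuine implementation discrepancy worth bisecting.
print((ref - fused).abs().max().item())
```

If the per-token differences stay within ordinary fp16 noise, the LongBench gap is more likely to come from something else (masking, KV handling, or evaluation setup) than from the flash attention kernel itself.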