ContextualAI / gritlm

Generative Representational Instruction Tuning
https://arxiv.org/abs/2402.09906
MIT License

eager vs. flash vs. sdpa #9

Closed hjq133 closed 7 months ago

hjq133 commented 7 months ago

Solid work!

Have you ever compared the differences between flash attention, sdpa, and eager attention? I used GritLM to test these three attention implementations during fine-tuning and found that their speed and memory usage are almost the same. Did you observe the same?

Thanks!
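For reference, a minimal sketch of how the three implementations can be switched when loading the model through transformers' `attn_implementation` argument (this assumes transformers >= 4.36, a Mistral-style GritLM checkpoint, and FlashAttention-2 installed for the `"flash_attention_2"` variant; it is not necessarily the exact fine-tuning setup used above):

```python
# Illustrative only: load the same checkpoint with each attention
# implementation and print which attention class ends up being used.
import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "GritLM/GritLM-7B"

for impl in ("eager", "sdpa", "flash_attention_2"):
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.bfloat16,
        attn_implementation=impl,
    )
    # Path assumes a Mistral-style decoder (model.model.layers[i].self_attn).
    print(impl, type(model.model.layers[0].self_attn).__name__)
```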

Muennighoff commented 7 months ago

Yeah, we found that SDPA with the latest torch version is the best for both speed and memory. It uses FA2 (FlashAttention-2) under the hood.
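A minimal sketch of checking that PyTorch's SDPA actually dispatches to the FlashAttention kernel, which would explain why `"sdpa"` and `"flash_attention_2"` behave almost identically (assumes torch >= 2.3 on a supported GPU; the backend-selection API is from `torch.nn.attention`, not from this repo):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Whether the FlashAttention backend is enabled for scaled_dot_product_attention.
print("flash SDP enabled:", torch.backends.cuda.flash_sdp_enabled())

# (batch, heads, seq_len, head_dim) in bf16 on GPU, as FlashAttention requires.
q = k = v = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.bfloat16)

# Restrict SDPA to the FlashAttention backend only; this errors out if the
# inputs or hardware do not meet the FlashAttention constraints.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)
```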