Closed hjq133 closed 7 months ago
Solid work!

Have you ever compared flash attention, SDPA, and eager attention? I used GRITLM to test these three attention implementations during finetuning and found that their speed and memory usage are almost the same. Do you see the same?

Thanks!
Yeah, we found that SDPA with the latest torch version is the best for both speed and memory. It dispatches to the FA2 kernel under the hood.
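For context, all three implementations compute the same function, softmax(QK^T/sqrt(d))V; they differ only in kernel fusion and memory traffic (SDPA and FA2 avoid materializing the full attention matrix in the way the fused kernel schedules the computation). A minimal NumPy sketch of the eager reference computation (illustrative only, not the GritLM code):

```python
import numpy as np

def eager_attention(q, k, v):
    """Reference (eager) scaled dot-product attention.

    q, k, v: arrays of shape (seq_len, d_head).
    Returns an array of shape (seq_len, d_head).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (seq, seq) similarity scores
    scores -= scores.max(axis=-1, keepdims=True)     # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ v                               # weighted sum of value vectors

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))
out = eager_attention(q, q, q)
print(out.shape)  # (4, 8)
```

FA2 and SDPA produce (numerically close to) this same output, so for short sequences or small batches the speed and memory gap between the three can indeed be negligible; the fused kernels pay off mainly at long sequence lengths, where the (seq, seq) score matrix dominates memory.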