Dear Author,
Thank you for open-sourcing such a great piece of work. Could you elaborate on how much speed and memory efficiency FlashAttention brings to PTv3? Additionally, you mentioned that "FlashAttention force disables RPE and forces the accuracy reduced to fp16". Does reducing the attention precision from fp32 to fp16 have a significant negative impact on accuracy?
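For context on the precision part of my question, here is a minimal NumPy sketch (my own toy check, not PTv3 code) comparing plain softmax attention computed in fp32 versus fp16 on random inputs; the observed deviation is small on synthetic data, but I am unsure whether this holds at the scale and depth of PTv3 training:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention: softmax(q k^T / sqrt(d)) v.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
shape = (1, 128, 64)  # (batch, tokens, head dim) — arbitrary toy sizes
q = rng.standard_normal(shape).astype(np.float32)
k = rng.standard_normal(shape).astype(np.float32)
v = rng.standard_normal(shape).astype(np.float32)

out_fp32 = attention(q, k, v)
# Same computation with inputs cast down to fp16, result cast back for comparison.
out_fp16 = attention(q.astype(np.float16),
                     k.astype(np.float16),
                     v.astype(np.float16)).astype(np.float32)

err = np.abs(out_fp32 - out_fp16).max()
print(f"max abs deviation fp16 vs fp32: {err:.2e}")
```

Of course this toy check says nothing about accumulated error over many layers or about gradients during training, which is why I am asking about your empirical experience.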
Thank you!