Hi
Thank you very much for the great work!
I noticed the config contains the line `use_transformer_engine: true`, so I was wondering: do you use FP8 Linear layers, FP8 FlashAttention-3, or mixed FP8/FP16 precision during training?
Furthermore, could you share your experience with the stability of FP8 training? I think this would be very helpful for the community's understanding of the whole process.
Many thanks!