Hi
Thank you very much for the great work!
I noticed the config contains the line `use_transformer_engine: true`, so I was wondering: do you use FP8 Linear layers, FP8 FlashAttention-3, or mixed FP8/FP16 precision during training?
Furthermore, could you share your experience with the stability of FP8 training? I think this would be very helpful for the community's understanding of the whole process.
Many thanks!