MoE requires some small changes to TE (see this PR), which are only available since TE v1.7+. We plan to adopt TE's linear layer when we enable FP8 for MoE training. Stay tuned. For BF16, we need more comprehensive convergence tests before switching to TE's linear layer.
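To illustrate the direction (a hypothetical sketch only, not merged behavior, and assuming the TE wrapper classes currently exposed in Megatron-Core), adopting TE's linear layer would mainly mean pointing the expert MLP submodules at the TE-backed classes instead of the local ones:

```python
# Hypothetical sketch only -- not current Megatron-LM behavior.
# Assumes the TE wrappers exposed under megatron/core/transformer/custom_layers/.
from megatron.core.transformer.custom_layers.transformer_engine import (
    TEColumnParallelLinear,
    TERowParallelLinear,
)
from megatron.core.transformer.mlp import MLPSubmodules

# Expert MLP submodules backed by TransformerEngine, which is what would
# unlock FP8 execution once the required TE changes (v1.7+) are in place.
te_expert_mlp_submodules = MLPSubmodules(
    linear_fc1=TEColumnParallelLinear,
    linear_fc2=TERowParallelLinear,
)
```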
Thanks a lot 😀
I noticed that the TransformerEngine implementation is not used when building an MoE layer, even when `use_te` is specified: https://github.com/NVIDIA/Megatron-LM/blob/a5534c8f3e2c49ad8ce486f5cba3408e14f5fcc2/megatron/core/models/gpt/gpt_layer_specs.py#L101-L106

I wonder what the reason is for not using the TE implementation here.
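For reference, the linked lines build the MoE branch of the MLP spec roughly like the sketch below (a paraphrase rather than a verbatim copy of that commit, so exact module names and paths may differ slightly); the expert MLP is wired to Megatron-Core's local linear layers regardless of `use_te`:

```python
# Rough paraphrase of the MoE branch of the MLP module spec (not verbatim):
# the expert MLP always uses the local tensor-parallel linear layers, so the
# use_te flag has no effect on this path.
from megatron.core.tensor_parallel import ColumnParallelLinear, RowParallelLinear
from megatron.core.transformer.mlp import MLPSubmodules
from megatron.core.transformer.moe.moe_layer import MoELayer
from megatron.core.transformer.spec_utils import ModuleSpec


def moe_mlp_spec(moe_grouped_gemm: bool = False) -> ModuleSpec:
    return ModuleSpec(
        module=MoELayer,
        submodules=(
            MLPSubmodules(
                linear_fc1=ColumnParallelLinear,  # local, non-TE linear
                linear_fc2=RowParallelLinear,     # local, non-TE linear
            )
            if not moe_grouped_gemm
            else None  # grouped-GEMM experts use their own fused linears
        ),
    )
```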