NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] Why TE is not used for an MoE layer? #850

Closed: Btlmd closed this issue 3 weeks ago

Btlmd commented 1 month ago

I noticed that the TransformerEngine implementation is not used when building an MoE layer, even when `use_te` is specified.

https://github.com/NVIDIA/Megatron-LM/blob/a5534c8f3e2c49ad8ce486f5cba3408e14f5fcc2/megatron/core/models/gpt/gpt_layer_specs.py#L101-L106
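For reference, the linked spec builder selects submodules roughly along these lines (a paraphrased sketch, not a verbatim copy of the pinned commit; `ModuleSpec`, `MLPSubmodules`, `MoELayer`, and the tensor-parallel linears are the existing `megatron.core` classes, but exact names and signatures may differ by version):

```python
from megatron.core.tensor_parallel import ColumnParallelLinear, RowParallelLinear
from megatron.core.transformer.custom_layers.transformer_engine import (
    TELayerNormColumnParallelLinear,
    TERowParallelLinear,
)
from megatron.core.transformer.mlp import MLP, MLPSubmodules
from megatron.core.transformer.moe.moe_layer import MoELayer
from megatron.core.transformer.spec_utils import ModuleSpec


def _get_mlp_module_spec(use_te=True, num_experts=None, moe_grouped_gemm=False):
    if num_experts is None:
        # Dense MLP: TE-backed linears are selected when use_te is set.
        return ModuleSpec(
            module=MLP,
            submodules=MLPSubmodules(
                linear_fc1=TELayerNormColumnParallelLinear if use_te else ColumnParallelLinear,
                linear_fc2=TERowParallelLinear if use_te else RowParallelLinear,
            ),
        )
    # MoE: the expert linears are always Megatron-Core's local
    # tensor-parallel linears, regardless of use_te.
    return ModuleSpec(
        module=MoELayer,
        submodules=(
            MLPSubmodules(linear_fc1=ColumnParallelLinear, linear_fc2=RowParallelLinear)
            if not moe_grouped_gemm
            else None
        ),
    )
```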

I am wondering what the reason is for not using the TE implementation.

yaox12 commented 3 weeks ago

MoE requires some small changes to TE (see this PR), which are only available in TE v1.7 and later. We plan to adopt TE's linear layers when enabling FP8 for MoE training. Stay tuned. For BF16, we need more comprehensive convergence tests before switching to TE's linear layers.
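For anyone following along, adopting TE's linear layers for the experts would presumably amount to swapping the expert submodules over to the TE-backed wrappers once the TE v1.7+ changes land. A hypothetical sketch (the wrapper names `TEColumnParallelLinear` / `TERowParallelLinear` are the existing TE wrappers in `megatron.core` but may differ by version, and FP8 execution itself is enabled separately via TE's fp8 recipe, not via this spec):

```python
from megatron.core.tensor_parallel import ColumnParallelLinear, RowParallelLinear
from megatron.core.transformer.custom_layers.transformer_engine import (
    TEColumnParallelLinear,
    TERowParallelLinear,
)
from megatron.core.transformer.mlp import MLPSubmodules
from megatron.core.transformer.moe.moe_layer import MoELayer
from megatron.core.transformer.spec_utils import ModuleSpec


def _get_moe_module_spec_with_te(use_te=True):
    # Hypothetical: route the expert GEMMs through TE linears so that
    # FP8 execution (enabled elsewhere, e.g. via te.fp8_autocast and an
    # FP8 recipe) also covers the expert MLPs.
    return ModuleSpec(
        module=MoELayer,
        submodules=MLPSubmodules(
            linear_fc1=TEColumnParallelLinear if use_te else ColumnParallelLinear,
            linear_fc2=TERowParallelLinear if use_te else RowParallelLinear,
        ),
    )
```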

Btlmd commented 3 weeks ago

Thanks a lot 😀