NVIDIA / Megatron-LM

Ongoing research training transformer models at scale

[QUESTION] Why is TELayerNormColumnParallelLinear used instead of TEColumnParallelLinear in gpt_layer_specs #884

Open clarence-lee-sheng opened 1 week ago

clarence-lee-sheng commented 1 week ago

In megatron/core/models/gpt/gpt_layer_specs.py, line 95 reads "linear_fc1=TELayerNormColumnParallelLinear if use_te else ColumnParallelLinear". Why is TELayerNormColumnParallelLinear used here instead of TEColumnParallelLinear, given that TEColumnParallelLinear should be the direct equivalent of ColumnParallelLinear?
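
For context, here is a paraphrased sketch of the submodule spec around that line (simplified, not the verbatim file; `use_te` stands in for the flag that selects the Transformer Engine path, and import paths may differ across Megatron-Core versions):

```python
# Paraphrased sketch of the MLP submodule spec in gpt_layer_specs.py; the class
# names are from Megatron-Core, but the exact code and import paths may vary.
from megatron.core.tensor_parallel import ColumnParallelLinear, RowParallelLinear
from megatron.core.transformer.mlp import MLP, MLPSubmodules
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.custom_layers.transformer_engine import (
    TELayerNormColumnParallelLinear,
    TERowParallelLinear,
)

use_te = True  # illustrative flag: whether Transformer Engine kernels are used

mlp_spec = ModuleSpec(
    module=MLP,
    submodules=MLPSubmodules(
        # The line in question: the TE path uses the LayerNorm+Linear fusion,
        # not the plain TEColumnParallelLinear.
        linear_fc1=TELayerNormColumnParallelLinear if use_te else ColumnParallelLinear,
        linear_fc2=TERowParallelLinear if use_te else RowParallelLinear,
    ),
)
```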

kiddyboots216 commented 3 days ago

This line https://github.com/NVIDIA/Megatron-LM/blob/e33c8f78a35765d5aa37475a144da60e8a2349d1/megatron/core/models/gpt/gpt_layer_specs.py#L45 makes the pre_mlp_layernorm an IdentityOp for dense models, so the LayerNorm has to be fused into fc1 of the MLP, which is exactly what TELayerNormColumnParallelLinear does. pre_mlp_layernorm is a real op only for MoE models, where it is a FusedLayerNorm and the MoE's GroupedMLP then has no LayerNorm of its own.

TL;DR: either way there is just one LayerNorm call before the MLP.
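
A minimal sketch of the coupling described above (paraphrased, not the verbatim spec; import paths are approximate and may vary across Megatron-Core versions):

```python
# Sketch, assuming Megatron-Core's spec utilities; not the verbatim gpt_layer_specs.py.
from megatron.core.transformer.identity_op import IdentityOp
from megatron.core.fusions.fused_layer_norm import FusedLayerNorm
from megatron.core.transformer.mlp import MLP, MLPSubmodules
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.custom_layers.transformer_engine import (
    TELayerNormColumnParallelLinear,
    TERowParallelLinear,
)

# Dense layer: pre_mlp_layernorm is an IdentityOp, so the norm has to live
# inside fc1 -- TELayerNormColumnParallelLinear runs LayerNorm and the
# column-parallel GEMM as one fused module.
dense_pre_mlp_layernorm = IdentityOp
dense_mlp = ModuleSpec(
    module=MLP,
    submodules=MLPSubmodules(
        linear_fc1=TELayerNormColumnParallelLinear,  # fused LayerNorm + Linear
        linear_fc2=TERowParallelLinear,
    ),
)

# MoE layer: the norm is applied explicitly before routing (FusedLayerNorm),
# and the experts' GroupedMLP carries no LayerNorm of its own -- so in both
# cases each token passes through exactly one LayerNorm before the MLP GEMMs.
moe_pre_mlp_layernorm = FusedLayerNorm
```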