NVIDIA / Megatron-LM

Ongoing research training transformer models at scale

[QUESTION] Why is TELayerNormColumnParallelLinear used instead of TEColumnParallelLinear in gpt_layer_specs #884

Open clarence-lee-sheng opened 1 week ago

clarence-lee-sheng commented 1 week ago

In megatron/core/models/gpt/gpt_layer_specs.py, line 95 reads "linear_fc1=TELayerNormColumnParallelLinear if use_te else ColumnParallelLinear". Why is TELayerNormColumnParallelLinear used here instead of TEColumnParallelLinear, given that TEColumnParallelLinear should be the direct equivalent of ColumnParallelLinear?
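
For context, here is a paraphrased sketch of the submodule spec around that line (simplified, not the verbatim file; `use_te` stands in for the flag that selects the Transformer Engine path, and import paths may differ across Megatron-Core versions):

```python
# Paraphrased sketch of the MLP submodule spec in gpt_layer_specs.py; the class
# names are from Megatron-Core, but the exact code and import paths may vary.
from megatron.core.tensor_parallel import ColumnParallelLinear, RowParallelLinear
from megatron.core.transformer.mlp import MLP, MLPSubmodules
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.custom_layers.transformer_engine import (
    TELayerNormColumnParallelLinear,
    TERowParallelLinear,
)

use_te = True  # illustrative flag: whether Transformer Engine kernels are used

mlp_spec = ModuleSpec(
    module=MLP,
    submodules=MLPSubmodules(
        # The line in question: the TE path uses the LayerNorm+Linear fusion,
        # not the plain TEColumnParallelLinear.
        linear_fc1=TELayerNormColumnParallelLinear if use_te else ColumnParallelLinear,
        linear_fc2=TERowParallelLinear if use_te else RowParallelLinear,
    ),
)
```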

kiddyboots216 commented 3 days ago

This line https://github.com/NVIDIA/Megatron-LM/blob/e33c8f78a35765d5aa37475a144da60e8a2349d1/megatron/core/models/gpt/gpt_layer_specs.py#L45 makes the pre_mlp_layernorm an IdentityOp for dense models, so the LayerNorm has to be fused into fc1 of the MLP, which is exactly what TELayerNormColumnParallelLinear does. pre_mlp_layernorm is a real op only for MoE models, where it is a FusedLayerNorm and the MoE's GroupedMLP then has no LayerNorm of its own.

TL;DR: either way there is just one LayerNorm call before the MLP.
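
A minimal sketch of the coupling described above (paraphrased, not the verbatim spec; import paths are approximate and may vary across Megatron-Core versions):

```python
# Sketch, assuming Megatron-Core's spec utilities; not the verbatim gpt_layer_specs.py.
from megatron.core.transformer.identity_op import IdentityOp
from megatron.core.fusions.fused_layer_norm import FusedLayerNorm
from megatron.core.transformer.mlp import MLP, MLPSubmodules
from megatron.core.transformer.spec_utils import ModuleSpec
from megatron.core.transformer.custom_layers.transformer_engine import (
    TELayerNormColumnParallelLinear,
    TERowParallelLinear,
)

# Dense layer: pre_mlp_layernorm is an IdentityOp, so the norm has to live
# inside fc1 -- TELayerNormColumnParallelLinear runs LayerNorm and the
# column-parallel GEMM as one fused module.
dense_pre_mlp_layernorm = IdentityOp
dense_mlp = ModuleSpec(
    module=MLP,
    submodules=MLPSubmodules(
        linear_fc1=TELayerNormColumnParallelLinear,  # fused LayerNorm + Linear
        linear_fc2=TERowParallelLinear,
    ),
)

# MoE layer: the norm is applied explicitly before routing (FusedLayerNorm),
# and the experts' GroupedMLP carries no LayerNorm of its own -- so in both
# cases each token passes through exactly one LayerNorm before the MLP GEMMs.
moe_pre_mlp_layernorm = FusedLayerNorm
```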