clarence-lee-sheng opened 1 week ago
This line https://github.com/NVIDIA/Megatron-LM/blob/e33c8f78a35765d5aa37475a144da60e8a2349d1/megatron/core/models/gpt/gpt_layer_specs.py#L45 makes `pre_mlp_layernorm` an `IdentityOp` when you are building a dense model. The LayerNorm therefore has to be fused into `fc1` of the MLP, which is exactly what `TELayerNormColumnParallelLinear` does. `pre_mlp_layernorm` is only a real op when you are building an MoE, in which case it is a `FusedLayerNorm`, and the MoE's `GroupedMLP` then has no LayerNorm of its own.

TL;DR: either way there is exactly one LayerNorm call.
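The selection logic described above can be sketched in plain Python. This is an illustrative simplification, not the actual Megatron-LM spec code: the function name `pick_mlp_modules` and its return convention are invented here, and the MoE branch glosses over expert-internal details.

```python
def pick_mlp_modules(num_experts, use_te):
    """Illustrative sketch: which modules fill the pre-MLP norm and fc1
    slots of a GPT layer spec (names mirror Megatron-LM classes)."""
    if num_experts is None:
        # Dense model: no standalone pre-MLP norm. With TE, the norm is
        # fused into fc1 by TELayerNormColumnParallelLinear, so exactly
        # one LayerNorm still runs per MLP.
        pre_mlp_layernorm = "IdentityOp"
        linear_fc1 = ("TELayerNormColumnParallelLinear" if use_te
                      else "ColumnParallelLinear")
    else:
        # MoE: a standalone FusedLayerNorm runs before expert routing,
        # and the experts' MLP uses a plain (non-norm-fused) linear,
        # so again only one LayerNorm runs.
        pre_mlp_layernorm = "FusedLayerNorm"
        linear_fc1 = "ColumnParallelLinear"
    return pre_mlp_layernorm, linear_fc1


# Dense + TE: norm fused into fc1, pre-MLP norm is an identity.
print(pick_mlp_modules(None, True))
# MoE: standalone norm, plain fc1 inside the experts.
print(pick_mlp_modules(8, True))
```

In both branches the invariant holds: the norm appears in exactly one place, either fused into `fc1` or as a standalone `FusedLayerNorm`, never both.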
In megatron/core/models/gpt/gpt_layer_specs.py, line 95 reads `linear_fc1=TELayerNormColumnParallelLinear if use_te else ColumnParallelLinear`. Why is `TELayerNormColumnParallelLinear` used here instead of `TEColumnParallelLinear`, given that `TEColumnParallelLinear` should be the direct TE counterpart of `ColumnParallelLinear`?