NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

No pre-norm for non-MoE GPT-style model when using the TE transformer layer spec? #990

Open hityupeng opened 1 month ago

hityupeng commented 1 month ago

I was trying to pretrain a non-MoE GPT-style model, such as Llama, and found that it uses the TE transformer layer spec. But in the function "get_gpt_layer_with_transformer_engine_spec" there appears to be no "input_layernorm" set up for the pre-normalization of the transformer layer, and "pre_mlp_layernorm" is only set when "num_experts" is given. Both are set in "get_gpt_layer_local_spec", and both are needed for Llama. So I am confused: why is there no pre-norm setup in the TE transformer layer spec? Is this a bug, or is it set somewhere else?
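
For concreteness, this is roughly how I am comparing the two specs, using the default (dense, non-MoE) arguments; the snippet assumes megatron-core and Transformer Engine are installed, and the exact defaults may differ between Megatron-LM versions:

```python
# Inspect what each GPT layer spec builder wires up for the dense (non-MoE) case.
# Assumes megatron-core and Transformer Engine are installed; defaults may
# differ slightly between Megatron-LM versions.
from megatron.core.models.gpt.gpt_layer_specs import (
    get_gpt_layer_local_spec,
    get_gpt_layer_with_transformer_engine_spec,
)

te_spec = get_gpt_layer_with_transformer_engine_spec()
local_spec = get_gpt_layer_local_spec()

# TE spec: input_layernorm / pre_mlp_layernorm are left as IdentityOp, and the
# QKV / fc1 projections are TELayerNormColumnParallelLinear.
print(te_spec.submodules.input_layernorm)
print(te_spec.submodules.self_attention.submodules.linear_qkv)
print(te_spec.submodules.pre_mlp_layernorm)
print(te_spec.submodules.mlp.submodules.linear_fc1)

# Local spec: explicit norms plus plain ColumnParallelLinear projections.
print(local_spec.submodules.input_layernorm)
print(local_spec.submodules.self_attention.submodules.linear_qkv)
print(local_spec.submodules.pre_mlp_layernorm)
print(local_spec.submodules.mlp.submodules.linear_fc1)
```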

ethanhe42 commented 1 month ago

pre-norm is fused with fc1

https://github.com/NVIDIA/Megatron-LM/blob/203b463689bd322eb915afb3e4d1076bcc4783ba/megatron/core/models/gpt/gpt_layer_specs.py#L119C28-L119C59
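
That is, TELayerNormColumnParallelLinear applies the layer norm to its input before the GEMM, so the spec can leave "input_layernorm" and "pre_mlp_layernorm" as IdentityOp on the dense path. A minimal plain-PyTorch sketch of the idea (the `LayerNormThenLinear` class below is a hypothetical stand-in for the fused TE module, not its actual implementation):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)


class LayerNormThenLinear(nn.Module):
    """Stand-in for a fused norm + linear module, as in the TE spec.

    The real TE kernel fuses the two ops for speed; mathematically it is just
    a layer norm followed by the GEMM, which is all this sketch computes.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.norm = nn.LayerNorm(in_features)
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(self.norm(x))


hidden, ffn = 8, 32
x = torch.randn(4, hidden)

fused_fc1 = LayerNormThenLinear(hidden, ffn)

# "Unfused" reference path: an explicit pre-norm followed by a plain linear,
# with the parameters copied over so both paths compute the same function.
explicit_norm = nn.LayerNorm(hidden)
plain_fc1 = nn.Linear(hidden, ffn)
explicit_norm.load_state_dict(fused_fc1.norm.state_dict())
plain_fc1.load_state_dict(fused_fc1.linear.state_dict())

assert torch.allclose(fused_fc1(x), plain_fc1(explicit_norm(x)))
# Adding another input_layernorm in front of the fused module would therefore
# normalize the hidden states twice.
print("pre-norm already happens inside the fused linear module")
```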

hityupeng commented 1 month ago

> pre-norm is fused with fc1
>
> https://github.com/NVIDIA/Megatron-LM/blob/203b463689bd322eb915afb3e4d1076bcc4783ba/megatron/core/models/gpt/gpt_layer_specs.py#L119C28-L119C59

Got it, thanks