I was trying to run pretraining for a non-MoE GPT-style model, like Llama, and found it was using the TE transformer layer spec.
But in the function "get_gpt_layer_with_transformer_engine_spec" it looks like there is no "input_layernorm" set up for the pre-normalization of the transformer layer.
Besides, "pre_mlp_layernorm" is only set on the condition of "num_experts"? This is roughly the part I am looking at:
But both of them are set in "get_gpt_layer_local_spec" (roughly as below), and both are needed for Llama...
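(Again paraphrased from memory, so the exact code may differ; the point is that the local spec sets both norms explicitly.)

```python
# Paraphrased sketch of the local spec from the same file -- not an exact copy.
def get_gpt_layer_local_spec(num_experts=None, moe_grouped_gemm=False):
    mlp = _get_mlp_module_spec(
        use_te=False, num_experts=num_experts, moe_grouped_gemm=moe_grouped_gemm
    )
    return ModuleSpec(
        module=TransformerLayer,
        submodules=TransformerLayerSubmodules(
            input_layernorm=FusedLayerNorm,      # pre-attention norm set explicitly
            self_attention=ModuleSpec(
                module=SelfAttention,
                params={"attn_mask_type": AttnMaskType.causal},
                submodules=SelfAttentionSubmodules(
                    linear_qkv=ColumnParallelLinear,
                    core_attention=DotProductAttention,
                    linear_proj=RowParallelLinear,
                ),
            ),
            self_attn_bda=get_bias_dropout_add,
            pre_mlp_layernorm=FusedLayerNorm,    # pre-MLP norm set unconditionally
            mlp=mlp,
            mlp_bda=get_bias_dropout_add,
        ),
    )
```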
So I am confused: why is there no pre-norm set up in the TE transformer layer spec? Is this a bug, or is it set somewhere else?