If I understand correctly, in the Mixtral model, layer normalization is applied after self-attention, and the resulting hidden states are then forwarded to the experts. In Megatron-LM's MoE implementation, however, layer normalization is applied before each expert MLP layer. Is this correct?
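To make the ordering I mean concrete, here is a minimal toy sketch (not the actual Mixtral or Megatron-LM code; the class and parameter names are made up): the post-attention norm is applied once, and the same normalized hidden states are routed to the selected experts, with the expert output added back to the un-normalized residual stream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyExpert(nn.Module):
    """Toy feed-forward expert (stand-in for an expert MLP)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


class TinyMoELayer(nn.Module):
    """Pre-norm residual MoE block illustrating the ordering described above:
    normalize the post-attention hidden states once, route the normalized
    tokens to the top-k experts, then add the weighted expert outputs back
    to the residual stream."""
    def __init__(self, d_model=32, d_ff=64, num_experts=4, top_k=2):
        super().__init__()
        # Mixtral uses RMSNorm; plain LayerNorm keeps this sketch short.
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            TinyExpert(d_model, d_ff) for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, hidden_states):             # (batch, seq, d_model), post-attention
        residual = hidden_states
        x = self.norm(hidden_states)              # single norm, shared by all experts
        tokens = x.reshape(-1, x.size(-1))        # (num_tokens, d_model)

        logits = self.gate(tokens)                # (num_tokens, num_experts)
        weights, chosen = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # normalize over the selected experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_idx, slot = torch.where(chosen == e)   # tokens routed to expert e
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])

        return residual + out.reshape_as(hidden_states)


# quick smoke test
layer = TinyMoELayer()
print(layer(torch.randn(2, 5, 32)).shape)         # torch.Size([2, 5, 32])
```

My question is whether Megatron-LM's placement (a norm before each expert MLP) is intended to be equivalent to this, or whether it is a genuinely different arrangement.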