If I understand correctly, in the Mixtral model, layer normalization is applied after self-attention, and the resulting hidden states are then forwarded to the experts. In Megatron-LM's MoE implementation, however, layer normalization is applied before each expert MLP layer. Is this correct?
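To make the ordering I mean concrete, here is a minimal toy sketch (not the actual Mixtral or Megatron-LM code; the class and parameter names are made up): the post-attention norm is applied once, and the same normalized hidden states are routed to the selected experts, with the expert output added back to the un-normalized residual stream.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyExpert(nn.Module):
    """Toy feed-forward expert (stand-in for an expert MLP)."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff)
        self.down = nn.Linear(d_ff, d_model)

    def forward(self, x):
        return self.down(F.gelu(self.up(x)))


class TinyMoELayer(nn.Module):
    """Pre-norm residual MoE block illustrating the ordering described above:
    normalize the post-attention hidden states once, route the normalized
    tokens to the top-k experts, then add the weighted expert outputs back
    to the residual stream."""
    def __init__(self, d_model=32, d_ff=64, num_experts=4, top_k=2):
        super().__init__()
        # Mixtral uses RMSNorm; plain LayerNorm keeps this sketch short.
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            TinyExpert(d_model, d_ff) for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, hidden_states):             # (batch, seq, d_model), post-attention
        residual = hidden_states
        x = self.norm(hidden_states)              # single norm, shared by all experts
        tokens = x.reshape(-1, x.size(-1))        # (num_tokens, d_model)

        logits = self.gate(tokens)                # (num_tokens, num_experts)
        weights, chosen = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)      # normalize over the selected experts

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_idx, slot = torch.where(chosen == e)   # tokens routed to expert e
            if token_idx.numel() == 0:
                continue
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])

        return residual + out.reshape_as(hidden_states)


# quick smoke test
layer = TinyMoELayer()
print(layer(torch.randn(2, 5, 32)).shape)         # torch.Size([2, 5, 32])
```

My question is whether Megatron-LM's placement (a norm before each expert MLP) is intended to be equivalent to this, or whether it is a genuinely different arrangement.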