Here is some additional information:
The figure below shows validation loss curves for 125M MoE and dense models trained with megatron_125M_k1_e16_moe_3e-4_config.txt (maxLR ∈ {1e-4, 3e-4, 6e-4, 9e-4}) and megatron_dense_125M_config.txt, respectively. We observe that the MoE models underperform the dense models, contrary to results from the literature. As mentioned above, this suggests that there is a bug in Megatron-LM.
Moreover, below is a plot directly comparing the training loss of dense and MoE models in Megatron and GPT-NeoX trained using GBS=768, SL=2048, E=16 (total experts), K=1 (active experts). All models are trained on the same dataset with the same linear warmup + cosine annealing LR schedule (maxLR 3e-4 to minLR 3e-5; sketched below). We observe that the GPT-NeoX implementation produces results in line with the literature (e.g., Switch Transformer, Figure 1 right), while the Megatron implementation does not.
This suggests there is a bug in Megatron-LM. @jaredcasper @duncanriach @jon-barker
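For clarity, here is a minimal sketch of the linear warmup + cosine annealing schedule described above. The warmup and total step counts are placeholders, not values taken from the attached configs; only the max/min LR values come from the comment.

```python
import math

def lr_at_step(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=1000, total_steps=20000):
    """Linear warmup to max_lr, then cosine annealing down to min_lr.

    warmup_steps/total_steps are illustrative placeholders; only max_lr and
    min_lr come from the configs described above.
    """
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr over warmup_steps.
        return max_lr * step / max(warmup_steps, 1)
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```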
Here is the validation loss plot for more MoE configs, again with varying LRs, all of which underperform the dense model.
Personally, I think we should change the horizontal axis to FLOPs and then compare the loss.
These MoEs are all K=1, so they are already FLOPs-matched; in other words, the plots would look the same if we changed the horizontal axis to FLOPs.
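To spell out the FLOP-matching argument: with top-1 routing, each token is processed by exactly one expert FFN of the same shape as the dense FFN, so the active per-token compute is independent of the total number of experts. A rough sketch (the hidden sizes below are assumed typical 125M-class GPT dimensions, not read from the attached configs):

```python
# Rough per-token forward FLOPs for the FFN block, illustrating why a top-1
# MoE is FLOP-matched to its dense counterpart. Sizes are assumptions.
d_model, d_ff = 768, 3072

# Dense FFN: two matmuls (d_model -> d_ff -> d_model), ~2 FLOPs per MAC.
dense_ffn_flops = 2 * (d_model * d_ff) + 2 * (d_ff * d_model)

top_k, num_experts = 1, 16
# With top-k routing, each token passes through top_k expert FFNs of the same
# shape as the dense FFN, so active FLOPs scale with top_k, not num_experts.
moe_ffn_flops = top_k * dense_ffn_flops

print(dense_ffn_flops, moe_ffn_flops)  # equal when top_k == 1
```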
Running the same config with Megatron-DeepSpeed does result in the MoE outperforming the dense model. This was run with 8 experts, topk=1 and a 125M base model.
Thank you for reporting the issue! We will investigate it and get back to you soon.
Thank you @yanring !
Hi @kiddyboots216 @bentherien ,
We have done some investigations and discovered that the issue specifically pertains to top-1 selection, and the root cause is the ordering of softmax and top-k. In short, we should apply softmax before selecting the top-k when k equals 1, since performing softmax on a [num_tokens, 1] tensor yields a constant 1.0 and therefore a gradient of 0. Below are our experiments and code changes:
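(The actual experiments and code changes referenced above are not reproduced here.) For illustration only, a minimal toy sketch of the two gating orders, assuming a PyTorch-style router, shows why softmax-after-top-k degenerates at k=1 and why applying softmax first restores a gradient to the router:

```python
import torch
import torch.nn.functional as F

def route(logits: torch.Tensor, top_k: int, softmax_first: bool):
    """Toy router gating; logits is [num_tokens, num_experts].

    softmax_first=False reproduces the problematic order for top_k == 1:
    softmax over a [num_tokens, 1] slice is identically 1.0, so the gate
    contributes zero gradient to the router weights.
    """
    if softmax_first:
        probs = F.softmax(logits, dim=-1)              # softmax over all experts
        gates, indices = torch.topk(probs, top_k, dim=-1)
    else:
        top_logits, indices = torch.topk(logits, top_k, dim=-1)
        gates = F.softmax(top_logits, dim=-1)          # degenerate when top_k == 1
    return gates, indices

logits = torch.randn(4, 16, requires_grad=True)

gates, _ = route(logits, top_k=1, softmax_first=False)
gates.sum().backward()
print(logits.grad.abs().sum())  # 0.0 -- router receives no signal

logits.grad = None
gates, _ = route(logits, top_k=1, softmax_first=True)
gates.sum().backward()
print(logits.grad.abs().sum())  # > 0 -- gradient flows through the gate
```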
Thanks @yanring . I did observe that with top-k=2, the results were better than top-k=1 by a significantly larger margin than the literature suggests.
Describe the regression
In the forks of Megatron-LM used by gpt-neox and megatron-deepspeed, MoEs are obtaining lower loss than they are in Megatron-LM with the same configuration.

To Reproduce
Attached to this issue are config files to reproduce the exact MoE we are running. megatron_125M_k1_e16_moe_3e-4_config.sh is the MoE config for megatron-lm, megatron_dense_125M_config.sh is the dense config for megatron-lm, and gpt-neox_e16_k1_config.yaml is the MoE config for gpt-neox. All models are gpt-style.
Attachments: megatron_dense_125M_config.txt, megatron_125M_k1_e16_moe_3e-4_config.txt, gpt-neox_e16_k1_config.txt

Previous performance
After step 12000 in gpt-neox the MoE has training loss 2.452.

New performance
After step 12000 in megatron-lm the MoE has training loss 2.649, which is the same as the dense model.

Stack trace/logs
The logs are attached to this issue.
Environment (please complete the following information):
Megatron-LM commit: e33c8f78a35765d5aa37475a144da60e8a2349d1
Proposed fix
No proposed fix.
Additional context
Presumably a bug was introduced in the MoE training. However, I looked into the gpt code in mcore.models and was unable to find any potential causes.