allenai / OLMoE

OLMoE: Open Mixture-of-Experts Language Models
https://arxiv.org/abs/2409.02060
Apache License 2.0

Early loss divergence for upcycling #15

Open yazdayy opened 1 week ago

yazdayy commented 1 week ago

Hi, thanks for the great work and for sharing your wandb training logs! After analysing the plots, I have some questions regarding the upcycling experiment done for OLMoE and would greatly appreciate it if you could answer them in any capacity:

I observed that the training loss for the upcycled OLMoE increased over the first 5k steps (~20B tokens) and did not recover to the early loss value of 2.25 (seen at step 300) until around 120k steps. May I ask what peak learning rate was used for training the upcycled OLMoE, and whether any other experiments were run to try to mitigate this early loss divergence?

[wandb plot: training loss of the upcycled OLMoE over training steps]

Thanks!

Muennighoff commented 1 week ago

Thanks! I think you can see the peak learning rate by clicking on the model, then on Overview, and looking at the config parameters. It should be the lr value under optimizer.
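For anyone else looking this up, here is a rough sketch of pulling the same value programmatically via the wandb API instead of the UI. The run path and the exact config key nesting are placeholders, not the actual OLMoE run details:

```python
import wandb

# Placeholder run path; substitute the entity/project/run id of the
# public OLMoE wandb run you are inspecting.
api = wandb.Api()
run = api.run("entity/project/run_id")

# The peak LR should sit under the optimizer section of the run config;
# the key name "learning_rate" here is an assumption and may differ.
print(run.config["optimizer"]["learning_rate"])
```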

yazdayy commented 1 week ago

Thanks for your reply! I have checked and it seems that the peak learning rate of the upcycled OLMoE is 4e-4, which is the same as the peak learning rate of the dense model used for upcycling (OLMo-1B).

However, this setting differs from the sparse upcycling paper, which recommends using the minimum learning rate of the dense model as the peak learning rate of the upcycled MoE: [screenshot of the relevant recommendation from the sparse upcycling paper]

The paper also noted that upcycling with a higher learning rate may cause instability: [screenshot of the relevant passage from the paper]
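For concreteness, a back-of-the-envelope sketch of what that recommendation would imply here. Only the 4e-4 peak learning rate of the dense OLMo-1B run comes from the discussion above; the final-LR fraction of the cosine schedule (alpha_f) is an assumed value:

```python
# Only dense_peak_lr is taken from the discussion above; alpha_f is assumed.
dense_peak_lr = 4e-4
alpha_f = 0.1                           # assumed cosine-decay floor as a fraction of peak
dense_min_lr = dense_peak_lr * alpha_f  # LR at the end of the dense run: 4e-5

# Sparse-upcycling recipe: reuse the dense model's minimum LR
# as the peak LR when continuing training of the upcycled MoE.
moe_peak_lr = dense_min_lr
print(f"upcycled-MoE peak LR under this recipe: {moe_peak_lr:.1e}")
```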

May I ask whether the OLMoE team has conducted upcycling experiments with a setting similar to the sparse upcycling paper?

Thanks!

Muennighoff commented 1 week ago

Those comparisons may not be reliable, as they use a different optimizer (Adafactor) and a different LR schedule (not cosine). It depends on what you mean by a similar setting - the above are two key differences. Others include encoder-decoder vs. decoder-only models, expert choice, the number of experts, etc.
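For illustration, a rough sketch of the two schedule shapes being compared. All step counts and the decay floor below are placeholder values, not the actual settings of either setup:

```python
import math

def cosine_with_warmup(step, peak_lr, warmup=2_000, total=500_000, alpha_f=0.1):
    """Cosine decay with linear warmup (roughly the OLMo-style shape; the
    warmup length, total steps and floor here are placeholders)."""
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return peak_lr * (alpha_f + (1 - alpha_f) * 0.5 * (1 + math.cos(math.pi * progress)))

def inverse_sqrt(step, peak_lr, warmup=10_000):
    """Inverse square-root decay, the schedule commonly paired with Adafactor
    in T5-style setups (assumed shape, not the exact one from the paper)."""
    return peak_lr * math.sqrt(warmup / max(step, warmup))

# The two shapes diverge long before the end of training, which is one reason
# LR comparisons across the two setups are hard to interpret.
for step in (1_000, 50_000, 250_000, 500_000):
    print(step, cosine_with_warmup(step, 4e-4), inverse_sqrt(step, 4e-4))
```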

yazdayy commented 1 week ago

Sorry for the lack of clarity! I am mainly interested in whether the OLMoE team has run upcycling experiments with lower learning rates (using the minimum learning rate of the dense model as the peak learning rate for training the upcycled MoE), and whether you observed a different outcome in training or performance at those lower learning rates.

Muennighoff commented 1 week ago

We didn't ablate changing the learning rate during upcycling