Open yazdayy opened 1 week ago
Thanks! I think you can find the peak learning rate by clicking on the model, opening Overview, and checking the config parameters; it should be the `lr` value under `optimizer`.
Thanks for your reply! I have checked and it seems that the peak learning rate of the upcycled OLMoE is 4e-4, which is the same as the peak learning rate of the dense model used for upcycling (OLMo-1B).
However, this setting differs from the sparse upcycling paper, which recommended using the minimum learning rate of the dense model as the peak learning rate of the upcycled MoE.
The paper also noted that upcycling with a higher learning rate may cause instability.
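For concreteness, here is a minimal sketch of the two settings. The 4e-4 peak comes from the wandb config; the cosine helper and the assumption that the dense schedule decays to 10% of peak are mine, purely for illustration:

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr):
    """Cosine decay from peak_lr (step 0) to min_lr (final step)."""
    progress = step / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Dense model's schedule (4e-4 peak from the wandb config;
# the 4e-5 floor is an assumed 10%-of-peak minimum).
dense_peak, dense_min = 4e-4, 4e-5

# OLMoE's reported upcycling setting: reuse the dense peak.
upcycled_peak_olmoe = dense_peak      # 4e-4

# Sparse upcycling paper's recommendation: start the upcycled MoE
# from the dense model's *minimum* learning rate.
upcycled_peak_paper = dense_min       # 4e-5
```

Under this assumed schedule the two settings differ by 10x at the start of upcycled training, which is the gap the paper's instability warning is about.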
May I ask whether the OLMoE team has conducted upcycling experiments with a setting similar to the sparse upcycling paper's?
Thanks!
Those comparisons may not be reliable, as they use a different optimizer (Adafactor) and a different LR schedule (not cosine). It depends on what you mean by a similar setting; the above are two key differences. Others include encoder-decoder vs. decoder-only models, expert-choice routing, the number of experts, etc.
Sorry for the lack of clarity! I am mainly interested in whether the OLMoE team has conducted upcycling experiments with lower learning rates (using the minimum learning rate of the dense model as the peak learning rate for training the upcycled MoE), and whether you observed a different outcome in training or performance at those lower learning rates.
We didn't ablate changing the learning rate during upcycling.
Hi, thanks for the great work and for sharing your wandb training logs! After analysing the plots, I have some questions about the upcycling experiment done for OLMoE and would greatly appreciate it if you could answer them in any capacity:
I observed that the training loss for the upcycled OLMoE increased over the first 5k steps (~20B tokens) and did not recover to its early value (2.25 at step 300) until around 120k steps. May I ask what peak learning rate was used for training the upcycled OLMoE? And were any other experiments tried to mitigate this early loss divergence?
Thanks!