BUTSpeechFIT / EEND


Question About How to Adjust the Noam Scheduler When Varying the Batch Size #8

Closed SoundingSilence closed 9 months ago

SoundingSilence commented 11 months ago

Thank you for your pytorch implementation of EDA-EEND. In the default setting, train_batchsize=32 and noam_warmup_steps=200000. However, if I want to increase the batch size, e.g. train_batchsize=256, what value should noam_warmup_steps be set to? In the Noam scheduler, the peak LR depends on d_model and warmup_steps. Even if train_batchsize is increased, the learning-rate schedule should stay consistent in terms of training epochs.
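For reference, here is a minimal sketch of the Noam schedule I mean (the standard formula from the Transformer paper; d_model=256 is just a placeholder value, not necessarily this repo's setting):

```python
def noam_lr(step: int, d_model: int, warmup_steps: int, base_lr: float = 1.0) -> float:
    """lr(step) = base_lr * d_model**-0.5 * min(step**-0.5, step * warmup_steps**-1.5)."""
    step = max(step, 1)
    return base_lr * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak is reached at step == warmup_steps, where
#   lr_peak = base_lr * (d_model * warmup_steps) ** -0.5
print(noam_lr(200_000, d_model=256, warmup_steps=200_000))
```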

Furthermore, if I want to use the DDP framework, how should the Noam scheduler be adjusted as the batch size varies?

I am looking forward to your reply. Thanks!

fnlandini commented 11 months ago

Hi @SoundingSilence, sorry for the delay. You are right. If you want to keep the same maximum learning rate, you'd need to update the model size in Noam. I have not explored such high batch sizes, so I am not sure about the behavior with this model. Theoretically, if I'm not mistaken, in order to keep the same maximum learning rate when you increase the batch size (or effective batch size, in the case of DDP) by 8, you should reduce the warmup steps to one eighth and increase the Noam model size 8 times to replicate the same behavior. But even so, the training might not be exactly the same. My suggestion is that you try this configuration first but do not rule out trying similar configurations. For example, a slightly higher learning rate might work the same or better, or running the warm-up for longer or shorter might work similarly.
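As a sanity check, a small sketch of that scaling (assuming the standard Noam formula; d_model=256 is only an illustrative baseline, not a value taken from this repo): the peak LR is (d_model * warmup_steps)**-0.5, so dividing the warmup by 8 while multiplying the model size by 8 leaves the peak unchanged, it just arrives 8x earlier in optimizer steps.

```python
def noam_lr(step: int, d_model: int, warmup_steps: int) -> float:
    # Standard Noam schedule: peak at step == warmup_steps.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# Baseline config (batch size 32, d_model assumed 256 for illustration).
base_peak = noam_lr(200_000, d_model=256, warmup_steps=200_000)

# 8x larger (effective) batch: warmup / 8, "Noam model size" * 8.
scaled_peak = noam_lr(25_000, d_model=256 * 8, warmup_steps=25_000)

print(base_peak, scaled_peak)  # identical peak learning rates
```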

I hope this helps.

SoundingSilence commented 10 months ago

Thanks for your reply. During my experiments, I find that EEND-based models are really sensitive to the training hyperparameters (such as the number of epochs, warm-up steps, learning rate, scheduler, ...) and the final loss varies accordingly. What is your take on this?

fnlandini commented 10 months ago

Yes, the training hyperparameters can have quite some influence. However, I think that applies to almost any NN-based model nowadays. I am afraid I do not have a better answer than what I wrote above on which values to choose if you change some of them. In practice, choosing a very good configuration still involves some trial and error...

fnlandini commented 9 months ago

Closing due to inactivity