For the Noam scheduler, the configured LR is a scalar multiplier applied to the LR computed by the Noam schedule, NOT the actual LR itself.
That is why you will note that the config had high values such as 2 and 5 for the LR - it scales the Noam LR by 2x or 5x.
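For concreteness, here is a minimal sketch of the Noam schedule with the configured LR acting as a multiplier; the d_model and warmup_steps values below are illustrative assumptions, not taken from any particular config:

```python
# Noam/transformer LR schedule: the configured "lr" only scales this base curve.
# d_model=512 and warmup_steps=10000 are illustrative assumptions.
def noam_lr(step: int, lr_mult: float, d_model: int = 512, warmup_steps: int = 10000) -> float:
    step = max(step, 1)
    return lr_mult * (d_model ** -0.5) * min(step ** -0.5, step * warmup_steps ** -1.5)

print(noam_lr(1000, lr_mult=2.0))  # "lr: 2.0" in the config -> 2x the base curve
print(noam_lr(1000, lr_mult=5.0))  # "lr: 5.0" -> 5x the same base curve
```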
Aha, thanks for the explanation. In this case, we should use a much higher learning rate for fine-tuning.
Thanks again.
@titu1994 so if the pretrained model has LR 2.0, what should be the optimal value for finetuning?
We normally fine-tune with 1/5 to 1/10 the initial learning rate when the tokenizer is not changed (language and vocab remain the same) and only domain of speech has shifted.
If you are replacing the decoder, or training on another language, you should use the pretraining LR itself, and just use the loaded checkpoint as a good initialization.
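A minimal sketch of applying that 1/5 to 1/10 rule through NeMo's optimizer setup; the model name, optimizer, and scheduler values below are placeholders, not recommendations:

```python
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# Load a pretrained checkpoint as initialization (model name is an example).
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

pretrain_lr = 2.0                 # Noam multiplier used for pretraining
finetune_lr = pretrain_lr / 10.0  # 1/5 to 1/10 of the pretraining value

optim_cfg = OmegaConf.create({
    "name": "adamw",
    "lr": finetune_lr,
    "betas": [0.9, 0.98],
    "weight_decay": 1e-3,
    "sched": {"name": "NoamAnnealing", "d_model": 512, "warmup_steps": 10000, "min_lr": 1e-6},
})
asr_model.setup_optimization(optim_config=optim_cfg)
```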
Do we need to use the scheduler while training? It's never used in the ASR_CTC_Language_Finetuning notebook.
To use the scheduler, asr_model.set_trainer(trainer) needs to be called before the data setup and optimizer setup steps, which is not the case in the notebooks.
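In other words, the call order would be something like the following sketch (the config path, model name, and cfg keys are placeholders, not exact code from the notebooks):

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

cfg = OmegaConf.load("conformer_ctc_finetune.yaml")  # hypothetical config file

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=cfg.trainer.max_epochs)
asr_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained("stt_en_conformer_ctc_large")

# 1) Attach the trainer first so the scheduler can resolve the number of training steps.
asr_model.set_trainer(trainer)
# 2) Then the data setup.
asr_model.setup_training_data(cfg.model.train_ds)
asr_model.setup_validation_data(cfg.model.validation_ds)
# 3) Then the optimizer + scheduler setup.
asr_model.setup_optimization(cfg.model.optim)

trainer.fit(asr_model)
```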
My use case is the second type, where I replace the tokenizer (but keep the same language).
The following params give me NaN loss after about 40% of epoch 0 (there is no problem with the data, I checked). At the point where the loss becomes NaN, there is a rapid change in the learning rate from the Noam scheduler.
learning rate = 1.0, batch_size = 64 (with grad accum = 4), warm_up_steps = 10000, max_epochs = 100, total_steps > 100,000
But the following params have worked fine, still using the same scheduler settings.
learning rate = 0.1, batch_size = 128 (with grad accum = 4), warm_up_steps = 10000, max_epochs = 100, total_steps > 100,000
But I feel that the scheduler has reduced the LR too much and now the model is not learning. Are there any methods to verify this?
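One way to check (a sketch, not the notebook's code) is to log the per-step LR with Lightning's LearningRateMonitor and inspect the resulting curve in wandb/TensorBoard; the trainer arguments here are placeholders:

```python
import pytorch_lightning as pl
from pytorch_lightning.callbacks import LearningRateMonitor

# Log the LR at every step so the Noam decay is visible alongside the loss curve.
trainer = pl.Trainer(
    devices=1,
    accelerator="gpu",
    max_epochs=100,
    callbacks=[LearningRateMonitor(logging_interval="step")],
)

# The applied LR can also be read directly from the optimizer at any point:
# print(trainer.optimizers[0].param_groups[0]["lr"])
```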
We use the default scheduler inside the ASR CTC finetuning notebook, and yes, it is always required since most models will not converge otherwise. A higher batch size gives more stable updates, but it does not guarantee you won't get NaNs. I assume you are using mixed precision - that is not advised with Conformers. A high LR + fp16 will cause overflow in the attention matrix and NaN gradients.
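If AMP is the culprit, one option (a sketch only; the other flags are placeholders) is to run the Conformer in full precision via the Lightning trainer flag:

```python
import pytorch_lightning as pl

# Train in fp32 instead of fp16 mixed precision to avoid attention overflow.
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=100, precision=32)
```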
Please move this question to its own issue; it is not relevant to the original thread.
OK, I opened an issue here: https://github.com/NVIDIA/NeMo/issues/4183
Hello,
I am fine-tuning the Conformer model, and I have noticed that the LR defined in the YAML file is not used; a lower LR is actually used during training.
The defined LR is 0.005, but the actual LR reported on wandb is around 0.00000206. The warm-up is applied correctly; only the maximum LR is not correct during training.
Are you performing any kind of change to the LR based on the number of GPUs or accumulate_grad_batches?