The optimizer has a list, `param_groups`. You can modify the optimizer's learning rate through it:

```python
optimizer.param_groups[0]['lr'] *= min(1.0, (step+1) / warmup_period)
```
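For context, here is a minimal, self-contained sketch of how such a manual linear warmup could sit in a training loop. The model, `base_lr`, and `warmup_period` values are placeholders, and the decay schedule is omitted for brevity:

```python
import torch

model = torch.nn.Linear(10, 2)                     # placeholder model
base_lr = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
warmup_period = 100                                # hypothetical warmup length

for step in range(1000):
    # ... forward / backward pass would go here ...
    optimizer.step()
    # Linearly warm the learning rate up towards base_lr.
    warmup_factor = min(1.0, (step + 1) / warmup_period)
    for group in optimizer.param_groups:
        group['lr'] = base_lr * warmup_factor
    optimizer.zero_grad()
```

Setting the learning rate from `base_lr` each step (rather than multiplying in place) avoids compounding the warmup factor when no scheduler re-sets the learning rate beforehand.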
Hugging Face's scheduler first warms up the learning rate gradually from 0 to base_lr, and then decays it. My approach multiplies the learning rate by the warmup factor. So, if the same decay schedule is employed, the resulting total learning rate schedules are different.
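To make the difference concrete, here is a small illustration (not code from either library) comparing the two styles of schedule under the same cosine decay; the step counts, base learning rate, and decay function below are assumptions for the comparison:

```python
import math

total_steps, warmup_steps, base_lr = 1000, 100, 1e-3

def warmup_then_decay_lr(step):
    """Hugging Face style: warm up from 0 to base_lr, then cosine-decay over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def warmup_factor_lr(step):
    """Multiplicative style: cosine-decay over all steps, multiplied by a linear warmup factor."""
    decayed = base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return decayed * min(1.0, (step + 1) / warmup_steps)

for step in (0, 50, 100, 500, 999):
    print(step, warmup_then_decay_lr(step), warmup_factor_lr(step))
```

After the warmup phase the two curves no longer coincide, because the decay is stretched over fewer steps in the first style but runs over the whole training in the second.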
Hi Tony-Y, so if I understand correctly, your library applies the learning rate decay starting from the initial learning rate (specified in the optimizer constructor) until "warmup_period" is reached, right?
I had understood from the paper https://arxiv.org/pdf/1910.04209.pdf that it first increases the learning rate iteratively (based on "warmup_period"), and then performs the regular learning rate decay with the specified lr_scheduler.
Thank you.
If so, when do we start the learning rate decay for the Adam case? I think it is preferable to use the same decay schedule for all cases.
Can I implement what you did here but using Hugging Face? What is the difference between what you did and what Hugging Face provides?
https://huggingface.co/transformers/main_classes/optimizer_schedules.html?highlight=cosine#transformers.get_cosine_schedule_with_warmup
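For reference, here is a minimal sketch of how the linked Hugging Face scheduler is typically wired into a training loop; the model, learning rate, and step counts are placeholders:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 2)    # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,      # linear warmup from 0 up to the optimizer's lr
    num_training_steps=1000,   # cosine decay over the remaining steps
)

for step in range(1000):
    # ... forward / backward pass would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```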