The optimizer has a list, `param_groups`. You can modify the optimizer's learning rate through it:

```python
optimizer.param_groups[0]['lr'] *= min(1.0, (step+1) / warmup_period)
```
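For context, here is a minimal, self-contained sketch of how such a manual linear warmup could sit in a training loop. The model, `base_lr`, and `warmup_period` values are placeholders, and the decay schedule is omitted for brevity:

```python
import torch

model = torch.nn.Linear(10, 2)                     # placeholder model
base_lr = 1e-3
optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
warmup_period = 100                                # hypothetical warmup length

for step in range(1000):
    # ... forward / backward pass would go here ...
    optimizer.step()
    # Linearly warm the learning rate up towards base_lr.
    warmup_factor = min(1.0, (step + 1) / warmup_period)
    for group in optimizer.param_groups:
        group['lr'] = base_lr * warmup_factor
    optimizer.zero_grad()
```

Setting the learning rate from `base_lr` each step (rather than multiplying in place) avoids compounding the warmup factor when no scheduler re-sets the learning rate beforehand.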
Hugging Face's scheduler first warms up the learning rate gradually from 0 to base_lr, and then decays it. My approach multiplies the learning rate by the warmup factor. So, if the same decay schedule is employed, the resulting total learning rate schedules are different.
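To make the difference concrete, here is a small illustration (not code from either library) comparing the two styles of schedule under the same cosine decay; the step counts, base learning rate, and decay function below are assumptions for the comparison:

```python
import math

total_steps, warmup_steps, base_lr = 1000, 100, 1e-3

def warmup_then_decay_lr(step):
    """Hugging Face style: warm up from 0 to base_lr, then cosine-decay over the remaining steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

def warmup_factor_lr(step):
    """Multiplicative style: cosine-decay over all steps, multiplied by a linear warmup factor."""
    decayed = base_lr * 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return decayed * min(1.0, (step + 1) / warmup_steps)

for step in (0, 50, 100, 500, 999):
    print(step, warmup_then_decay_lr(step), warmup_factor_lr(step))
```

After the warmup phase the two curves no longer coincide, because the decay is stretched over fewer steps in the first style but runs over the whole training in the second.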
Hi Tony-Y, so if I understand correctly, your library applies the learning rate decay starting from the initial learning rate (specified in the optimizer constructor) until "warmup_period" is reached, right?
I had understood from the paper https://arxiv.org/pdf/1910.04209.pdf that it first increases the learning rate iteratively (based on "warmup_period"), and then performs the regular learning rate decay with the specified lr_scheduler.
Thank you.
If so, when do we start the learning rate decay for the Adam case? I think it is preferable to use the same decay schedule for all cases.
Can I implement what you did here but using Hugging Face? What is the difference between what you did and what Hugging Face provides?
https://huggingface.co/transformers/main_classes/optimizer_schedules.html?highlight=cosine#transformers.get_cosine_schedule_with_warmup
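For reference, here is a minimal sketch of how the linked Hugging Face scheduler is typically wired into a training loop; the model, learning rate, and step counts are placeholders:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 2)    # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,      # linear warmup from 0 up to the optimizer's lr
    num_training_steps=1000,   # cosine decay over the remaining steps
)

for step in range(1000):
    # ... forward / backward pass would go here ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```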