Closed brando90 closed 3 years ago
I don't know of any literature on the decay rule of thumb.
But the paper you are reproducing does state one rule of thumb. Is that what you are using?
See Section 4.5 (Optimization) of https://arxiv.org/pdf/1809.10853.pdf for the optimization method used in the paper:
“The learning rate is linearly warmed up from 10−7 to 1 for 16K steps and then annealed using a cosine learning rate schedule with C cycles (Loshchilov & Hutter, 2016). Each cycle runs for twice the number of updates than the previous cycle and we lower the maximum and minimum learning rates by a rate M compared to the previous cycle.”
Loshchilov & Hutter, 2016 https://arxiv.org/pdf/1608.03983.pdf
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.html
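As a rough illustration, the quoted schedule (linear warmup, then cosine cycles that double in length with max/min learning rates lowered by a factor M each cycle) can be sketched in plain Python. All concrete values below (step counts, the decay factor `m`) are placeholders, not values from the paper; note that PyTorch's built-in `CosineAnnealingWarmRestarts` handles the doubling cycles via `T_mult=2` but does not lower the max/min LR per cycle, so the M factor would need a custom schedule like this:

```python
import math

def lr_at(step, warmup_steps=16000, base_lr=1.0, start_lr=1e-7,
          first_cycle=100000, t_mult=2, m=0.5, min_lr=1e-7):
    """Sketch of the quoted schedule: linear warmup from start_lr to
    base_lr, then cosine annealing with warm restarts where each cycle
    runs twice as long as the previous one and the max/min LRs are
    multiplied by m at each restart. Parameter defaults are illustrative."""
    if step < warmup_steps:
        # Linear warmup from start_lr to base_lr.
        return start_lr + (base_lr - start_lr) * step / warmup_steps
    t = step - warmup_steps
    cycle_len, hi, lo = first_cycle, base_lr, min_lr
    while t >= cycle_len:
        t -= cycle_len
        cycle_len *= t_mult   # each cycle runs twice as many updates
        hi *= m               # lower the maximum LR by rate m
        lo *= m               # lower the minimum LR by rate m
    # Cosine annealing within the current cycle (Loshchilov & Hutter, 2016).
    return lo + 0.5 * (hi - lo) * (1 + math.cos(math.pi * t / cycle_len))
```

For example, `lr_at(0)` gives the warmup start LR, `lr_at(16000)` gives the full base LR at the start of the first cosine cycle, and the peak at the second cycle's restart is `m` times lower.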
I see that the original paper https://arxiv.org/abs/1910.04209 gives a rule of thumb, which seems to be for the warm-up. But what about the decay rate?
related: https://github.com/LiyuanLucasLiu/RAdam/issues/66