Closed brando90 closed 3 years ago
I don't know of any literature on the decay rule of thumb.
But the paper you are reproducing does state one rule of thumb. Is that what you are using?
See Section 4.5 (Optimization) of https://arxiv.org/pdf/1809.10853.pdf for the optimization method used in the paper:
“The learning rate is linearly warmed up from 10−7 to 1 for 16K steps and then annealed using a cosine learning rate schedule with C cycles (Loshchilov & Hutter, 2016). Each cycle runs for twice the number of updates than the previous cycle and we lower the maximum and minimum learning rates by a rate M compared to the previous cycle.”
Loshchilov & Hutter, 2016 https://arxiv.org/pdf/1608.03983.pdf
https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.html
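As a rough illustration, the quoted schedule (linear warmup, then cosine cycles that double in length with max/min learning rates lowered by a factor M each cycle) can be sketched in plain Python. All concrete values below (step counts, the decay factor `m`) are placeholders, not values from the paper; note that PyTorch's built-in `CosineAnnealingWarmRestarts` handles the doubling cycles via `T_mult=2` but does not lower the max/min LR per cycle, so the M factor would need a custom schedule like this:

```python
import math

def lr_at(step, warmup_steps=16000, base_lr=1.0, start_lr=1e-7,
          first_cycle=100000, t_mult=2, m=0.5, min_lr=1e-7):
    """Sketch of the quoted schedule: linear warmup from start_lr to
    base_lr, then cosine annealing with warm restarts where each cycle
    runs twice as long as the previous one and the max/min LRs are
    multiplied by m at each restart. Parameter defaults are illustrative."""
    if step < warmup_steps:
        # Linear warmup from start_lr to base_lr.
        return start_lr + (base_lr - start_lr) * step / warmup_steps
    t = step - warmup_steps
    cycle_len, hi, lo = first_cycle, base_lr, min_lr
    while t >= cycle_len:
        t -= cycle_len
        cycle_len *= t_mult   # each cycle runs twice as many updates
        hi *= m               # lower the maximum LR by rate m
        lo *= m               # lower the minimum LR by rate m
    # Cosine annealing within the current cycle (Loshchilov & Hutter, 2016).
    return lo + 0.5 * (hi - lo) * (1 + math.cos(math.pi * t / cycle_len))
```

For example, `lr_at(0)` gives the warmup start LR, `lr_at(16000)` gives the full base LR at the start of the first cosine cycle, and the peak at the second cycle's restart is `m` times lower.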
I see that the original paper https://arxiv.org/abs/1910.04209 gives a rule of thumb, which seems to be for the warm-up. But what about the decay rate?
related: https://github.com/LiyuanLucasLiu/RAdam/issues/66