facebookresearch / mae

PyTorch implementation of MAE https://arxiv.org/abs/2111.06377

I found that both LLaMA and MAE use a smaller beta2 in the AdamW optimizer during pre-training. Is there any intuition behind this setting? #184

Open Novestars opened 7 months ago

alexlioralexli commented 6 months ago

AdamW divides each update by the square root of its running estimate of the gradient's second moment. If this estimate is stale, it can lead to exploding updates (if the estimate is too small) or slow learning (if it is too large). Decreasing beta2 from 0.999 to 0.95 helps address this by keeping the running estimate closer to the current gradient magnitude.
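
For concreteness, here is a minimal sketch (PyTorch; the module, learning rate, and weight decay are illustrative placeholders, not values from this thread) showing where beta2 enters AdamW's second-moment estimate and how a lower value like 0.95 is passed to the optimizer:

```python
import torch

# AdamW keeps an exponential moving average of the squared gradient:
#     v_t = beta2 * v_{t-1} + (1 - beta2) * g_t**2
# and divides the update by sqrt(v_t). A smaller beta2 (0.95 vs. the
# default 0.999) weights recent gradients more heavily, so v_t tracks
# the current gradient scale more closely and a stale estimate is less
# likely to cause exploding updates or overly damped learning.

model = torch.nn.Linear(768, 768)  # placeholder module

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1.5e-4,           # illustrative
    weight_decay=0.05,   # illustrative
    betas=(0.9, 0.95),   # beta2 lowered from the default 0.999
)
```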