AdamW divides each update by the square root of its running estimate of the gradient's second moment. If that estimate is stale, updates can explode (when the estimate is too small) or learning can slow to a crawl (when it is too large). Lowering beta2 from 0.999 to 0.95 helps by shortening the averaging window, so the estimate stays closer to the current squared gradient.
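A minimal sketch of a single AdamW step (plain NumPy, hyperparameter values illustrative) makes the role of beta2 concrete: the second-moment EMA `v` tracks the squared gradient over an effective window of roughly 1/(1-beta2) steps, about 1000 at 0.999 versus about 20 at 0.95.

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update; variable names follow the standard algorithm."""
    # Exponential moving averages of the gradient (first moment) and
    # squared gradient (second moment). beta2 controls how quickly v
    # tracks the current squared gradient: effective window ~ 1/(1 - beta2).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2

    # Bias correction for the zero-initialized moment estimates.
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)

    # The update divides by sqrt(v_hat): a stale, too-small estimate
    # inflates the step; a stale, too-large estimate shrinks it.
    # Weight decay is applied in decoupled (AdamW) form.
    param = param - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v
```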