Thanks for reaching out. I haven't observed this and I'm wondering whether you can provide a simple setup to reproduce this phenomenon.
BTW, there is a known issue that can be fixed by setting degenerated_to_sgd=False
(more discussion can be found at: https://github.com/LiyuanLucasLiu/RAdam/issues/54)
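For reference, the flag is an argument of the RAdam constructor in radam.py; a minimal sketch of how to set it (double-check the signature against your copy of the repo; the toy Linear model is just a stand-in):
import torch
from radam import RAdam  # radam.py from this repository

model = torch.nn.Linear(8, 1)  # stand-in model
opt = RAdam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
            degenerated_to_sgd=False)  # skip the update instead of falling back to an SGD-style step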
I have run into the same issue, trying to implement RAdam. Here's a pure (num)python implementation:
import numpy as np


class RADAM:
    def __init__(self, fg, x0, alpha, beta1=0.9, beta2=0.999):
        """Create a new RADAM optimizer.

        Parameters
        ----------
        fg : callable
            a function which returns (f, g), where f is the scalar cost and
            g is the vector gradient
        x0 : numpy.ndarray
            the parameter vector immediately prior to optimization
        alpha : float
            the step size
        beta1 : float
            the decay rate of the first moment (mean of gradient)
        beta2 : float
            the decay rate of the second moment (uncentered variance)
        """
        self.fg = fg
        self.x0 = x0
        self.alpha = alpha
        self.beta1 = beta1
        self.beta2 = beta2
        self.x = x0.copy()
        self.m = np.zeros_like(x0)
        self.v = np.zeros_like(x0)
        self.eps = np.finfo(x0.dtype).eps
        self.rhoinf = 2 / (1 - beta2) - 1
        self.iter = 0

    def step(self):
        """Perform one iteration of optimization."""
        self.iter += 1
        k = self.iter
        beta1 = self.beta1
        beta2 = self.beta2
        beta2k = beta2**k
        f, g = self.fg(self.x)
        # update moment estimates
        self.m = beta1*self.m + (1-beta1) * g
        self.v = beta2*self.v + (1-beta2) * (g*g)
        # torch: exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1-beta2) == v
        mhat = self.m / (1 - beta1**k)
        # going to use this many times, local lookup is cheaper
        rhoinf = self.rhoinf
        rho = rhoinf - (2*k*beta2k)/(1-beta2k)
        x = self.x
        if rho >= 5:  # the paper uses 4; PyTorch uses 5, as do most implementations
            # l = np.sqrt((1-beta2k)/self.v)  # NOQA
            # the commented-out l is exactly as written in the paper; it seems
            # to blow up every time, so it must be a typo (missing sqrt(v)).
            # torch computes vhat the same way as ADAM; assume that's the fix
            l = np.sqrt(1 - beta2k) / (np.sqrt(self.v) + self.eps)  # NOQA
            num = (rho - 4) * (rho - 2) * rhoinf
            den = (rhoinf - 4) * (rhoinf - 2) * rho
            r = np.sqrt(num/den)
            self.x = x - self.alpha * r * mhat * l
        else:
            self.x = x - self.alpha * mhat
        return x, f, g


def runN(optimizer, N):
    """Yield (x, f, g) from N successive calls to optimizer.step()."""
    for _ in range(N):
        yield optimizer.step()
A minimal working example that blows up:
import numpy as np
from scipy.optimize import rosen, rosen_der


def fg(x):
    f = rosen(x)
    g = rosen_der(x)
    return f, g


x0 = np.array([-2.0, 2.0])
opt = RADAM(fg, x0, 1e-2)
hist = []
xh = []
for xk, fk, gk in runN(opt, 1000):
    hist.append(float(fk))
    xh.append(xk.copy())
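A quick check that the run actually goes non-finite:
blew_up = not np.all(np.isfinite(hist))
first_bad = next((i for i, h in enumerate(hist) if not np.isfinite(h)), None)
print("diverged:", blew_up, "first non-finite iteration:", first_bad)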
I do not observe this behavior with vanilla Adam, Yogi, Adagrad, RMSprop, or other optimizers. Any thoughts? @LiyuanLucasLiu
@brandondube thanks for providing the example.
I believe this is a known issue and can be fixed by setting degenerated_to_sgd=False
(in your case, you can simply delete the else: self.x = x - self.alpha * mhat part).
More discussion can be found at: https://github.com/LiyuanLucasLiu/RAdam/issues/54#issuecomment-703283972.
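In the numpy implementation above, that amounts to leaving x untouched whenever rho is below the threshold; roughly, the tail of step() becomes:
if rho >= 5:
    l = np.sqrt(1 - beta2k) / (np.sqrt(self.v) + self.eps)
    num = (rho - 4) * (rho - 2) * rhoinf
    den = (rhoinf - 4) * (rhoinf - 2) * rho
    r = np.sqrt(num / den)
    self.x = x - self.alpha * r * mhat * l
# no else branch: m and v still accumulate, but x is not updated
return x, f, g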
Thanks, that was it. I made a different choice, detuning g by its norm. This increases the range of stable learning rates, although not all that much.
gsq = g * g  # the same quantity used in the v update
invgnorm = 1 / np.sqrt(gsq.sum())
self.x = x - self.alpha * invgnorm * g
I observed that RAdam can produce NaN loss in the first epochs while Adam does not. This is not just one or two experiments but a general observation. I wonder if we can merge the AdaBound clamp into RAdam to avoid this type of issue at the very beginning of training?
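For what it's worth, an AdaBound-style clamp grafted onto the rectified branch of the numpy step() above might look roughly like this; final_lr and gamma are placeholder values, not recommendations from either paper:
# sketch: clamp the per-parameter step size before applying the momentum term
final_lr = 0.1    # placeholder target step size for the SGD-like regime
gamma = 1e-3      # placeholder rate at which the bounds tighten
lower = final_lr * (1 - 1 / (gamma * k + 1))
upper = final_lr * (1 + 1 / (gamma * k))
step = np.clip(self.alpha * r * l, lower, upper)  # AdaBound-style bound on the step
self.x = x - step * mhat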