Liuhong99 / Sophia

The official implementation of “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training”
MIT License

Bug in the per-coordinate clipping? #34

Closed · vmarkovtsev closed 1 year ago

vmarkovtsev commented 1 year ago

The current code for per-coordinate clipping is this:

```python
ratio = (exp_avg.abs() / (rho * bs * hess + 1e-15)).clamp(None, 1)
param.addcmul_(exp_avg.sign(), ratio, value=step_size_neg)
```

Suppose `hess` carries values below `-1e-15 / (rho * bs)`. In that case, `exp_avg.abs() / (rho * bs * hess + 1e-15)` evaluates to a negative number, potentially of very large magnitude. `.clamp(None, 1)` only caps from above, so the negative values survive, and `exp_avg.sign() * ratio * step_size_neg` becomes an unclipped update (with `step_size_neg < 0`, it is positive when `exp_avg` is positive and negative when `exp_avg` is negative).
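A scalar sketch of the issue (plain Python standing in for the elementwise tensor ops; the hyperparameter values are made up for illustration):

```python
# Scalar mock-up of the per-coordinate clipping above; plain Python
# mirrors the elementwise tensor ops (abs, sign, one-sided clamp).
rho, bs, lr = 0.04, 480, 1e-4          # illustrative values, not from the repo
exp_avg = -0.5                         # one momentum coordinate
hess = -1e-10                          # hypothetical negative curvature entry

denom = rho * bs * hess + 1e-15        # ~ -1.92e-9: negative
ratio = min(abs(exp_avg) / denom, 1)   # clamp(None, 1) only caps from above
# ratio ~ -2.6e8 survives the clamp with a huge magnitude
step_size_neg = -lr
update = (-1 if exp_avg < 0 else 1) * ratio * step_size_neg
```

The resulting `update` has magnitude far beyond the intended `lr`-scaled step.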

In my experiments, `hess` has always been positive, so this problem has never appeared. The paper, however, says that:

> When any entry of h_t is negative, ... = η·ρ·sign(m_t[i]), which is the same as stochastic momentum SignSGD.

So negative Hessian values should be clipped before the division. WDYT?

Liuhong99 commented 1 year ago

`hess` cannot be negative in SophiaG, since it holds the diagonal entries of the Gauss-Newton matrix, which are non-negative. In the paper,

> When any entry of h_t is negative, ... = η·ρ·sign(m_t[i]), which is the same as stochastic momentum SignSGD.

deals with the potential negative entries of `hess` for SophiaH. The Hessian estimator in SophiaH estimates the diagonal of the actual Hessian matrix, so it can contain some negative values.

vmarkovtsev commented 1 year ago

Thanks for the explanation!