Liuhong99 / Sophia

The official implementation of “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training”
MIT License

Where is the implementation of the estimators? #22

Closed · logprobability closed this issue 1 year ago

logprobability commented 1 year ago

Right now the update_hessian code in sophia.py seems to be a vanilla Adam-style second-moment EMA update (without the bias correction).

Two questions:
1. Where is the update_hessian code that uses the GNB estimator?
2. Do you not use the standard bias correction for the EWMA in the results reported in the paper?
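
For context, the kind of update being described is roughly the following. This is a minimal sketch assuming a PyTorch-style per-parameter Hessian buffer, not the repository's exact code:

```python
import torch

def update_hessian_ema(hess: torch.Tensor, grad: torch.Tensor, beta2: float = 0.99) -> torch.Tensor:
    # EMA of the element-wise squared gradient, analogous to Adam's
    # second-moment update: hess <- beta2 * hess + (1 - beta2) * grad * grad.
    # There is no subsequent division by (1 - beta2 ** t), i.e. no Adam-style
    # bias correction.
    hess.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)
    return hess
```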

logprobability commented 1 year ago

Just to update folks for posterity: the GNB estimator is implemented in the training loop (the usage code) rather than inside the optimizer itself. This still leaves the question of where the Hutchinson estimator is implemented, but I don't think that is a particularly big detail.

The ordering is a little strange in my opinion, but you can't argue with what works. First the standard training step runs (before the Hessian is updated); then the Hessian is updated by backpropagating the loss with respect to targets sampled from the categorical distribution defined by the model's outputs. It seems like you might want to update the Hessian first, but if it works, it works. A sketch of the pattern is below.
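
Here is a minimal sketch of that pattern, under some assumptions: `model`, `optimizer`, `loader`, and `hess_interval` are placeholder names, the model returns `(logits, loss)` nanoGPT-style, and the optimizer exposes the `update_hessian` method discussed above. This is not the repository's exact code:

```python
import torch
import torch.nn.functional as F

for it, (X, Y) in enumerate(loader):
    # 1) standard training step on the real targets
    logits, loss = model(X, Y)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    # 2) every `hess_interval` iterations, refresh the Hessian EMA with the
    #    GNB estimator: sample targets from the categorical distribution
    #    defined by the model's own logits, then backpropagate the loss on
    #    those sampled targets before calling update_hessian().
    if it % hess_interval == hess_interval - 1:
        logits, _ = model(X, Y)  # fresh forward pass for full-sequence logits
        samp_y = torch.distributions.Categorical(logits=logits).sample()
        loss_sampled = F.cross_entropy(
            logits.view(-1, logits.size(-1)), samp_y.view(-1)
        )
        loss_sampled.backward()
        optimizer.update_hessian()
        optimizer.zero_grad(set_to_none=True)
```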

This still leaves question 2: assuming the code here is what was used for the paper, it is interesting that it doesn't apply bias correction.
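
For reference, the standard bias correction in question would look roughly like this (a sketch, not code from the repository):

```python
import torch

def bias_corrected(hess: torch.Tensor, beta2: float, t: int) -> torch.Tensor:
    # Adam-style correction for the EMA's initialization bias at step t;
    # per the discussion above, the released code uses the raw EMA instead.
    return hess / (1 - beta2 ** t)
```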

Liuhong99 commented 1 year ago

Thanks @logprobability for the reply! The GNB estimator can generally be implemented in the training loop, following train_sophiag.py. Empirically, the order of the Hessian update and the optimizer step does not matter much, and neither does bias correction.

logprobability commented 1 year ago

Thanks for your response, I'll close the issue.