Closed: logprobability closed this issue 1 year ago
Just to update folks for posterity - the GNB estimator is implemented in the usage example itself (the training loop), not inside the optimizer. This still leaves the question of the Hutchinson estimator, but I don't think that is a particularly big detail.
The ordering is a little strange in my opinion, but you can't argue with what works. First the standard training step happens (before the Hessian is touched); then the Hessian is updated by backpropagating the loss w.r.t. targets sampled from the categorical distribution defined by the model's outputs (roughly as in the sketch below). It seems like you might want to update the Hessian first, but if it works, it works.
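For posterity, here is a minimal sketch of that loop, loosely modeled on train_sophiag.py rather than copied from it. The toy model, data, and `hessian_update_interval` are placeholders of mine; the sketch assumes the optimizer class is `SophiaG` from `sophia.py` and that it exposes the `update_hessian()` method discussed below.

```python
# Sketch only: placeholder model/data; assumes SophiaG from sophia.py
# with an update_hessian() method.
import torch
import torch.nn.functional as F
from sophia import SophiaG

model = torch.nn.Linear(16, 10)      # stand-in for the real network
optimizer = SophiaG(model.parameters(), lr=1e-4)
hessian_update_interval = 10         # how often to refresh the Hessian estimate

for step in range(1000):
    inputs = torch.randn(32, 16)
    targets = torch.randint(0, 10, (32,))

    # 1) Standard training step first: loss on the true targets, backward, update.
    logits = model(inputs)
    loss = F.cross_entropy(logits, targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)

    # 2) Periodically: GNB Hessian update. Sample labels from the model's own
    #    categorical output distribution, backprop the loss w.r.t. those
    #    sampled targets, then let the optimizer refresh its Hessian estimate.
    if (step + 1) % hessian_update_interval == 0:
        logits = model(inputs)
        sampled = torch.multinomial(F.softmax(logits, dim=-1), num_samples=1).squeeze(-1)
        sampled_loss = F.cross_entropy(logits, sampled)
        sampled_loss.backward()
        optimizer.update_hessian()
        optimizer.zero_grad(set_to_none=True)
```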
This still leaves question #2 - assuming the code here is what was used for the paper, it is interesting that it doesn't apply bias correction.
Thanks @logprobability for the reply! The GNB estimator can generally be implemented following train_sophiag.py in the training loop. Empirically, the order of the Hessian update and the optimizer step does not matter much, and neither does bias correction.
Thanks for your response; I'll close the issue.
Right now the update_hessian code in sophia.py seems to be a vanilla Adam update (without the bias correction).
Two questions: 1) Where is the update_hessian code that has the GNB estimator? 2) Do you not use the standard bias correction for EWMA in the results in the paper?
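For concreteness, the distinction I mean in #2 is roughly the following (a sketch with placeholder names, not the repository's code): the uncorrected branch is the "vanilla Adam"-style EWMA, and the `bias_correct` branch is the standard Adam-style correction for the zero initialization.

```python
import torch

def ewma_hessian_update(h, hess_estimate, beta2, t, bias_correct=False):
    """EWMA of a per-parameter diagonal Hessian estimate (placeholder names).

    h:             running EWMA buffer, initialized to zeros
    hess_estimate: this step's estimate, e.g. grad**2 from the
                   sampled-label (GNB) backward pass
    t:             1-indexed count of Hessian updates performed so far
    """
    # Uncorrected EWMA: no adjustment for the zero initialization of h.
    h.mul_(beta2).add_(hess_estimate, alpha=1.0 - beta2)
    if bias_correct:
        # Standard Adam-style bias correction.
        return h / (1.0 - beta2 ** t)
    return h

# With h starting at zero, the uncorrected EWMA is badly underscaled for
# small t; bias correction rescales it by 1 / (1 - beta2**t).
est = torch.ones(4)
print(ewma_hessian_update(torch.zeros(4), est, beta2=0.99, t=1))                      # ~0.01 everywhere
print(ewma_hessian_update(torch.zeros(4), est, beta2=0.99, t=1, bias_correct=True))   # ~1.0 everywhere
```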