Hessian-vector product vs. Hessian estimator

Liuhong99 / Sophia

The official implementation of “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training”

MIT License


Hessian-vector product vs. Hessian estimator #23

Closed (zhouyuan closed this issue 1 year ago)

zhouyuan commented 1 year ago

Hi, @Liuhong99

(Sorry for filing an issue for this; it's really a question about implementation details.) Sophia currently uses a Hessian estimator (Hutchinson or Gauss-Newton-Bartlett) for preconditioning, and the paper also mentions that Sophia can use Hessian-vector products (HVP) in PyTorch to do the same thing. Have you also implemented the latter approach? I'm wondering how large the performance gap between the two approaches is.

Thank you for sharing the code; it has been very helpful for understanding the paper.

Cheers, -yuan

Liuhong99 commented 1 year ago

Hi @zhouyuan, thanks for your interest! Both Sophia-G and Sophia-H use Hessian estimators. Hutchinson's estimator relies on HVPs, while GNB does not. In this sense, Sophia-G is more promising because it's easier to implement. In the current experiments, GNB also performs better than Hutchinson.
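
To make the distinction concrete, here is a minimal, dependency-free sketch of the two diagonal estimators on a toy logistic-regression batch. It is not the repository's code. The data `X`, `Y`, the helper names, and the use of a central finite difference as a stand-in for an autodiff HVP are all assumptions made for illustration, as is drawing the Hutchinson probe `u` from a standard Gaussian. For this toy model the Gauss-Newton matrix coincides with the Hessian, so both estimators target the same diagonal.

```python
import math
import random

# Toy logistic-regression batch (hypothetical data, for illustration only).
X = [[1.0, 0.5], [-0.5, 1.0], [1.5, -1.0], [0.2, 0.3]]
Y = [1.0, 0.0, 1.0, 0.0]


def probs(theta):
    """Model probabilities sigmoid(x . theta) for each example in the batch."""
    return [1.0 / (1.0 + math.exp(-sum(x_i * t for x_i, t in zip(x, theta))))
            for x in X]


def grad(theta, labels):
    """Gradient of the mean cross-entropy loss on the batch for the given labels."""
    p = probs(theta)
    return [sum((p_b - y_b) * x[i] for p_b, y_b, x in zip(p, labels, X)) / len(X)
            for i in range(len(theta))]


def exact_diag_hessian(theta):
    """Closed-form Hessian diagonal, used only as a reference value."""
    p = probs(theta)
    return [sum(p_b * (1.0 - p_b) * x[i] ** 2 for p_b, x in zip(p, X)) / len(X)
            for i in range(len(theta))]


def hutchinson(theta, rng, eps=1e-4):
    """One sample of u * (H u). The HVP here is a central difference of
    gradients, standing in for the autodiff HVP a real implementation would use."""
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    g_plus = grad([t + eps * u_i for t, u_i in zip(theta, u)], Y)
    g_minus = grad([t - eps * u_i for t, u_i in zip(theta, u)], Y)
    hvp = [(a - b) / (2.0 * eps) for a, b in zip(g_plus, g_minus)]
    return [u_i * h_i for u_i, h_i in zip(u, hvp)]


def gnb(theta, rng):
    """One sample of B * g_hat * g_hat, where g_hat is an ordinary gradient
    computed on labels sampled from the model's own predictions (no HVP)."""
    sampled = [1.0 if rng.random() < p_b else 0.0 for p_b in probs(theta)]
    g_hat = grad(theta, sampled)
    return [len(X) * g * g for g in g_hat]


def average(estimator, theta, n, seed=0):
    """Average n independent samples of an estimator to reduce its variance."""
    rng = random.Random(seed)
    total = [0.0] * len(theta)
    for _ in range(n):
        total = [s + e for s, e in zip(total, estimator(theta, rng))]
    return [s / n for s in total]
```

Averaging many samples of either estimator, for example `average(hutchinson, [0.3, -0.2], 20000)` and `average(gnb, [0.3, -0.2], 20000)`, should approach `exact_diag_hessian([0.3, -0.2])`. In PyTorch, the HVP would typically come from a second call to `torch.autograd.grad` on gradients built with `create_graph=True`, whereas the GNB path needs only a regular backward pass on sampled labels. That difference is one plausible reading of "easier to implement" above.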