Liuhong99 / Sophia

The official implementation of “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training”
MIT License

Unable to reproduce the GPT-2 small results #43

Closed · pmpalang closed this 1 year ago

pmpalang commented 1 year ago

Hello forum,

I'm trying to reproduce the GPT-2 small results using the commands/configurations in the repository, but the trends are quite different from those in the paper/README. For example, the validation curve is expected to show roughly 2x faster convergence, yet the two curves are almost on top of each other at 10k iterations. The only change from the original commands is that I use --nproc_per_node=4 and batch_size=2 to fit my GPU. I'm attaching the validation-loss and training-loss vs. iteration plots for reference. Note that the run is still in progress, but even the trends are unlike what the paper/README would suggest. I would appreciate any help in reproducing the original plots.
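One knob worth double-checking when changing --nproc_per_node and batch_size is gradient_accumulation_steps: in the nanoGPT-style training script this repo appears to build on, the effective batch size per optimizer step is the product of all three. Below is a minimal sketch of that arithmetic; all concrete numbers are hypothetical stand-ins, not values taken from this issue or the README.

```python
# Sketch of the effective-batch-size arithmetic in a nanoGPT-style launcher.
# All concrete values below are hypothetical; substitute the ones from the
# repo's README command and from your own run.

def tokens_per_optimizer_step(nproc_per_node: int,
                              batch_size: int,
                              grad_accum_steps: int,
                              block_size: int = 1024) -> int:
    """Tokens consumed per optimizer step across all GPUs."""
    return nproc_per_node * batch_size * grad_accum_steps * block_size

# Hypothetical reference configuration (stand-in for the README command):
reference = tokens_per_optimizer_step(nproc_per_node=8, batch_size=12,
                                      grad_accum_steps=5)

# The modified run from this issue, if grad_accum_steps is left unchanged:
modified = tokens_per_optimizer_step(nproc_per_node=4, batch_size=2,
                                     grad_accum_steps=5)

print(f"reference: {reference} tokens/step, modified: {modified} tokens/step")
# If these differ, the two runs are not directly comparable step-for-step.
```

If the effective batch size has shrunk, scaling grad_accum_steps up to compensate is one way to keep the per-step comparison meaningful.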

Thanks, Poovaiah

Tagging another similar issue, #30.

[Attached plots: validation loss; training loss without smoothing; training loss with smoothing]

Liuhong99 commented 1 year ago

If you zoom in on the y-axis, you can observe that Sophia-G's validation loss is about 0.04 lower than Adam's. That gap is meaningful: if you extend the Adam run from 100k to 200k steps, you will find that Adam at 200k beats Adam at 100k by the same 0.04, i.e., Sophia-G at 100k roughly matches Adam at 200k.
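A minimal plotting sketch for making that comparison visible; the log file names, file format, and y-limits are assumptions, not artifacts produced by this repo:

```python
# Sketch: overlay the two validation-loss curves with a zoomed y-axis so a
# ~0.04 gap becomes visible. File names and y-limits are assumptions.
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical logs: two columns per file, iteration and validation loss.
adam = np.loadtxt("val_loss_adam.csv", delimiter=",")
sophia = np.loadtxt("val_loss_sophiag.csv", delimiter=",")

fig, ax = plt.subplots()
ax.plot(adam[:, 0], adam[:, 1], label="AdamW")
ax.plot(sophia[:, 0], sophia[:, 1], label="Sophia-G")
ax.set_xlabel("iteration")
ax.set_ylabel("validation loss")
ax.set_ylim(2.9, 3.3)  # zoom in; pick limits that bracket your final losses
ax.legend()
fig.savefig("val_loss_zoom.png", dpi=200)
```

With the full loss range on the y-axis, a 0.04 difference is easy to miss; clipping the limits to the tail of training makes the separation between the curves apparent.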