Closed: pmpalang closed this issue 1 year ago.
If you zoom in on the y-axis, you can see that Sophia-G has a validation loss 0.04 lower than Adam's. If you extend the Adam run to 200k steps, you will find that Adam at 200k is better than Adam at 100k by 0.04.
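For example, a minimal matplotlib sketch of that zoomed comparison (the curves below are synthetic stand-ins, not logged values from this repo; substitute your own eval logs):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in curves; replace with your own logged validation losses.
steps = np.arange(0, 100_000, 2_000)
val_loss_adam = 2.95 + 1.5 * np.exp(-steps / 20_000)
val_loss_sophiag = 2.91 + 1.5 * np.exp(-steps / 20_000)

plt.plot(steps, val_loss_adam, label="Adam 100k")
plt.plot(steps, val_loss_sophiag, label="Sophia-G 100k")
plt.ylim(2.85, 3.05)  # zoom the y-axis so a ~0.04 gap becomes visible
plt.xlabel("iterations")
plt.ylabel("validation loss")
plt.legend()
plt.tight_layout()
plt.show()
```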
Hello forum,
I'm trying to reproduce the GPT-2 small results using the commands/configurations in the repository. However, the trends are quite different from those in the paper/README. For example, the validation curve is expected to show roughly 2x faster convergence, but I see that the two curves are almost on top of each other at 10k iterations. The only change from the original commands is that I use --nproc_per_node=4 and batch_size=2 to fit my GPUs. I'm copying the validation loss and train loss vs. iterations plots here for reference. Note that the run is still in progress, but even the trends are unlike what the paper/README leads us to expect. I would appreciate any help in reproducing the original plots.
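For context on how much these changes shrink the effective batch, here is the arithmetic as I understand it (a sketch assuming the nanoGPT-style config this repo builds on, where tokens per optimizer step = nproc_per_node * batch_size * gradient_accumulation_steps * block_size; the "original recipe" numbers below are illustrative guesses, not values taken from the README):

```python
# Effective tokens per optimizer step in a nanoGPT-style launcher.
# Assumption: tokens/step = nproc_per_node * batch_size * grad_accum * block_size.

def tokens_per_step(nproc_per_node: int, batch_size: int, grad_accum: int,
                    block_size: int = 1024) -> int:
    return nproc_per_node * batch_size * grad_accum * block_size

# Hypothetical original recipe (illustrative numbers only):
original = tokens_per_step(nproc_per_node=8, batch_size=12, grad_accum=5)

# My run: fewer GPUs and a smaller micro-batch, same grad_accum:
mine = tokens_per_step(nproc_per_node=4, batch_size=2, grad_accum=5)

print(original, mine, original / mine)  # -> 491520 40960 12.0
```

If that is right, my runs see far fewer tokens per optimizer step unless gradient_accumulation_steps is scaled up to compensate, which might by itself explain why my curves look different from the README's.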
Thanks, Poovaiah
Tagging another similar issue #30 here.