Liuhong99 / Sophia

The official implementation of “Sophia: A Scalable Stochastic Second-order Optimizer for Language Model Pre-training”
MIT License

Suggested new experiments: GPT2-small w/ Sophia on Fineweb-10B data #51

Open sanyalsunny111 opened 1 month ago

sanyalsunny111 commented 1 month ago

Hi @Liuhong99,

I am a big fan of Sophia; I have used it and cited it every time. I just thought I would suggest a new, less resource-intensive experiment.

a) Karpathy updated the nano_gpt2 training code with a tokens-without-replacement dataloader and a new dataset, finewebedu-10B. I am curious how Sophia would do in this new setting (a rough sketch of such a loader is at the end of this comment).

b) The inverse layer-index attention scaling used here is pretty good, but recently many works have used QK normalization instead; a sketch of that variant follows below.
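
For reference, here is a minimal sketch of QK normalization on a nanoGPT-style attention block. The module name `QKNormCausalSelfAttention` and the choice of per-head LayerNorm are just my illustration (several works use RMSNorm instead); the only point is normalizing q and k per head before the dot product.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormCausalSelfAttention(nn.Module):
    """Causal self-attention with QK normalization: queries and keys are
    normalized per head before the dot product, instead of (or on top of)
    the 1/(layer_idx+1) attention scaling."""

    def __init__(self, n_embd: int = 768, n_head: int = 12):
        super().__init__()
        assert n_embd % n_head == 0
        self.n_head = n_head
        self.head_dim = n_embd // n_head
        self.c_attn = nn.Linear(n_embd, 3 * n_embd, bias=False)
        self.c_proj = nn.Linear(n_embd, n_embd, bias=False)
        # per-head norms applied to q and k (LayerNorm here; some works use RMSNorm)
        self.q_norm = nn.LayerNorm(self.head_dim)
        self.k_norm = nn.LayerNorm(self.head_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        q, k, v = self.c_attn(x).split(C, dim=2)
        q = q.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_head, self.head_dim).transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)  # QK normalization
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.c_proj(y)
```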

Thank you for such a good repo. Feel free to disregard this suggestion.
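
And for point (a), a rough sketch of what a tokens-without-replacement loader might look like, assuming a flat uint16 token file as produced by nanoGPT-style prepare scripts; the class name and the simple wrap-around policy are only illustrative:

```python
import numpy as np
import torch

class SequentialTokenLoader:
    """Sketch of a 'tokens without replacement' loader: instead of sampling
    random offsets into the token file each step (nanoGPT-style), walk through
    the pre-tokenized shard in non-overlapping windows so every token is seen
    once per pass."""

    def __init__(self, bin_path: str, batch_size: int, block_size: int, device: str = "cuda"):
        # assumes a flat uint16 token file, e.g. train.bin from a nanoGPT prepare script
        self.tokens = np.memmap(bin_path, dtype=np.uint16, mode="r")
        self.B, self.T = batch_size, block_size
        self.device = device
        self.pos = 0

    def next_batch(self):
        span = self.B * self.T + 1
        if self.pos + span > len(self.tokens):  # wrap around at the end of the shard
            self.pos = 0
        buf = torch.from_numpy(self.tokens[self.pos:self.pos + span].astype(np.int64))
        x = buf[:-1].view(self.B, self.T)   # inputs
        y = buf[1:].view(self.B, self.T)    # next-token targets
        self.pos += self.B * self.T
        return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True)
```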

dhia680 commented 2 weeks ago

I trained GPT-2 on finewebedu-10B with Sophia (under different settings: varying learning rates and which layers receive weight decay) and got the initial results below. The loss goes up early in training. You can see the difference from the baseline (AdamW configured as in the nanoGPT repo).
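
For context, my runs roughly follow the usage shown in this repo's README, something like the sketch below; `model`, `data_loader`, `batch_size`, and `block_size` are assumed to come from the surrounding nanoGPT-style script, and the lr / rho / weight_decay values are just placeholders that I varied between runs:

```python
import torch
import torch.nn.functional as F
from sophia import SophiaG  # optimizer provided by this repo

# placeholder GPT-2 small hyperparameters, varied between my runs
optimizer = SophiaG(model.parameters(), lr=2e-4, betas=(0.965, 0.99),
                    rho=0.01, weight_decay=2e-1)

k = 10                        # refresh the Hessian EMA every k steps
bs = batch_size * block_size  # tokens per optimizer step, as in the README

for iter_num, (X, Y) in enumerate(data_loader):
    logits, loss = model(X, Y)
    loss.backward()
    optimizer.step(bs=bs)
    optimizer.zero_grad(set_to_none=True)

    if iter_num % k == k - 1:
        # Gauss-Newton-Bartlett Hessian estimate: resample labels from the model
        logits, _ = model(X, None)
        samp_dist = torch.distributions.Categorical(logits=logits)
        y_sample = samp_dist.sample()
        loss_sampled = F.cross_entropy(
            logits.view(-1, logits.size(-1)), y_sample.view(-1), ignore_index=-1)
        loss_sampled.backward()
        optimizer.update_hessian()
        optimizer.zero_grad(set_to_none=True)
```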

[Figure: training loss of Sophia on FineWeb-Edu vs. the AdamW baseline]

Some have suggested a batch-size ramp-up to fix this behaviour; a sketch of one possible schedule is below. I haven't tried it yet. I'm open to discussion.
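
For what it's worth, the ramp-up could be as simple as a linear schedule on the effective batch size, applied through the number of gradient-accumulation steps; all numbers here are placeholders:

```python
def batch_size_schedule(step: int,
                        initial_batch_size: int = 120,
                        final_batch_size: int = 480,
                        ramp_steps: int = 2000) -> int:
    """Linearly ramp the effective batch size from initial to final over ramp_steps."""
    if step >= ramp_steps:
        return final_batch_size
    frac = step / ramp_steps
    return int(initial_batch_size + frac * (final_batch_size - initial_batch_size))

# example usage: grow gradient accumulation as the batch size ramps up
# grad_accum_steps = max(1, batch_size_schedule(iter_num) // micro_batch_size)
```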