I tried a few sweeps at https://github.com/mgobrain/eleutherAI/tree/main/scaling_sweep (a minimal launch sketch follows the list):

- `sweep_GPT` replicates the canonical GPT-3 model sizes from the GPT-3 paper (Brown et al., 2020).
- `sweep_TG` uses a config file from experiments by @TheodoreGalanos.
- `sweep_scaling_laws` takes a coarse look at a range of model sizes, following Kaplan et al.'s scaling-laws paper.
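For reference, here is a minimal sketch of how one of these sweeps can be launched with wandb. The parameter grid, the project name, and the `train()` stub are illustrative assumptions, not the actual configs in the scaling_sweep repo:

```python
import wandb

# Illustrative sweep definition; the real grids live in the scaling_sweep repo.
sweep_config = {
    "method": "grid",  # coarse grid first; "bayes" is an option for later refinement
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "n_layer": {"values": [6, 12, 24]},        # assumed parameter names,
        "n_embd": {"values": [384, 768, 1536]},    # chosen for illustration
        "learning_rate": {"values": [1e-4, 3e-4, 6e-4]},
    },
}

def train():
    # wandb.init() picks up the hyperparameters the sweep chose for this run
    run = wandb.init()
    cfg = run.config
    # ... build a model from cfg.n_layer / cfg.n_embd, train, evaluate ...
    wandb.log({"val_loss": 0.0})  # placeholder; log the real validation loss here

sweep_id = wandb.sweep(sweep_config, project="gpt-neox-scaling")  # project name assumed
wandb.agent(sweep_id, function=train)
```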
@ShivanshuPurohit is looking at integrating this with the NeoX training code.
Background
This issue tracks the scaling-laws parameter sweep: @StellaAthena wanted to use wandb to find optimal configurations for GPT-NeoX.
What to plot?
Sweep over model-size exponents from 10 to 21, starting coarsely with roughly three configurations per model size; a sketch of one way to enumerate such configurations is below.
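One way to read "exponents from 10 to 21" is as powers of two of the target parameter count; that base is an assumption, since the issue doesn't pin it down. Under that reading, and using the rough transformer estimate N ~= 12 * n_layer * d_model^2 from the Kaplan et al. paper, the ~3 configurations per size could vary the width-to-depth ratio:

```python
def configs_for_exponent(exp, ratios=(32, 64, 128)):
    """Yield (n_layer, d_model) pairs with roughly 2**exp parameters.

    Uses the rough estimate params ~= 12 * n_layer * d_model**2 and sweeps
    the aspect ratio d_model / n_layer over a few values to get ~3
    configurations per model size. Illustrative only; the base-2 exponent
    and the ratio choices are assumptions, not taken from the repo.
    """
    target = 2 ** exp
    for ratio in ratios:
        # With d_model = ratio * n_layer:
        #   params ~= 12 * ratio**2 * n_layer**3
        n_layer = max(1, round((target / (12 * ratio ** 2)) ** (1 / 3)))
        yield n_layer, ratio * n_layer

for exp in range(10, 22):
    print(exp, list(configs_for_exponent(exp)))
```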
Related Papers/Frameworks
Much of the discussion has taken place in the #gpt-neox channel on Discord.
@misc{kaplan2020scaling,
  title={Scaling Laws for Neural Language Models},
  author={Jared Kaplan and Sam McCandlish and Tom Henighan and Tom B. Brown and Benjamin Chess and Rewon Child and Scott Gray and Alec Radford and Jeffrey Wu and Dario Amodei},
  year={2020},
  eprint={2001.08361},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
https://arxiv.org/abs/2001.08361