I tried a few sweeps at https://github.com/mgobrain/eleutherAI/tree/main/scaling_sweep (a minimal launch sketch follows the list):

- `sweep_GPT` replicates the canonical GPT-3 model sizes from the GPT-3 paper (Brown et al., 2020).
- `sweep_TG` uses a config file from experiments by @TheodoreGalanos.
- `sweep_scaling_laws` takes a coarse look at a range of model sizes, following Kaplan et al.'s scaling-laws paper.
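For reference, here is a minimal sketch of how one of these sweeps can be launched with wandb. The parameter grid, the project name, and the `train()` stub are illustrative assumptions, not the actual configs in the scaling_sweep repo:

```python
import wandb

# Illustrative sweep definition; the real grids live in the scaling_sweep repo.
sweep_config = {
    "method": "grid",  # coarse grid first; "bayes" is an option for later refinement
    "metric": {"name": "val_loss", "goal": "minimize"},
    "parameters": {
        "n_layer": {"values": [6, 12, 24]},        # assumed parameter names,
        "n_embd": {"values": [384, 768, 1536]},    # chosen for illustration
        "learning_rate": {"values": [1e-4, 3e-4, 6e-4]},
    },
}

def train():
    # wandb.init() picks up the hyperparameters the sweep chose for this run
    run = wandb.init()
    cfg = run.config
    # ... build a model from cfg.n_layer / cfg.n_embd, train, evaluate ...
    wandb.log({"val_loss": 0.0})  # placeholder; log the real validation loss here

sweep_id = wandb.sweep(sweep_config, project="gpt-neox-scaling")  # project name assumed
wandb.agent(sweep_id, function=train)
```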
@ShivanshuPurohit is looking at integrating this with the NeoX training code.
Background
This issue tracks the scaling-laws parameter sweep: @StellaAthena wanted to use wandb to find optimal configurations for GPT-NeoX.
What to plot?
Sweep over model-size exponents from 10 to 21, starting coarsely with roughly three configurations per model size; a sketch of one way to enumerate such configurations is below.
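One way to read "exponents from 10 to 21" is as powers of two of the target parameter count; that base is an assumption, since the issue doesn't pin it down. Under that reading, and using the rough transformer estimate N ~= 12 * n_layer * d_model^2 from the Kaplan et al. paper, the ~3 configurations per size could vary the width-to-depth ratio:

```python
def configs_for_exponent(exp, ratios=(32, 64, 128)):
    """Yield (n_layer, d_model) pairs with roughly 2**exp parameters.

    Uses the rough estimate params ~= 12 * n_layer * d_model**2 and sweeps
    the aspect ratio d_model / n_layer over a few values to get ~3
    configurations per model size. Illustrative only; the base-2 exponent
    and the ratio choices are assumptions, not taken from the repo.
    """
    target = 2 ** exp
    for ratio in ratios:
        # With d_model = ratio * n_layer:
        #   params ~= 12 * ratio**2 * n_layer**3
        n_layer = max(1, round((target / (12 * ratio ** 2)) ** (1 / 3)))
        yield n_layer, ratio * n_layer

for exp in range(10, 22):
    print(exp, list(configs_for_exponent(exp)))
```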
Related Papers/Frameworks
Much of the discussion has taken place in the #gpt-neox channel on Discord.
@misc{kaplan2020scaling,
  title={Scaling Laws for Neural Language Models},
  author={Jared Kaplan and Sam McCandlish and Tom Henighan and Tom B. Brown and Benjamin Chess and Rewon Child and Scott Gray and Alec Radford and Jeffrey Wu and Dario Amodei},
  year={2020},
  eprint={2001.08361},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
https://arxiv.org/abs/2001.08361