train_smol_gpt now successfully trains a 124M-parameter GPT-2 model.
Note: we're choosing to train with a Chinchilla-optimal number of tokens (~2.5B) rather than the 300B tokens used in the GPT-2 paper, so training doesn't take an entire year.
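
For reference, a minimal sketch of where the ~2.5B figure comes from, assuming the common ~20-tokens-per-parameter heuristic from the Chinchilla paper (Hoffmann et al., 2022); the function name below is illustrative, not part of this repo:

```python
# Approximate Chinchilla-optimal token budget, assuming the
# ~20-tokens-per-parameter rule of thumb from Hoffmann et al. (2022).
def chinchilla_optimal_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    """Compute-optimal training token count for a given model size."""
    return n_params * tokens_per_param

n_params = 124_000_000  # GPT-2 small
print(f"{chinchilla_optimal_tokens(n_params) / 1e9:.2f}B tokens")  # ~2.48B
```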