First attempt at checkpoints

danbraunai / simple_stories_train

Trains small LMs. Designed for training on SimpleStories.

First attempt at checkpoints #11

Closed: ThomasWMarshall closed this 2 months ago

ThomasWMarshall commented 2 months ago

Description

Adds checkpoint saving.

How Has This Been Tested?

Running locally, verifying that the saved files contain the expected information.

ThomasWMarshall commented 2 months ago

Both the ApolloResearch sample repo that this repo was forked from and the starter code in train_gpt2.py contain functions for saving parameters. If we can, I suggest we drop the code in train_gpt2.py and standardize on the simpler method from the ApolloResearch repo, which really boils down to torch.save(model.state_dict(), model_file). This would simplify things by keeping them consistent between train_llama.py and train_gpt2.py.
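For reference, a minimal sketch of the state_dict approach described above, plus the kind of local round-trip check mentioned under testing. The helper names save_checkpoint and load_checkpoint, the nn.Linear stand-in model, and the checkpoint path are all hypothetical and are not taken from either repo; only torch.save(model.state_dict(), model_file) comes from the comment.

```python
import tempfile
from pathlib import Path

import torch
from torch import nn


def save_checkpoint(model: nn.Module, model_file: Path) -> None:
    """Save only the model's parameters and buffers (its state_dict)."""
    model_file.parent.mkdir(parents=True, exist_ok=True)
    torch.save(model.state_dict(), model_file)


def load_checkpoint(model: nn.Module, model_file: Path) -> nn.Module:
    """Load a saved state_dict into an already-constructed model."""
    state_dict = torch.load(model_file, map_location="cpu", weights_only=True)
    model.load_state_dict(state_dict)
    return model


# Hypothetical stand-in; the training script would pass its own model here.
model = nn.Linear(4, 2)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file layout, chosen only for this illustration.
    model_file = Path(tmp_dir) / "checkpoints" / "model_step_0.pt"
    save_checkpoint(model, model_file)
    # A freshly initialized model has different weights until we load.
    restored = load_checkpoint(nn.Linear(4, 2), model_file)

# Round-trip check: every saved tensor matches the original exactly.
same = all(
    torch.equal(original, loaded)
    for original, loaded in zip(
        model.state_dict().values(), restored.state_dict().values()
    )
)
```

Saving only the state_dict keeps the file independent of the class definition's import path, but the caller has to rebuild the model with the same architecture before loading. Optimizer state and step counters are not covered by this sketch.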

danbraunai commented 2 months ago

> This would simplify things by keeping them consistent between train_llama.py and train_gpt2.py.

I'm not sure we care much about supporting GPT-2 for now. But as to your general point: the llm.c code that this script was ~copied from keeps nearly everything in one file, which can be nice. For our purposes I don't think that's necessary, so I'm fine with starting to split things up into new files (as you have done with your utils).

ThomasWMarshall commented 2 months ago

Thanks for the review! I've done my best to address all of your comments. Please don't hesitate to reopen any of them if you'd like me to revisit.