Closed · ThomasWMarshall closed this 2 months ago
Both the ApolloResearch sample repo that this repo was forked from and the starter code in train_gpt2.py contain functions for saving parameters. If we can, I suggest we forgo the code in train_gpt2.py and standardize on the simpler method from the ApolloResearch repo, which really boils down to `torch.save(model.state_dict(), model_file)`. This would simplify things by keeping train_llama.py and train_gpt2.py consistent.
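As a minimal sketch of the ApolloResearch-style approach (the model and file path here are hypothetical stand-ins, not names from this repo):

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical tiny model standing in for the real train_llama.py model.
model = nn.Linear(4, 2)

# Save: the simple method from the ApolloResearch repo.
model_file = os.path.join(tempfile.mkdtemp(), "model.pt")
torch.save(model.state_dict(), model_file)

# Load: construct the same architecture, then restore the weights.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(model_file))
```

Note that `state_dict()` saves only parameters and buffers, so loading requires reconstructing the same architecture first; the upside is that the checkpoint format stays independent of the training-script class definitions.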
I'm not sure we care much about supporting gpt2 for now. But to your general point: the llmc code that this script was roughly copied from has the property that nearly everything lives in one file, which can be nice. For our purposes, though, I don't think that's necessary, so I'm fine with starting to split things into new files (as you've done with your utils).
Thanks for the review! I've done my best to address all of your comments. Please don't hesitate to reopen any of them if you'd like me to revisit.
Description
Adds checkpoint saving.
How Has This Been Tested?
Running locally, verifying that the saved files contain the expected information.
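One way to verify "the saved files contain the expected information" programmatically is to compare the checkpoint's parameter names and shapes against the live model. This is a hedged sketch; the model here is a hypothetical stand-in for the one in train_llama.py:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in model; a real check would use the train_llama.py model.
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

# Save a checkpoint the same way the script does.
path = os.path.join(tempfile.mkdtemp(), "checkpoint.pt")
torch.save(model.state_dict(), path)

# Reload and confirm the file holds the expected parameter names and shapes.
state = torch.load(path)
expected = {name: p.shape for name, p in model.named_parameters()}
actual = {name: t.shape for name, t in state.items()}
assert actual == expected, f"checkpoint mismatch: {set(actual) ^ set(expected)}"
```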