LamOne1 opened this issue 1 year ago
For the pretraining code, I agree we will need a checkpoint resume mechanism. If anyone wants to give this a shot, here are the docs on best practices for saving/loading with Fabric: https://lightning.ai/docs/fabric/stable/guide/checkpoint.html
Linked issue with the same request for fine-tuning: https://github.com/Lightning-AI/lit-llama/issues/180
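Not the lit-llama code itself, just a minimal sketch of the save side following the pattern in those Fabric docs (the checkpoint path and the `iter_num`/`last_example_id` counters are placeholder names, not existing variables in the repo):

```python
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices=1)
fabric.launch()

model = torch.nn.Linear(128, 128)             # stand-in for the LLaMA model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)

# Fabric extracts the state_dicts from the module/optimizer automatically,
# so one dict can hold everything needed to resume.
state = {"model": model, "optimizer": optimizer, "iter_num": 0, "last_example_id": 0}

# ... inside the training loop, e.g. every `save_interval` steps ...
fabric.save("out/checkpoint.pth", state)
```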
I would like to request a new feature in the code: the ability to resume training from a checkpoint.
Currently, the code can save a checkpoint of the model's state at any point during training. However, there is no way to resume training from a checkpoint.
The checkpoint could save two things alongside the model state_dict: 1) the optimizer state, and 2) the id of the last example the model has seen (assuming the data is fed to the model sequentially rather than randomly).
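A rough sketch of what the resume side could look like, assuming the state dict from the save sketch above and a sequential data feed (all names and paths here are illustrative, not the actual lit-llama code):

```python
import os
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices=1)
fabric.launch()

model = torch.nn.Linear(128, 128)                        # stand-in for the LLaMA model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)

# Same structure as the saved state: optimizer state plus the position in the data stream.
state = {"model": model, "optimizer": optimizer, "iter_num": 0, "last_example_id": 0}

resume_path = "out/checkpoint.pth"                        # hypothetical path
if os.path.isfile(resume_path):
    # Restores model/optimizer weights in place and replaces the scalar counters in the dict.
    fabric.load(resume_path, state)

train_data = [torch.randn(4, 128) for _ in range(1000)]  # placeholder for the sequential data feed
start = state["last_example_id"]                          # examples [0, start) were already consumed

for example_id in range(start, len(train_data)):
    batch = fabric.to_device(train_data[example_id])
    loss = model(batch).sum()                             # dummy objective, just to show the loop shape
    fabric.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    state["iter_num"] += 1
    state["last_example_id"] = example_id + 1
```

If the dataloader shuffles, the last-example id alone is not enough; the sampler/RNG state would also need to go into the state dict, which is why the sequential-feed assumption matters here.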