LamOne1 opened this issue 1 year ago
For the pretraining code, I agree we will need a checkpoint resume mechanism. If anyone wants to give this a shot, here are the docs on best practices for saving/loading with Fabric: https://lightning.ai/docs/fabric/stable/guide/checkpoint.html
Linked issue with the same request for fine-tuning: https://github.com/Lightning-AI/lit-llama/issues/180
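Not the lit-llama code itself, just a minimal sketch of the save side following the pattern in those Fabric docs (the checkpoint path and the `iter_num`/`last_example_id` counters are placeholder names, not existing variables in the repo):

```python
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices=1)
fabric.launch()

model = torch.nn.Linear(128, 128)             # stand-in for the LLaMA model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)

# Fabric extracts the state_dicts from the module/optimizer automatically,
# so one dict can hold everything needed to resume.
state = {"model": model, "optimizer": optimizer, "iter_num": 0, "last_example_id": 0}

# ... inside the training loop, e.g. every `save_interval` steps ...
fabric.save("out/checkpoint.pth", state)
```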
I would like to request a new feature in the code: the ability to resume training from a checkpoint.
Currently, the code can save a checkpoint of the model's state at any point during training. However, there is no way to resume training from a checkpoint.
The checkpoint could save two things alongside the model state_dict: 1) the optimizer state, and 2) the id of the last example the model has seen (assuming the data is fed to the model sequentially rather than randomly).
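A rough sketch of what the resume side could look like, assuming the state dict from the save sketch above and a sequential data feed (all names and paths here are illustrative, not the actual lit-llama code):

```python
import os
import torch
from lightning.fabric import Fabric

fabric = Fabric(accelerator="auto", devices=1)
fabric.launch()

model = torch.nn.Linear(128, 128)                        # stand-in for the LLaMA model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)

# Same structure as the saved state: optimizer state plus the position in the data stream.
state = {"model": model, "optimizer": optimizer, "iter_num": 0, "last_example_id": 0}

resume_path = "out/checkpoint.pth"                        # hypothetical path
if os.path.isfile(resume_path):
    # Restores model/optimizer weights in place and replaces the scalar counters in the dict.
    fabric.load(resume_path, state)

train_data = [torch.randn(4, 128) for _ in range(1000)]  # placeholder for the sequential data feed
start = state["last_example_id"]                          # examples [0, start) were already consumed

for example_id in range(start, len(train_data)):
    batch = fabric.to_device(train_data[example_id])
    loss = model(batch).sum()                             # dummy objective, just to show the loop shape
    fabric.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    state["iter_num"] += 1
    state["last_example_id"] = example_id + 1
```

If the dataloader shuffles, the last-example id alone is not enough; the sampler/RNG state would also need to go into the state dict, which is why the sequential-feed assumption matters here.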