richardzhuang0412 opened this issue 1 month ago (status: Open)
Thanks for reporting, and hm, yes, this is weird. I can reproduce it:
litgpt pretrain \
--model_name pythia-14m \
--tokenizer_dir checkpoints/EleutherAI/pythia-14m \
--out_dir my_test_dir \
--data TextFiles \
--data.train_data_path custom_pretraining_data \
--train.max_tokens 10_000
...
Seed set to 42
Time to instantiate model: 0.13 seconds.
Total parameters: 14,067,712
Validating ...
Measured TFLOPs: 0.10
Saving checkpoint to '/teamspace/studios/this_studio/my_test_dir/final/lit_model.pth'
Training time: 24.14s
Memory used: 1.44 GB
litgpt pretrain \
--model_name pythia-14m \
--tokenizer_dir checkpoints/EleutherAI/pythia-14m \
--out_dir my_test_dir_2 \
--data TextFiles \
--data.train_data_path custom_pretraining_data \
--train.max_tokens 10_000 \
--initial_checkpoint_dir /teamspace/studios/this_studio/my_test_dir/final/
RuntimeError: Error(s) in loading state_dict for GPT:
Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", "transformer.h.0.norm_1.weight", "transformer.h.0.norm_1.bias", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.attn.bias", "transformer.h.0.attn.proj.weight", "transformer.h.0.attn.proj.bias", "transformer.h.0.norm_2.weight", "transformer.h.0.norm_2.bias", "transformer.h.0.mlp.fc.weight",
...
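For reference, a "Missing key(s) in state_dict" error of this shape usually means the keys in the saved file don't line up with what `GPT.load_state_dict` expects — for example, the weights being nested under a top-level `"model"` entry, or carrying an extra prefix such as `_orig_mod.` from `torch.compile`. Both causes are assumptions here, not confirmed details of what litgpt saves; this plain-dict sketch just illustrates the kind of unwrapping/stripping that resolves such a mismatch:

```python
# Hedged diagnostic sketch: the key layout below ("model" nesting, an
# "_orig_mod." prefix) is hypothetical, not a confirmed litgpt format.
def unwrap_checkpoint(ckpt: dict, prefix: str = "") -> dict:
    """Return the flat parameter dict a model expects.

    If the checkpoint nests weights under a "model" key, unwrap them;
    if the keys carry an extra prefix, strip it.
    """
    state = ckpt.get("model", ckpt)  # unwrap if nested
    if prefix:
        state = {
            (k[len(prefix):] if k.startswith(prefix) else k): v
            for k, v in state.items()
        }
    return state

# Toy example with a nested, prefixed checkpoint:
ckpt = {"model": {"_orig_mod.lm_head.weight": 1,
                  "_orig_mod.transformer.wte.weight": 2}}
state = unwrap_checkpoint(ckpt, prefix="_orig_mod.")
print(sorted(state))  # -> ['lm_head.weight', 'transformer.wte.weight']
```

Inspecting `sorted(torch.load(path).keys())` on the real `lit_model.pth` (and comparing against `model.state_dict().keys()`) would confirm which of these cases, if any, applies here.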
ls /teamspace/studios/this_studio/my_test_dir/final
config.json generation_config.json hyperparameters.yaml lit_model.pth model_config.yaml tokenizer.json tokenizer_config.json
This did work a few months ago when I tested it for the tutorials, and I don't have a good explanation at the moment for why it fails now. Either I am doing something incorrect above, or something has recently changed that's causing this. I will have to think more about it ...
Have you seen this before @awaelchli or @carmocca ?
Finetuning seems to work fine for me, though:
litgpt finetune full \
--checkpoint_dir /teamspace/studios/this_studio/my_test_dir/final \
--train.max_seq_length 64 \
--train.max_steps 5
...
Epoch 1 | iter 73 step 4 | loss train: 10.978, val: n/a | iter time: 15.70 ms
Epoch 1 | iter 74 step 4 | loss train: 10.972, val: n/a | iter time: 15.56 ms
Epoch 1 | iter 75 step 4 | loss train: 10.967, val: n/a | iter time: 15.70 ms
Epoch 1 | iter 76 step 4 | loss train: 10.960, val: n/a | iter time: 16.08 ms
Epoch 1 | iter 77 step 4 | loss train: 10.961, val: n/a | iter time: 16.31 ms
Epoch 1 | iter 78 step 4 | loss train: 10.957, val: n/a | iter time: 16.12 ms
Epoch 1 | iter 79 step 4 | loss train: 10.944, val: n/a | iter time: 15.83 ms
Epoch 1 | iter 80 step 5 | loss train: 10.931, val: n/a | iter time: 18.52 ms (step)
Training time: 20.99s
Memory used: 0.31 GB
So I think the generated checkpoint file itself is fine; the problem is more likely in how the pretraining script loads the checkpoint.
On my end, I first continuously pretrained llama-2-7b-hf and then tried to finetune it, but that doesn't work (a similar error to yours above). I didn't perform any initial pretraining, though. Finetuning does work when I take llama-2-7b-hf directly, so I'm not sure the checkpoint-saving functionality in the pretrain script is working properly either.
I was wondering if there is a workaround for this? For example, could I upload the continuously pretrained model to HF and then download it from HF as a model for finetuning? I didn't find any tutorial on how to upload models trained or finetuned in LitGPT to HF, though.
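On the upload question: I'm not aware of a LitGPT-specific upload step, but the checkpoint directory shown earlier (config.json, tokenizer files, lit_model.pth) can be pushed as a folder with the `huggingface_hub` client. This is a sketch, not an official LitGPT workflow, and `lit_model.pth` stays in LitGPT's own format — so after downloading you would pass the directory back to litgpt as `--checkpoint_dir` rather than loading it with `transformers`. The repo id below is a placeholder:

```python
from pathlib import Path


def files_to_upload(checkpoint_dir: str) -> list:
    """Dry run: list the files that would be pushed to the Hub."""
    return sorted(p.name for p in Path(checkpoint_dir).iterdir() if p.is_file())


def push_to_hub(checkpoint_dir: str, repo_id: str) -> None:
    """Push the whole checkpoint folder to an existing model repo.

    Requires `pip install huggingface_hub`, a logged-in token
    (`huggingface-cli login`), and a pre-created repo; `repo_id`
    is a placeholder for your own namespace/name.
    """
    from huggingface_hub import HfApi
    HfApi().upload_folder(
        folder_path=checkpoint_dir,
        repo_id=repo_id,
        repo_type="model",
    )


# Usage (placeholder repo id -- not executed here):
# push_to_hub("my_test_dir/final", "your-username/pythia-14m-continued")
```

Whether this round-trips cleanly through HF is exactly the open question in this thread, since the checkpoint also fails to load locally via `--initial_checkpoint_dir`.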
I tried to perform finetuning with a custom dataset on a model I had continuously pretrained on another custom dataset, but the error quoted above occurs.
Is there any way to streamline these two procedures?