richardzhuang0412 opened this issue 1 month ago (status: Open)
Thanks for reporting, and hm, yes, this is weird. I can reproduce it:
litgpt pretrain \
--model_name pythia-14m \
--tokenizer_dir checkpoints/EleutherAI/pythia-14m \
--out_dir my_test_dir \
--data TextFiles \
--data.train_data_path custom_pretraining_data \
--train.max_tokens 10_000
...
Seed set to 42
Time to instantiate model: 0.13 seconds.
Total parameters: 14,067,712
Validating ...
Measured TFLOPs: 0.10
Saving checkpoint to '/teamspace/studios/this_studio/my_test_dir/final/lit_model.pth'
Training time: 24.14s
Memory used: 1.44 GB
litgpt pretrain \
--model_name pythia-14m \
--tokenizer_dir checkpoints/EleutherAI/pythia-14m \
--out_dir my_test_dir_2 \
--data TextFiles \
--data.train_data_path custom_pretraining_data \
--train.max_tokens 10_000 \
--initial_checkpoint_dir /teamspace/studios/this_studio/my_test_dir/final/
RuntimeError: Error(s) in loading state_dict for GPT:
Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", "transformer.h.0.norm_1.weight", "transformer.h.0.norm_1.bias", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.attn.bias", "transformer.h.0.attn.proj.weight", "transformer.h.0.attn.proj.bias", "transformer.h.0.norm_2.weight", "transformer.h.0.norm_2.bias", "transformer.h.0.mlp.fc.weight",
...
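For reference, a "Missing key(s) in state_dict" error of this shape usually means the keys in the saved file don't line up with what `GPT.load_state_dict` expects — for example, the weights being nested under a top-level `"model"` entry, or carrying an extra prefix such as `_orig_mod.` from `torch.compile`. Both causes are assumptions here, not confirmed details of what litgpt saves; this plain-dict sketch just illustrates the kind of unwrapping/stripping that resolves such a mismatch:

```python
# Hedged diagnostic sketch: the key layout below ("model" nesting, an
# "_orig_mod." prefix) is hypothetical, not a confirmed litgpt format.
def unwrap_checkpoint(ckpt: dict, prefix: str = "") -> dict:
    """Return the flat parameter dict a model expects.

    If the checkpoint nests weights under a "model" key, unwrap them;
    if the keys carry an extra prefix, strip it.
    """
    state = ckpt.get("model", ckpt)  # unwrap if nested
    if prefix:
        state = {
            (k[len(prefix):] if k.startswith(prefix) else k): v
            for k, v in state.items()
        }
    return state

# Toy example with a nested, prefixed checkpoint:
ckpt = {"model": {"_orig_mod.lm_head.weight": 1,
                  "_orig_mod.transformer.wte.weight": 2}}
state = unwrap_checkpoint(ckpt, prefix="_orig_mod.")
print(sorted(state))  # -> ['lm_head.weight', 'transformer.wte.weight']
```

Inspecting `sorted(torch.load(path).keys())` on the real `lit_model.pth` (and comparing against `model.state_dict().keys()`) would confirm which of these cases, if any, applies here.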
ls /teamspace/studios/this_studio/my_test_dir/final
config.json generation_config.json hyperparameters.yaml lit_model.pth model_config.yaml tokenizer.json tokenizer_config.json
This did work a few months ago when I tested it for the tutorials, and I don't have a good explanation at the moment for why it fails now. Either I am doing something incorrect above, or something has recently changed that's causing this. I will have to think more about it ...
Have you seen this before @awaelchli or @carmocca ?
Finetuning seems to work fine for me, though:
litgpt finetune full \
--checkpoint_dir /teamspace/studios/this_studio/my_test_dir/final \
--train.max_seq_length 64 \
--train.max_steps 5
...
Epoch 1 | iter 73 step 4 | loss train: 10.978, val: n/a | iter time: 15.70 ms
Epoch 1 | iter 74 step 4 | loss train: 10.972, val: n/a | iter time: 15.56 ms
Epoch 1 | iter 75 step 4 | loss train: 10.967, val: n/a | iter time: 15.70 ms
Epoch 1 | iter 76 step 4 | loss train: 10.960, val: n/a | iter time: 16.08 ms
Epoch 1 | iter 77 step 4 | loss train: 10.961, val: n/a | iter time: 16.31 ms
Epoch 1 | iter 78 step 4 | loss train: 10.957, val: n/a | iter time: 16.12 ms
Epoch 1 | iter 79 step 4 | loss train: 10.944, val: n/a | iter time: 15.83 ms
Epoch 1 | iter 80 step 5 | loss train: 10.931, val: n/a | iter time: 18.52 ms (step)
Training time: 20.99s
Memory used: 0.31 GB
So I think the generated checkpoint file itself is fine; the problem is more likely in how the pretraining script loads the checkpoint.
On my end, I first continuously pretrained llama-2-7b-hf and then tried to finetune it, but that doesn't work (a similar error to yours above). I didn't perform any initial pretraining, though. Finetuning does work when I take llama-2-7b-hf directly, so I'm not sure the checkpoint-saving functionality in the pretrain script is working properly either.
I was wondering if there is a workaround for this? For example, could I upload the continuously pretrained model to HF and then download it from HF as a model for finetuning? I didn't find any tutorial on how to upload models trained or finetuned in LitGPT to HF, though.
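On the upload question: I'm not aware of a LitGPT-specific upload step, but the checkpoint directory shown earlier (config.json, tokenizer files, lit_model.pth) can be pushed as a folder with the `huggingface_hub` client. This is a sketch, not an official LitGPT workflow, and `lit_model.pth` stays in LitGPT's own format — so after downloading you would pass the directory back to litgpt as `--checkpoint_dir` rather than loading it with `transformers`. The repo id below is a placeholder:

```python
from pathlib import Path


def files_to_upload(checkpoint_dir: str) -> list:
    """Dry run: list the files that would be pushed to the Hub."""
    return sorted(p.name for p in Path(checkpoint_dir).iterdir() if p.is_file())


def push_to_hub(checkpoint_dir: str, repo_id: str) -> None:
    """Push the whole checkpoint folder to an existing model repo.

    Requires `pip install huggingface_hub`, a logged-in token
    (`huggingface-cli login`), and a pre-created repo; `repo_id`
    is a placeholder for your own namespace/name.
    """
    from huggingface_hub import HfApi
    HfApi().upload_folder(
        folder_path=checkpoint_dir,
        repo_id=repo_id,
        repo_type="model",
    )


# Usage (placeholder repo id -- not executed here):
# push_to_hub("my_test_dir/final", "your-username/pythia-14m-continued")
```

Whether this round-trips cleanly through HF is exactly the open question in this thread, since the checkpoint also fails to load locally via `--initial_checkpoint_dir`.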
I tried to perform finetuning with a custom dataset on a model I had continuously pretrained on another custom dataset, but the error quoted above occurs.
Is there any way to streamline these two procedures?