Open wodelt opened 5 days ago
I've faced similar issues; usually I convert my models to HF format for some other parts of my pipeline, and converting back from HF to LitGPT resolves this error.
Alternatively, https://github.com/Lightning-AI/litgpt/blob/main/litgpt/scripts/convert_pretrained_checkpoint.py
also seems to be meant for this purpose. Perhaps you can try that while the maintainers reply with a more concrete solution!
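For what it's worth, that script is meant to strip the extra training state out of a pretraining run so the result can be used as a plain initial checkpoint again. A rough sketch of how it could look (the argument names and the out/custom-model-converted directory are assumptions on my side, and the checkpoint path matches the reproduction below; double-check against the script and litgpt --help):
litgpt convert_pretrained_checkpoint \
--checkpoint_dir out/custom-model/final \
--output_dir out/custom-model-converted
litgpt pretrain pythia-160m \
--initial_checkpoint_dir out/custom-model-converted \
--tokenizer_dir EleutherAI/pythia-160m \
--data TextFiles \
--data.train_data_path "custom_texts/" \
--out_dir new_checkpoint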
Download some data
mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt
Download tokenizer
litgpt download EleutherAI/pythia-160m \
--tokenizer_only True
Pretrain model
litgpt pretrain EleutherAI/pythia-160m \
--tokenizer_dir EleutherAI/pythia-160m \
--data TextFiles \
--data.train_data_path "custom_texts/" \
--train.max_tokens 1_000_000 \
--out_dir out/custom-model
Continue pretraining the model
litgpt pretrain pythia-160m \
--initial_checkpoint_dir out/custom-model/final \
--tokenizer_dir EleutherAI/pythia-160m \
--out_dir new_checkpoint \
--data TextFiles \
--data.train_data_path "custom_texts/"
results in
RuntimeError: Error(s) in loading state_dict for GPT:
Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", ..., "transformer.h.5.mlp.proj.bias", "transformer.ln_f.weight", "transformer.ln_f.bias".
Unexpected key(s) in state_dict: "model", "optimizer", "train_dataloader", "iter_num", "step_count".
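The unexpected keys suggest the saved pretraining checkpoint holds a full training state rather than bare model weights. A quick way to confirm this (the checkpoint path is an assumption based on the out_dir used above, and weights_only=False is needed because the file contains more than plain tensors):
python -c "import torch; print(list(torch.load('out/custom-model/final/lit_model.pth', map_location='cpu', weights_only=False).keys()))"
This should print something like ['model', 'optimizer', 'train_dataloader', 'iter_num', 'step_count'], i.e. exactly the keys reported as unexpected above.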
The specific issue is that pretraining saves things like the iter_num etc. in the checkpoint. So, if you are continuing pretraining from an existing pretraining checkpoint (which is a bit different from a pretrained checkpoint downloaded from the hub), you need to provide the --resume option:
litgpt pretrain pythia-160m \
--resume "auto" \
--tokenizer_dir EleutherAI/pythia-160m \
--out_dir out/custom-model-2 \
--data TextFiles \
--data.train_data_path "custom_texts/"
There may be other ways to do it with a conversion like mentioned above.
Thanks @rasbt, if someone is continuing with a different dataset etc. (like OP), would --resume still work? Wouldn't it try to load the train_dataloader state, number of steps etc. from the previous dataset/run and cause some issues?
I've tried it and it works with --resume "auto". The previous train_dataloader step count is inherited and continues to be counted. But here is something you need to know:
1. If you want to continue pretraining on a different dataset, you need to set --resume "auto" and make sure your out_dir doesn't change.
2. If you want to change out_dir, then --resume "auto" can't load your previous checkpoint because the new out_dir doesn't contain any checkpoint, and if you set --resume '/llama_tinystory2_en/step-00050000' manually, it will cause issues:
[rank1]: ValueError: The path '/llama_tinystory2_en/step-00050000' does not point to a valid checkpoint. Make sure the path points to either a directory with FSDP checkpoint shards, or a single file with a full checkpoint.
Instead, you need to set the lit_model.pth path (/llama_tinystory2_en/step-00050000/lit_model.pth), so that you can achieve the same effect as in point 1. Both cases are sketched below.
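Roughly, the two working variants would look like the following (the commands mirror the pythia-160m examples above; the out_dir and new data path are illustrative placeholders rather than values from the actual runs):
Continue on a new dataset, keeping the same out_dir
litgpt pretrain pythia-160m \
--resume "auto" \
--tokenizer_dir EleutherAI/pythia-160m \
--out_dir out/custom-model \
--data TextFiles \
--data.train_data_path "new_texts/"
Continue into a new out_dir, pointing --resume at the checkpoint file itself
litgpt pretrain pythia-160m \
--resume "/llama_tinystory2_en/step-00050000/lit_model.pth" \
--tokenizer_dir EleutherAI/pythia-160m \
--out_dir out/new-run \
--data TextFiles \
--data.train_data_path "new_texts/"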
Thanks for trying @wodelt. I am still a bit concerned from my read of the code: in Line 216, a train_dataloader is initialized from the new paths in the config. However, Line 233 then loads something from the checkpoint into state -> train_dataloader. As you have seen, the iteration number is definitely loaded from the older data loader. What I am unsure about is whether the old paths from the old dataset are also loaded into the "new" train_dataloader, effectively nulling out Line 216.
I'm hoping @rasbt has better insight into this saving/loading.
I want to continue pretraining my custom model on another dataset, so I only changed initial_checkpoint_dir in training.yaml to the latest run's checkpoint dir path, but it seems the model can't be loaded correctly:
[rank0]: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for FullyShardedDataParallel:
[rank0]: Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", "transformer.h.0.norm_1.weight", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.proj.weight", "transformer.h.0.norm_2.weight", "transformer.h.0.mlp.fc_1.weight", "transformer.h.0.mlp.fc_2.weight", "transformer.h.0.mlp.proj.weight", "transformer.h.1.norm_1.weight", "transformer.h.1.attn.attn.weight", "transformer.h.1.attn.proj.weight", "transformer.h.1.norm_2.weight", "transformer.h.1.mlp.fc_1.weight", "transformer.h.1.mlp.fc_2.weight", "transformer.h.1.mlp.proj.weight", "transformer.h.2.norm_1.weight", "transformer.h.2.attn.attn.weight", "transformer.h.2.attn.proj.weight", "transformer.h.2.norm_2.weight", "transformer.h.2.mlp.fc_1.weight", "transformer.h.2.mlp.fc_2.weight", "transformer.h.2.mlp.proj.weight", "transformer.h.3.norm_1.weight", "transformer.h.3.attn.attn.weight", "transformer.h.3.attn.proj.weight", "transformer.h.3.norm_2.weight", "transformer.h.3.mlp.fc_1.weight", "transformer.h.3.mlp.fc_2.weight", "transformer.h.3.mlp.proj.weight", "transformer.h.4.norm_1.weight", "transformer.h.4.attn.attn.weight", "transformer.h.4.attn.proj.weight", "transformer.h.4.norm_2.weight", "transformer.h.4.mlp.fc_1.weight", "transformer.h.4.mlp.fc_2.weight", "transformer.h.4.mlp.proj.weight", "transformer.h.5.norm_1.weight", "transformer.h.5.attn.attn.weight", "transformer.h.5.attn.proj.weight", "transformer.h.5.norm_2.weight", "transformer.h.5.mlp.fc_1.weight", "transformer.h.5.mlp.fc_2.weight", "transformer.h.5.mlp.proj.weight", "transformer.ln_f.weight".
[rank0]: Unexpected key(s) in state_dict: "model", "optimizer", "train_dataloader", "iter_num", "step_count".
I don't understand the error because I didn't change the model_config.