Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

use initial_checkpoint_dir for continue-pretraining but can't load model correctly #1729

Open wodelt opened 5 days ago

wodelt commented 5 days ago

I want to continue pretraining my custom model on another dataset, so I only changed initial_checkpoint_dir in training.yaml to the latest run's checkpoint directory path, but it seems the model can't be loaded correctly:

[rank0]: raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
[rank0]: RuntimeError: Error(s) in loading state_dict for FullyShardedDataParallel:
[rank0]: Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight",
"transformer.h.0.norm_1.weight", "transformer.h.0.attn.attn.weight", "transformer.h.0.attn.proj.weight", "transformer.h.0.norm_2.weight", "transformer.h.0.mlp.fc_1.weight", "transformer.h.0.mlp.fc_2.weight", "transformer.h.0.mlp.proj.weight",
"transformer.h.1.norm_1.weight", "transformer.h.1.attn.attn.weight", "transformer.h.1.attn.proj.weight", "transformer.h.1.norm_2.weight", "transformer.h.1.mlp.fc_1.weight", "transformer.h.1.mlp.fc_2.weight", "transformer.h.1.mlp.proj.weight",
"transformer.h.2.norm_1.weight", "transformer.h.2.attn.attn.weight", "transformer.h.2.attn.proj.weight", "transformer.h.2.norm_2.weight", "transformer.h.2.mlp.fc_1.weight", "transformer.h.2.mlp.fc_2.weight", "transformer.h.2.mlp.proj.weight",
"transformer.h.3.norm_1.weight", "transformer.h.3.attn.attn.weight", "transformer.h.3.attn.proj.weight", "transformer.h.3.norm_2.weight", "transformer.h.3.mlp.fc_1.weight", "transformer.h.3.mlp.fc_2.weight", "transformer.h.3.mlp.proj.weight",
"transformer.h.4.norm_1.weight", "transformer.h.4.attn.attn.weight", "transformer.h.4.attn.proj.weight", "transformer.h.4.norm_2.weight", "transformer.h.4.mlp.fc_1.weight", "transformer.h.4.mlp.fc_2.weight", "transformer.h.4.mlp.proj.weight",
"transformer.h.5.norm_1.weight", "transformer.h.5.attn.attn.weight", "transformer.h.5.attn.proj.weight", "transformer.h.5.norm_2.weight", "transformer.h.5.mlp.fc_1.weight", "transformer.h.5.mlp.fc_2.weight", "transformer.h.5.mlp.proj.weight",
"transformer.ln_f.weight".
[rank0]: Unexpected key(s) in state_dict: "model", "optimizer", "train_dataloader", "iter_num", "step_count".

I don't understand the error, because I didn't change the model_config.
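
For context: the "Unexpected key(s)" in the trace suggest the file being loaded is a full training-state checkpoint (model weights plus optimizer, dataloader state, and step counters) rather than a bare model state_dict. A quick way to confirm this, assuming the run wrote a single-file checkpoint (the path below is illustrative):

import torch

# Inspect the checkpoint written by `litgpt pretrain` without building the model.
ckpt = torch.load("out/custom-model/final/lit_model.pth", map_location="cpu")

# A training-state checkpoint shows top-level keys like "model", "optimizer",
# "train_dataloader", "iter_num", "step_count"; a plain weights file instead
# shows module keys such as "lm_head.weight" and "transformer.wte.weight".
print(list(ckpt.keys()))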

fdalvi commented 4 days ago

I've faced similar issues. I usually convert my models to HF format for other parts of my pipeline, and converting back from HF to LitGPT resolves this error.

Alternatively, https://github.com/Lightning-AI/litgpt/blob/main/litgpt/scripts/convert_pretrained_checkpoint.py also seems to be meant for this purpose. Perhaps you can try that while the maintainers reply with a more concrete solution!
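
For concreteness, a minimal sketch of calling that conversion script from Python, assuming the helper function keeps the name and (checkpoint_dir, output_dir) signature used in that file; paths are illustrative, and recent versions also expose it as the `litgpt convert_pretrained_checkpoint` CLI command:

from pathlib import Path

from litgpt.scripts.convert_pretrained_checkpoint import convert_pretrained_checkpoint

# Strips the training state ("optimizer", "train_dataloader", "iter_num", ...)
# and writes a weights-only checkpoint that initial_checkpoint_dir can load.
convert_pretrained_checkpoint(
    checkpoint_dir=Path("out/custom-model/final"),
    output_dir=Path("out/custom-model/converted"),
)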

rasbt commented 3 days ago

Download some data

mkdir -p custom_texts
curl https://www.gutenberg.org/cache/epub/24440/pg24440.txt --output custom_texts/book1.txt
curl https://www.gutenberg.org/cache/epub/26393/pg26393.txt --output custom_texts/book2.txt

Download tokenizer

litgpt download EleutherAI/pythia-160m \
  --tokenizer_only True

Pretrain model

litgpt pretrain EleutherAI/pythia-160m \
  --tokenizer_dir EleutherAI/pythia-160m \
  --data TextFiles \
  --data.train_data_path "custom_texts/" \
  --train.max_tokens 1_000_000 \
  --out_dir out/custom-model

Continue pretraining the model

litgpt pretrain pythia-160m \
   --initial_checkpoint_dir out/custom-model/final \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir new_checkpoint \
   --data TextFiles \
   --data.train_data_path "custom_texts/"

results in

RuntimeError: Error(s) in loading state_dict for GPT:
        Missing key(s) in state_dict: "lm_head.weight", "transformer.wte.weight", ..., "transformer.h.5.mlp.proj.bias", "transformer.ln_f.weight", "transformer.ln_f.bias".
        Unexpected key(s) in state_dict: "model", "optimizer", "train_dataloader", "iter_num", "step_count".

The specific issue is that litgpt pretrain saves additional training state, such as iter_num, alongside the model weights. So, if you are continuing pretraining from an existing pretraining-run checkpoint (which is a bit different from a pretrained checkpoint downloaded from the hub), you need to provide the --resume option:

litgpt pretrain pythia-160m \
   --resume "auto" \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir out/custom-model-2 \
   --data TextFiles \
   --data.train_data_path "custom_texts/"

There may be other ways to do it with a conversion, as mentioned above.

fdalvi commented 2 days ago

Thanks @rasbt! If someone is continuing with a different dataset, etc. (like OP), would --resume still work? Wouldn't it try to load the train_dataloader state, number of steps, etc. from the previous dataset/run and cause issues?

wodelt commented 1 day ago

> Thanks @rasbt! If someone is continuing with a different dataset, etc. (like OP), would --resume still work? Wouldn't it try to load the train_dataloader state, number of steps, etc. from the previous dataset/run and cause issues?

I've tried it, and it works with --resume "auto". The previous train_dataloader step count is inherited and continues counting. But here are a couple of things to know:

1. If you want to continue pretraining on a different dataset, you need to set --resume "auto" and make sure your out_dir doesn't change.

2. If you want to change out_dir, --resume "auto" can't load your previous checkpoint, because the new out_dir doesn't contain any checkpoints. And if you set --resume '/llama_tinystory2_en/step-00050000' manually, it causes:

[rank1]: ValueError: The path '/llama_tinystory2_en/step-00050000' does not point to a valid checkpoint. Make sure the path points to either a directory with FSDP checkpoint shards, or a single file with a full checkpoint.

You need to point --resume at the lit_model.pth path instead (/llama_tinystory2_en/step-00050000/lit_model.pth), which achieves the same effect as in point 1 (see the command sketch below).
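
For concreteness, the working variant of the command from point 2 would look something like this (the checkpoint path is taken from the error above; out_dir is a placeholder, and the remaining flags mirror the earlier examples):

litgpt pretrain pythia-160m \
   --resume "/llama_tinystory2_en/step-00050000/lit_model.pth" \
   --tokenizer_dir EleutherAI/pythia-160m \
   --out_dir new_out_dir \
   --data TextFiles \
   --data.train_data_path "custom_texts/"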

fdalvi commented 1 day ago

Thanks for trying it, @wodelt. I am still a bit concerned based on my reading of the code:

https://github.com/Lightning-AI/litgpt/blob/ef886a791ca000e2e7beac10686fd10074d6603d/litgpt/pretrain.py#L216-L233

On line 216, a train_dataloader is initialized from the new paths in the config. However, line 233 then loads something from the checkpoint into state -> train_dataloader. As you have seen, the iteration number is definitely carried over from the old run. What I am unsure about is whether the old paths from the old dataset are also loaded into the "new" train_dataloader, effectively nulling out line 216.
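
To make the concern concrete, the flow in question looks roughly like this (a simplified sketch of litgpt/pretrain.py, not a verbatim copy; create_dataloader is a stand-in name, and fabric, model, and optimizer are assumed to be set up earlier in the script):

# The dataloader is built from the NEW --data config (line 216).
train_dataloader = create_dataloader(new_data_config)

# All resumable objects are gathered into a single state dict.
state = {
    "model": model,
    "optimizer": optimizer,
    "train_dataloader": train_dataloader,
    "iter_num": 0,
    "step_count": 0,
}

if resume:
    # fabric.load restores the checkpoint contents INTO the objects in `state`
    # (line 233): it calls load_state_dict on the new train_dataloader, which
    # is exactly the open question -- does that pull in the old dataset's state?
    fabric.load(resume, state)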

I'm hoping @rasbt has better insight into this saving/loading.