facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

RuntimeError: Error(s) in loading state_dict for TransformerModel #4740

Open muhammed-saeed opened 2 years ago

muhammed-saeed commented 2 years ago

🐛 Bug

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

1- Hi, whenever I download the entire checkpoint folder from the virtual machine server to my personal computer, or even transfer a checkpoint from one VM to another, I encounter the following error:

RuntimeError: Error(s) in loading state_dict for TransformerModel: Unexpected key(s) in state_dict: "encoder.layers.0.in_proj_weight", "encoder.layers.0.in_proj_bias", "encoder.layers.1.in_proj_weight", "encoder.layers.1.in_proj_bias", "encoder.layers.2.in_proj_weight", "encoder.layers.2.in_proj_bias", "encoder.layers.3.in_proj_weight", "encoder.layers.3.in_proj_bias", "encoder.layers.4.in_proj_weight", "encoder.layers.4.in_proj_bias", "encoder.layers.5.in_proj_weight", "encoder.layers.5.in_proj_bias".

I have trained different models and encountered this issue repeatedly: whenever I transfer a checkpoint from one VM to another, the checkpoint doesn't load on the new VM.

Interestingly, the checkpoint works without any issues on the machine it was trained on, but not on any other device (another VM, a local computer, or even GCP). Do you have any suggestions?

gwenzek commented 2 years ago

This is probably due to a different version of fairseq being used during training and loading. You probably have an older version installed locally.

Please check the two fairseq versions. If they are the same, please share a specific fairseq-train command with all the parameters so we can reproduce.
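
A quick way to confirm the versions match (a minimal check added here for illustration, not part of the original comment) is to print them from Python on both machines:

import fairseq
import torch

# Run this on both the training machine and the loading machine;
# the two outputs should be identical.
print("fairseq:", fairseq.__version__)
print("torch:", torch.__version__)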

muhammed-saeed commented 2 years ago

Thanks for your response. I have checked the fairseq version: __version__ = "0.12.2" on both machines. I have also checked the PyTorch version, and it is 1.12.1+cu102 on both machines as well. Currently I am loading the trained model with fairseq; here is the relevant part of the script:

from fairseq.models.transformer import TransformerModel

pd2en_large_model_path = "/home/mohammed_yahia3/models/"
pd2en_small_model_path = "/home/CE/musaeed/kd-distiller/checkpoints/"
pd2en_small_model_preprocess = "/home/CE/musaeed/FAKE_pd_en.tokenized.pd-en"
pd2en_large_model_preprocess = "/home/mohammed_yahia3/models/BT_pd_en.tokenized.pd-en"

# Load the large pd->en model from its checkpoint directory.
pd2en = TransformerModel.from_pretrained(
    pd2en_large_model_path,
    checkpoint_file="checkpoint_last.pt",
    data_name_or_path=pd2en_large_model_preprocess,
    bpe="sentencepiece",
    sentencepiece_model="/home/mohammed_yahia3/models/pd__vocab_4000.model",
)
dsj96 commented 1 year ago

Hello! I have the same problem.

Environment:

torch                        1.9.1+cu111
Python 3.8.8
Linux: Ubuntu 20.04.5 LTS
fairseq installed from source via pip (pip install --editable ./)

(base) root@x517b-task0: /fairseq# git branch -r
  origin/0.12.2-release
  origin/0.12.3-release
  origin/HEAD -> origin/main
  origin/adaptor_pad_fix
  origin/adding_womenbios

run.sh

CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-train \
    data-bin/wmt16_en_ro_first_hf \
    --ddp-backend=legacy_ddp \
    --arch transformer_wmt_en_de --share-decoder-input-output-embed  -s 'en' -t 'ro' \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 --max-epoch 2 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --log-file ./log/wmt16_en_ro_first_hf_log.txt \
    --scoring sacrebleu \
    --max-tokens 8192 \
    --save-dir checkpoints/wmt16_en_ro_first_hf/transformer_wmt_en_de \
    --no-epoch-checkpoints \
    --memory-efficient-fp16 \
    --distributed-world-size 4 \
    --nprocs-per-node 4

# evaluate
CUDA_VISIBLE_DEVICES=0,1,2,3 fairseq-generate data-bin/wmt16_en_ro_first_hf \
    --path checkpoints/wmt16_en_ro_first_hf/transformer_wmt_en_de/checkpoint_last.pt \
    --batch-size 128 --beam 5 --remove-bpe

I use the command above to obtain checkpoint_last.pt; however, when I load the checkpoint, the following problem appears:

from fairseq.models import BaseFairseqModel

checkpoint = BaseFairseqModel.from_pretrained(
    'checkpoints/wmt16_en_ro_first_hf/transformer_wmt_en_de',
    checkpoint_file="checkpoint_last.pt",
)

error info:

*** RuntimeError: Error(s) in loading state_dict for TransformerModel:
        Unexpected key(s) in state_dict: "encoder.layers.0.in_proj_weight", "encoder.layers.0.in_proj_bias", "encoder.layers.0.out_proj_weight", "encoder.layers.0.out_proj_bias", "encoder.layers.0.fc1_weight", "encoder.layers.0.fc1_bias", "encoder.layers.0.fc2_weight", "encoder.layers.0.fc2_bias", "encoder.layers.1.in_proj_weight", "encoder.layers.1.in_proj_bias", "encoder.layers.1.out_proj_weight", "encoder.layers.1.out_proj_bias", "encoder.layers.1.fc1_weight", "encoder.layers.1.fc1_bias", "encoder.layers.1.fc2_weight", "encoder.layers.1.fc2_bias", "encoder.layers.2.in_proj_weight", "encoder.layers.2.in_proj_bias", "encoder.layers.2.out_proj_weight", "encoder.layers.2.out_proj_bias", "encoder.layers.2.fc1_weight", "encoder.layers.2.fc1_bias", "encoder.layers.2.fc2_weight", "encoder.layers.2.fc2_bias", "encoder.layers.3.in_proj_weight", "encoder.layers.3.in_proj_bias", "encoder.layers.3.out_proj_weight", "encoder.layers.3.out_proj_bias", "encoder.layers.3.fc1_weight", "encoder.layers.3.fc1_bias", "encoder.layers.3.fc2_weight", "encoder.layers.3.fc2_bias", "encoder.layers.4.in_proj_weight", "encoder.layers.4.in_proj_bias", "encoder.layers.4.out_proj_weight", "encoder.layers.4.out_proj_bias", "encoder.layers.4.fc1_weight", "encoder.layers.4.fc1_bias", "encoder.layers.4.fc2_weight", "encoder.layers.4.fc2_bias", "encoder.layers.5.in_proj_weight", "encoder.layers.5.in_proj_bias", "encoder.layers.5.out_proj_weight", "encoder.layers.5.out_proj_bias", "encoder.layers.5.fc1_weight", "encoder.layers.5.fc1_bias", "encoder.layers.5.fc2_weight", "encoder.layers.5.fc2_bias".

It looks like encoder.layers.0.in_proj_weight should be encoder.layers.0.in_proj.weight. And when I inspect the checkpoint with torch.load():

(Pdb) pt =torch.load("checkpoints/wmt16_en_ro_first_hf/transformer_wmt_en_de/checkpoint_last.pt")
(Pdb) pt['model']["encoder.layers.0.in_proj_weight"]
tensor([[0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        ...,
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]])
(Pdb) pt['model']["encoder.layers.0.in_proj.weight"]
*** KeyError: 'encoder.layers.0.in_proj.weight'

I trained checkpoint_last.pt myself, yet this parameter is all zeros, which is weird.
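
A minimal sketch (added here for illustration, not from the original comment) to check whether every one of these flat-named keys in the checkpoint is all zeros; the checkpoint path is a placeholder:

import torch

ckpt = torch.load("checkpoint_last.pt", map_location="cpu")  # placeholder path
state = ckpt["model"]

# Flat-named suffixes that the model does not expect.
suffixes = ("in_proj_weight", "in_proj_bias", "out_proj_weight", "out_proj_bias",
            "fc1_weight", "fc1_bias", "fc2_weight", "fc2_bias")

for key, tensor in state.items():
    if key.startswith("encoder.layers.") and key.endswith(suffixes):
        print(key, "all-zero" if not tensor.any() else "non-zero")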

Questions I'd like help with:

  1. Why do parameters like encoder.layers.0.in_proj_weight become encoder.layers.0.in_proj.weight?
  2. Why are parameters like encoder.layers.0.in_proj_weight zero after training?

miko8422 commented 1 year ago

Have you solved this problem already? I'm still testing my environment by using an older version of fairseq (0.11.1) and running the demo de-en task. Please let me know if you have solved it.

lancioni commented 7 months ago

I found the same error when training a model on Colab and launching fairseq-generate on my machine. Of course it is pretty absurd that fairseq, which boasts that its models are plain PyTorch objects, cannot load them on another machine. Ironically enough, if I convert the model with CTranslate2 it works just fine. Paradoxically, I can use a fairseq-trained model with CTranslate2 but not with fairseq. Almost unbelievable.

hi-i-m-GTooth commented 4 months ago

I also hit this bug. I printed out the state_dict of the working checkpoint and the non-working checkpoint from fairseq: orig_bart_large_state_dict.txt, after_fairseq_train_state_dict.txt
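
A minimal sketch (my own, with placeholder paths, not the commenter's actual script) of how such key listings can be dumped for a side-by-side diff:

import torch

def dump_keys(ckpt_path, out_path):
    # Write one parameter name per line so two checkpoints can be diffed.
    state = torch.load(ckpt_path, map_location="cpu")["model"]
    with open(out_path, "w") as f:
        for key in sorted(state):
            f.write(key + "\n")

dump_keys("checkpoint_last.pt", "after_fairseq_train_state_dict.txt")  # placeholder paths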

I find that the checkpoint produced by fairseq training has some unexpected keys added:

ignore_keys = []
for l in range(0, 11 + 1):  # 12 encoder layers
    ignore_keys.append(f'encoder.layers.{l}.in_proj_weight')
    ignore_keys.append(f'encoder.layers.{l}.in_proj_bias')
    ignore_keys.append(f'encoder.layers.{l}.out_proj_weight')
    ignore_keys.append(f'encoder.layers.{l}.out_proj_bias')
    ignore_keys.append(f'encoder.layers.{l}.fc1_weight')
    ignore_keys.append(f'encoder.layers.{l}.fc1_bias')
    ignore_keys.append(f'encoder.layers.{l}.fc2_weight')
    ignore_keys.append(f'encoder.layers.{l}.fc2_bias')

I pop them from the state_dict and it works. However, I didn't dig deeper into why this happens in fairseq's training process, so I'm not sure this is fine, but the inference scores are acceptable.
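
For reference, a minimal sketch of that workaround (the paths and the 12-layer count are assumptions, not from the comment above): drop the unexpected keys from the saved checkpoint and write a cleaned copy that from_pretrained can load.

import torch

ckpt_path = "checkpoint_last.pt"  # placeholder; point this at the failing checkpoint

ckpt = torch.load(ckpt_path, map_location="cpu")
state = ckpt["model"]

# Same unexpected keys as in the list above (assuming 12 encoder layers).
suffixes = ("in_proj_weight", "in_proj_bias", "out_proj_weight", "out_proj_bias",
            "fc1_weight", "fc1_bias", "fc2_weight", "fc2_bias")
ignore_keys = {f"encoder.layers.{l}.{s}" for l in range(12) for s in suffixes}

for key in list(state.keys()):
    if key in ignore_keys:
        state.pop(key)

torch.save(ckpt, "checkpoint_last_clean.pt")  # load this cleaned copy instead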