facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Size of matrix mismatch error when using pre-trained model (transformer.wmt19.de-en) #2489

Closed Sohyo closed 4 years ago

Sohyo commented 4 years ago

Hello

I'm trying to fine-tune the provided pre-trained model transformer.wmt19.de-en from the paper (Facebook FAIR's WMT19 News Translation Task Submission). However, I cannot find the correct architecture for this pre-trained model. According to the paper, 'transformer_vaswani_wmt_en_de_big' seems to have been used, but it does not match the pre-trained checkpoint. I also tried all the other architectures that seemed plausible, such as transformer and transformer_wmt_en_de_big, but none of them worked either.

I preprocessed the data using this command:

fairseq-preprocess --source-lang de --target-lang en \
    --trainpref $TEXT/train \
    --validpref $TEXT/valid \
    --testpref $TEXT/test \
    --destdir data-bin/wmt19.tokenized.de-en \
    --workers 20 \
    --joined-dictionary --srcdict ../models/wmt19.de-en.joined-dict.ensemble/dict.de.txt

Afterwards I fine-tune like this:

fairseq-train \
    data-bin/wmt19.tokenized.de-en \
    --restore-file ../models/wmt19.de-en.joined-dict.ensemble/model1.pt \
    --save-dir checkpoints/finetune_wmt_model1 \
    --arch transformer_vaswani_wmt_en_de_big --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --max-sentences 100

Then it gives me this error:

Traceback (most recent call last):
  File "/data/s3475743/myver_fairseq/fairseq/fairseq/trainer.py", line 256, in load_checkpoint
    self.get_model().load_state_dict(
  File "/data/s3475743/myver_fairseq/fairseq/fairseq/models/fairseq_model.py", line 93, in load_state_dict
    return super().load_state_dict(new_state_dict, strict)
  File "/data/s3475743/myver_fairseq/venv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 846, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for TransformerModel:
    size mismatch for encoder.layers.0.fc1.weight: copying a param with shape torch.Size([8192, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
    size mismatch for encoder.layers.0.fc1.bias: copying a param with shape torch.Size([8192]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for encoder.layers.0.fc2.weight: copying a param with shape torch.Size([1024, 8192]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
    size mismatch for encoder.layers.1.fc1.weight: copying a param with shape torch.Size([8192, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
    size mismatch for encoder.layers.1.fc1.bias: copying a param with shape torch.Size([8192]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for encoder.layers.1.fc2.weight: copying a param with shape torch.Size([1024, 8192]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
    size mismatch for encoder.layers.2.fc1.weight: copying a param with shape torch.Size([8192, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
    size mismatch for encoder.layers.2.fc1.bias: copying a param with shape torch.Size([8192]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for encoder.layers.2.fc2.weight: copying a param with shape torch.Size([1024, 8192]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
    size mismatch for encoder.layers.3.fc1.weight: copying a param with shape torch.Size([8192, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
    size mismatch for encoder.layers.3.fc1.bias: copying a param with shape torch.Size([8192]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for encoder.layers.3.fc2.weight: copying a param with shape torch.Size([1024, 8192]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
    size mismatch for encoder.layers.4.fc1.weight: copying a param with shape torch.Size([8192, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
    size mismatch for encoder.layers.4.fc1.bias: copying a param with shape torch.Size([8192]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for encoder.layers.4.fc2.weight: copying a param with shape torch.Size([1024, 8192]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
    size mismatch for encoder.layers.5.fc1.weight: copying a param with shape torch.Size([8192, 1024]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
    size mismatch for encoder.layers.5.fc1.bias: copying a param with shape torch.Size([8192]) from checkpoint, the shape in current model is torch.Size([4096]).
    size mismatch for encoder.layers.5.fc2.weight: copying a param with shape torch.Size([1024, 8192]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/s3475743/myver_fairseq/venv/bin/fairseq-train", line 33, in <module>
    sys.exit(load_entry_point('fairseq', 'console_scripts', 'fairseq-train')())
  File "/data/s3475743/myver_fairseq/fairseq/fairseq_cli/train.py", line 352, in cli_main
    distributed_utils.call_main(args, main)
  File "/data/s3475743/myver_fairseq/fairseq/fairseq/distributed_utils.py", line 189, in call_main
    main(args, **kwargs)
  File "/data/s3475743/myver_fairseq/fairseq/fairseq_cli/train.py", line 106, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(args, trainer)
  File "/data/s3475743/myver_fairseq/fairseq/fairseq/checkpoint_utils.py", line 134, in load_checkpoint
    extra_state = trainer.load_checkpoint(
  File "/data/s3475743/myver_fairseq/fairseq/fairseq/trainer.py", line 264, in load_checkpoint
    raise Exception(
Exception: Cannot load model parameters from checkpoint ../models/wmt19.de-en.joined-dict.ensemble/model1.pt; please ensure that the architectures match.

Maybe the architecture was changed after the pre-trained model was saved, or am I just doing something plainly wrong? I hope you can help me figure out how to load the pre-trained models for fine-tuning, because I have more or less run out of ideas about what is going wrong.

myleott commented 4 years ago

You can determine the arguments/architecture by loading the model checkpoint and checking the 'args' attribute:

>>> import torch
>>> model = torch.load('wmt19.en-de.joined-dict.ensemble/model1.pt')
>>> model['args']
Namespace(adam_betas='(0.9, 0.98)', adam_eps=1e-08, adaptive_input=False, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, arch='transformer_wmt_en_de_big', attention_dropout=0.1, bucket_cap_mb=25, clip_norm=0.0, cpu=False, criterion='label_smoothed_cross_entropy', data=['/private/home/edunov/wmt19/data/old/ende', '/private/home/edunov/wmt19/data/old/ende', '/private/home/edunov/wmt19/data/finetune/nc'], ddp_backend='c10d', decoder_attention_heads=16, decoder_embed_dim=1024, decoder_embed_path=None, decoder_ffn_embed_dim=4096, decoder_input_dim=1024, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=False, decoder_output_dim=1024, device_id=0, distributed_backend='nccl', distributed_init_method='tcp://localhost:17406', distributed_port=-1, distributed_rank=0, distributed_world_size=2, dropout=0.2, encoder_attention_heads=16, encoder_embed_dim=1024, encoder_embed_path=None, encoder_ffn_embed_dim=8192, encoder_layers=6, encoder_learned_pos=False, encoder_normalize_before=False, extra_data='', fix_batches_to_gpus=False, fp16=True, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, keep_interval_updates=-1, keep_last_epochs=-1, label_smoothing=0.1, lazy_load=False, left_pad_source=True, left_pad_target=False, log_format='simple', log_interval=100, lr=[0.0007], lr_scheduler='inverse_sqrt', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_source_positions=1024, max_target_positions=1024, max_tokens=3584, max_update=201800, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-09, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=True, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='adam', optimizer_overrides='{}', raw_text=False, relu_dropout=0.0, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', save_dir='/checkpoint/edunov/20190403/wmt19en2de.btsample5.ffn8192.transformer_wmt_en_de_big_bsz3584_lr0.0007_dr0.2_size_updates200000_seed20_lbsm0.1_size_sa1_upsample2//finetune1', save_interval=1, save_interval_updates=200, seed=2, sentence_avg=False, share_all_embeddings=True, share_decoder_input_output_embed=True, skip_invalid_size_inputs_valid_test=False, source_lang='en', target_lang='de', task='translation', tensorboard_logdir='', threshold_loss_scale=None, train_subset='train', update_freq=[1], upsample_primary=1, user_dir=None, valid_subset='valid', validate_interval=1, warmup_init_lr=1e-07, warmup_updates=4000, weight_decay=0.0)

While it started with transformer_vaswani_wmt_en_de_big, there are some customizations to other parameters. The main change seems to be --encoder-ffn-embed-dim=8192.
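
For reference, here is a minimal sketch of how the fine-tuning command from this issue could be adjusted so that the freshly built model matches the checkpoint. It keeps the paths and training flags from the original command, switches to --arch transformer_wmt_en_de_big with --encoder-ffn-embed-dim 8192 (the change noted above), and uses --share-all-embeddings to match share_all_embeddings=True in the args; the --reset-* flags are my own assumption, added so that fine-tuning starts with fresh optimizer and meter state rather than the state stored in the released checkpoint. This is an untested suggestion, not a verified recipe:

fairseq-train \
    data-bin/wmt19.tokenized.de-en \
    --restore-file ../models/wmt19.de-en.joined-dict.ensemble/model1.pt \
    --save-dir checkpoints/finetune_wmt_model1 \
    --arch transformer_wmt_en_de_big \
    --encoder-ffn-embed-dim 8192 \
    --share-all-embeddings \
    --reset-optimizer --reset-dataloader --reset-meters \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --max-sentences 100

Other hyperparameters such as dropout (0.2 in the checkpoint args versus 0.3 here) could also be aligned, but they do not affect parameter shapes, so they are not what triggers the size mismatch.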