facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai

Unlikelihood Wizard Of Wikipedia Label Repetition Model: Error while fine tuning #3376

Closed. Aloriosa closed this issue 3 years ago

Aloriosa commented 3 years ago

Hello! I want to fine-tune the unlikelihood label-repetition model on a custom task that is similar to the Wizard of Wikipedia task.

Here is my code:

from parlai.scripts.train_model import TrainModel
from parlai.scripts.multiprocessing_train import MultiProcessTrain

MultiProcessTrain.main(
  task='wizard_of_wikipedia:GeneratorTeacher',
  datatype='train',
  datapath='/home/saraharas/datapath',
  model='projects.dialogue_unlikelihood.agents:RepetitionUnlikelihoodAgent',
  init_model='zoo:dialogue_unlikelihood/rep_wiki_label/model',
  dict_file='zoo:dialogue_unlikelihood/rep_wiki_label/model.dict',
  skip_generation=False,
  batchsize=64,
)

But the MultiProcessTrain script fails with the following error:

<super: <class 'RepetitionUnlikelihoodAgentTrait'>, <RepetitionUnlikelihoodAgent object>>
<super: <class 'RepetitionUnlikelihoodAgentTrait'>, <RepetitionUnlikelihoodAgent object>>
21:51:58 | Distributed group initialized
21:51:58 | building dictionary first...
21:51:58 | your model is being loaded with opts that do not exist in the model you are initializing the weights with: allow_missing_init_opts: False,download_path: None,loglevel: info,dynamic_batching: None,datapath: /home/saraharas/datapath,tensorboard_logdir: None,distributed_world_size: 5,verbose: False,chosen_topic_delimiter: 
,gold_knowledge_delimiter: 
,n_encoder_layers: -1,n_decoder_layers: -1,model_parallel: False,beam_context_block_ngram: -1,beam_block_full_context: True,beam_length_penalty: 0.65,beam_delay: 30,beam_block_list_filename: None,temperature: 1.0,compute_tokenized_bleu: False,interactive_mode: False,fp16_impl: apex,force_fp16_tokens: False,adafactor_eps: (1e-30, 0.001),history_reversed: False,history_add_global_end_token: None,special_tok_lst: None,bpe_vocab: None,bpe_merge: None,bpe_add_prefix_space: None,hf_skip_special_tokens: True,max_lr_steps: -1,invsqrt_lr_decay_gamma: -1,n_image_tokens: 1,n_image_channels: 1,image_fusion_type: late,rank: 0,multiprocessing: True
21:51:58 | your model is being loaded with opts that differ from the model you are initializing the weights with. Add the following args to your run command to change this: 
--show-advanced-args False --image-mode none --numthreads 1 --batchsize 24 --model parlai_internal.projects.unlikelihood.agents:RepetitionUnlikelihoodParlallAgent --eval-batchsize 64 --max-train-time 54000.0 --validation-every-n-secs 3600.0 --save-every-n-secs 3600.0 --validation-every-n-epochs 10000.0 --validation-max-exs 500 --validation-metric ppl_irep4 --validation-metric-mode min --metrics all --numworkers 4 --pytorch-preprocess False --pytorch-teacher-batch-sort False --batch-sort-cache-type pop --batch-length-range 5 --shuffle False --batch-sort-field text --pytorch-context-length -1 --pytorch-include-labels True --prepend-gold-knowledge True --embedding-size 512 --n-layers 8 --ffn-size 2048 --dropout 0.1 --n-heads 16 --learn-positional-embeddings True --n-positions 512 --variant xlm --activation gelu --optimizer adamax --learningrate 7.5e-06 --lr-scheduler-decay 0.9 --text-truncate 512 --label-truncate 128 --gpu -1 --dict-tokenizer bpe --dict-lower True --ctxt-beta 0.0 --train-to-convergence True --parlai-home /private/home/jase/src/ParlAI
21:51:58 | Using CUDA
21:51:58 | loading dictionary from /home/saraharas/datapath/models/dialogue_unlikelihood/rep_wiki_label/model.dict
21:51:58 | num words = 54946
21:52:00 | Total parameters: 20,608,500 (19,994,100 trainable)
21:52:00 | Loading existing model params from /home/saraharas/datapath/models/dialogue_unlikelihood/rep_wiki_label/model
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-3-ef9278cdc383> in <module>
     20     # speeds up validation
     21     skip_generation=False,
---> 22     batchsize=64,#per gpu
     23 )
     24 '''

~/parlaivenv/lib/python3.7/site-packages/parlai/core/script.py in main(cls, *args, **kwargs)
    106             return cls._run_args(args)
    107         elif kwargs:
--> 108             return cls._run_kwargs(kwargs)
    109         else:
    110             return cls._run_args(None)

~/parlaivenv/lib/python3.7/site-packages/parlai/core/script.py in _run_kwargs(cls, kwargs)
     72         parser = cls.setup_args()
     73         opt = parser.parse_kwargs(**kwargs)
---> 74         return cls._run_from_parser_and_opt(opt, parser)
     75 
     76     @classmethod

~/parlaivenv/lib/python3.7/site-packages/parlai/core/script.py in _run_from_parser_and_opt(cls, opt, parser)
     87         script = cls(opt)
     88         script.parser = parser
---> 89         return script.run()
     90 
     91     @classmethod

~/parlaivenv/lib/python3.7/site-packages/parlai/scripts/multiprocessing_train.py in run(self)
     86     def run(self):
     87         port = random.randint(32000, 48000)
---> 88         return launch_and_train(self.opt, port)
     89 
     90 

~/parlaivenv/lib/python3.7/site-packages/parlai/scripts/multiprocessing_train.py in launch_and_train(opt, port)
     60 
     61     try:
---> 62         retval = multiprocess_train(0, opt, port)
     63         spawncontext.join()
     64         return retval

~/parlaivenv/lib/python3.7/site-packages/parlai/scripts/multiprocessing_train.py in multiprocess_train(rank, opt, port, rank_offset, gpu, hostname)
     42         # Run the actual training
     43         opt['multiprocessing'] = True
---> 44         return single_train.TrainLoop(opt).train()
     45 
     46 

~/parlaivenv/lib/python3.7/site-packages/parlai/scripts/train_model.py in __init__(self, opt)
    280 
    281         # Create model and assign it to the specified task
--> 282         self.agent = create_agent(opt)
    283         self.agent.opt.log()
    284         self.world = create_task(opt, self.agent)

~/parlaivenv/lib/python3.7/site-packages/parlai/core/agents.py in create_agent(opt, requireModelExists)
    411         # loaded ones
    412         compare_init_model_opts(opt, opt)
--> 413         model = model_class(opt)
    414         if requireModelExists and hasattr(model, 'load') and not opt.get('model_file'):
    415             # double check that we didn't forget to set model_file on loadable model

~/parlaivenv/lib/python3.7/site-packages/projects/dialogue_unlikelihood/agents.py in __init__(self, opt, shared)
    148 
    149     def __init__(self, opt, shared=None):
--> 150         super().__init__(opt, shared)
    151         self.pred_logsoftmax = torch.nn.LogSoftmax(dim=2)
    152 

~/parlaivenv/lib/python3.7/site-packages/parlai/core/torch_generator_agent.py in __init__(self, opt, shared)
    525                 # load model parameters if available
    526                 logging.info(f'Loading existing model params from {init_model}')
--> 527                 states = self.load(init_model)
    528             else:
    529                 states = {}

~/parlaivenv/lib/python3.7/site-packages/parlai/core/torch_agent.py in load(self, path)
   1867             )
   1868         if 'model' in states:
-> 1869             self.load_state_dict(states['model'])
   1870         if 'optimizer' in states and hasattr(self, 'optimizer'):
   1871             self.optimizer.load_state_dict(states['optimizer'])

~/parlaivenv/lib/python3.7/site-packages/parlai/agents/image_seq2seq/image_seq2seq.py in load_state_dict(self, state_dict)
    201                     or embs.shape[1] != init_embs.shape[1]
    202                 ):
--> 203                     raise e
    204 
    205                 state_dict.update(

~/parlaivenv/lib/python3.7/site-packages/parlai/agents/image_seq2seq/image_seq2seq.py in load_state_dict(self, state_dict)
    189         if self.opt['init_model'] is not None:
    190             try:
--> 191                 self.model.load_state_dict(state_dict)
    192                 return
    193             except RuntimeError as e:

~/parlaivenv/lib/python3.7/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
   1050         if len(error_msgs) > 0:
   1051             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
-> 1052                                self.__class__.__name__, "\n\t".join(error_msgs)))
   1053         return _IncompatibleKeys(missing_keys, unexpected_keys)
   1054 

RuntimeError: Error(s) in loading state_dict for ImageSeq2seqModel:
    Unexpected key(s) in state_dict: "encoder.norm_embeddings.weight", "encoder.norm_embeddings.bias", "encoder.layers.2.attention.q_lin.weight", "encoder.layers.2.attention.q_lin.bias", "encoder.layers.2.attention.k_lin.weight", "encoder.layers.2.attention.k_lin.bias", "encoder.layers.2.attention.v_lin.weight", "encoder.layers.2.attention.v_lin.bias", "encoder.layers.2.attention.out_lin.weight", "encoder.layers.2.attention.out_lin.bias", "encoder.layers.2.norm1.weight", "encoder.layers.2.norm1.bias", "encoder.layers.2.ffn.lin1.weight", "encoder.layers.2.ffn.lin1.bias", "encoder.layers.2.ffn.lin2.weight", "encoder.layers.2.ffn.lin2.bias", "encoder.layers.2.norm2.weight", "encoder.layers.2.norm2.bias", "encoder.layers.3.attention.q_lin.weight", "encoder.layers.3.attention.q_lin.bias", "encoder.layers.3.attention.k_lin.weight", "encoder.layers.3.attention.k_lin.bias", "encoder.layers.3.attention.v_lin.weight", "encoder.layers.3.attention.v_lin.bias", "encoder.layers.3.attention.out_lin.weight", "encoder.layers.3.attention.out_lin.bias", "encoder.layers.3.norm1.weight", "encoder.layers.3.norm1.bias", "encoder.layers.3.ffn.lin1.weight", "encoder.layers.3.ffn.lin1.bias", "encoder.layers.3.ffn.lin2.weight", "encoder.layers.3.ffn.lin2.bias", "encoder.layers.3.norm2.weight", "encoder.layers.3.norm2.bias", "encoder.layers.4.attention.q_lin.weight", "encoder.layers.4.attention.q_lin.bias", "encoder.layers.4.attention.k_lin.weight", "encoder.layers.4.attention.k_lin.bias", "encoder.layers.4.attention.v_lin.weight", "encoder.layers.4.attention.v_lin.bias", "encoder.layers.4.attention.out_lin.weight", "encoder.layers.4.attention.out_lin.bias", "encoder.layers.4.norm1.weight", "encoder.layers.4.norm1.bias", "encoder.layers.4.ffn.lin1.weight", "encoder.layers.4.ffn.lin1.bias", "encoder.layers.4.ffn.lin2.weight", "encoder.layers.4.ffn.lin2.bias", "encoder.layers.4.norm2.weight", "encoder.layers.4.norm2.bias", "encoder.layers.5.attention.q_lin.weight", "encoder.layers.5.attention.q_lin.bias", "encoder.layers.5.attention.k_lin.weight", "encoder.layers.5.attention.k_lin.bias", "encoder.layers.5.attention.v_lin.weight", "encoder.layers.5.attention.v_lin.bias", "encoder.layers.5.attention.out_lin.weight", "encoder.layers.5.attention.out_lin.bias", "encoder.layers.5.norm1.weight", "encoder.layers.5.norm1.bias", "encoder.layers.5.ffn.lin1.weight", "encoder.layers.5.ffn.lin1.bias", "encoder.layers.5.ffn.lin2.weight", "encoder.layers.5.ffn.lin2.bias", "encoder.layers.5.norm2.weight", "encoder.layers.5.norm2.bias", "encoder.layers.6.attention.q_lin.weight", "encoder.layers.6.attention.q_lin.bias", "encoder.layers.6.attention.k_lin.weight", "encoder.layers.6.attention.k_lin.bias", "encoder.layers.6.attention.v_lin.weight", "encoder.layers.6.attention.v_lin.bias", "encoder.layers.6.attention.out_lin.weight", "encoder.layers.6.attention.out_lin.bias", "encoder.layers.6.norm1.weight", "encoder.layers.6.norm1.bias", "encoder.layers.6.ffn.lin1.weight", "encoder.layers.6.ffn.lin1.bias", "encoder.layers.6.ffn.lin2.weight", "encoder.layers.6.ffn.lin2.bias", "encoder.layers.6.norm2.weight", "encoder.layers.6.norm2.bias", "encoder.layers.7.attention.q_lin.weight", "encoder.layers.7.attention.q_lin.bias", "encoder.layers.7.attention.k_lin.weight", "encoder.layers.7.attention.k_lin.bias", "encoder.layers.7.attention.v_lin.weight", "encoder.layers.7.attention.v_lin.bias", "encoder.layers.7.attention.out_lin.weight", "encoder.layers.7.attention.out_lin.bias", "encoder.layers.7.norm1.weight", "encoder.layers.7.norm1.bias", 
"encoder.layers.7.ffn.lin1.weight", "encoder.layers.7.ffn.lin1.bias", "encoder.layers.7.ffn.lin2.weight", "encoder.layers.7.ffn.lin2.bias", "encoder.layers.7.norm2.weight", "encoder.layers.7.norm2.bias", "decoder.norm_embeddings.weight", "decoder.norm_embeddings.bias", "decoder.layers.2.self_attention.q_lin.weight", "decoder.layers.2.self_attention.q_lin.bias", "decoder.layers.2.self_attention.k_lin.weight", "decoder.layers.2.self_attention.k_lin.bias", "decoder.layers.2.self_attention.v_lin.weight", "decoder.layers.2.self_attention.v_lin.bias", "decoder.layers.2.self_attention.out_lin.weight", "decoder.layers.2.self_attention.out_lin.bias", "decoder.layers.2.norm1.weight", "decoder.layers.2.norm1.bias", "decoder.layers.2.encoder_attention.q_lin.weight", "decoder.layers.2.encoder_attention.q_lin.bias", "decoder.layers.2.encoder_attention.k_lin.weight", "decoder.layers.2.encoder_attention.k_lin.bias", "decoder.layers.2.encoder_attention.v_lin.weight", "decoder.layers.2.encoder_attention.v_lin.bias", "decoder.layers.2.encoder_attention.out_lin.weight", "decoder.layers.2.encoder_attention.out_lin.bias", "decoder.layers.2.norm2.weight", "decoder.layers.2.norm2.bias", "decoder.layers.2.ffn.lin1.weight", "decoder.layers.2.ffn.lin1.bias", "decoder.layers.2.ffn.lin2.weight", "decoder.layers.2.ffn.lin2.bias", "decoder.layers.2.norm3.weight", "decoder.layers.2.norm3.bias", "decoder.layers.3.self_attention.q_lin.weight", "decoder.layers.3.self_attention.q_lin.bias", "decoder.layers.3.self_attention.k_lin.weight", "decoder.layers.3.self_attention.k_lin.bias", "decoder.layers.3.self_attention.v_lin.weight", "decoder.layers.3.self_attention.v_lin.bias", "decoder.layers.3.self_attention.out_lin.weight", "decoder.layers.3.self_attention.out_lin.bias", "decoder.layers.3.norm1.weight", "decoder.layers.3.norm1.bias", "decoder.layers.3.encoder_attention.q_lin.weight", "decoder.layers.3.encoder_attention.q_lin.bias", "decoder.layers.3.encoder_attention.k_lin.weight", "decoder.layers.3.encoder_attention.k_lin.bias", "decoder.layers.3.encoder_attention.v_lin.weight", "decoder.layers.3.encoder_attention.v_lin.bias", "decoder.layers.3.encoder_attention.out_lin.weight", "decoder.layers.3.encoder_attention.out_lin.bias", "decoder.layers.3.norm2.weight", "decoder.layers.3.norm2.bias", "decoder.layers.3.ffn.lin1.weight", "decoder.layers.3.ffn.lin1.bias", "decoder.layers.3.ffn.lin2.weight", "decoder.layers.3.ffn.lin2.bias", "decoder.layers.3.norm3.weight", "decoder.layers.3.norm3.bias", "decoder.layers.4.self_attention.q_lin.weight", "decoder.layers.4.self_attention.q_lin.bias", "decoder.layers.4.self_attention.k_lin.weight", "decoder.layers.4.self_attention.k_lin.bias", "decoder.layers.4.self_attention.v_lin.weight", "decoder.layers.4.self_attention.v_lin.bias", "decoder.layers.4.self_attention.out_lin.weight", "decoder.layers.4.self_attention.out_lin.bias", "decoder.layers.4.norm1.weight", "decoder.layers.4.norm1.bias", "decoder.layers.4.encoder_attention.q_lin.weight", "decoder.layers.4.encoder_attention.q_lin.bias", "decoder.layers.4.encoder_attention.k_lin.weight", "decoder.layers.4.encoder_attention.k_lin.bias", "decoder.layers.4.encoder_attention.v_lin.weight", "decoder.layers.4.encoder_attention.v_lin.bias", "decoder.layers.4.encoder_attention.out_lin.weight", "decoder.layers.4.encoder_attention.out_lin.bias", "decoder.layers.4.norm2.weight", "decoder.layers.4.norm2.bias", "decoder.layers.4.ffn.lin1.weight", "decoder.layers.4.ffn.lin1.bias", "decoder.layers.4.ffn.lin2.weight", "decoder.layers.4.ffn.lin2.bias", 
"decoder.layers.4.norm3.weight", "decoder.layers.4.norm3.bias", "decoder.layers.5.self_attention.q_lin.weight", "decoder.layers.5.self_attention.q_lin.bias", "decoder.layers.5.self_attention.k_lin.weight", "decoder.layers.5.self_attention.k_lin.bias", "decoder.layers.5.self_attention.v_lin.weight", "decoder.layers.5.self_attention.v_lin.bias", "decoder.layers.5.self_attention.out_lin.weight", "decoder.layers.5.self_attention.out_lin.bias", "decoder.layers.5.norm1.weight", "decoder.layers.5.norm1.bias", "decoder.layers.5.encoder_attention.q_lin.weight", "decoder.layers.5.encoder_attention.q_lin.bias", "decoder.layers.5.encoder_attention.k_lin.weight", "decoder.layers.5.encoder_attention.k_lin.bias", "decoder.layers.5.encoder_attention.v_lin.weight", "decoder.layers.5.encoder_attention.v_lin.bias", "decoder.layers.5.encoder_attention.out_lin.weight", "decoder.layers.5.encoder_attention.out_lin.bias", "decoder.layers.5.norm2.weight", "decoder.layers.5.norm2.bias", "decoder.layers.5.ffn.lin1.weight", "decoder.layers.5.ffn.lin1.bias", "decoder.layers.5.ffn.lin2.weight", "decoder.layers.5.ffn.lin2.bias", "decoder.layers.5.norm3.weight", "decoder.layers.5.norm3.bias", "decoder.layers.6.self_attention.q_lin.weight", "decoder.layers.6.self_attention.q_lin.bias", "decoder.layers.6.self_attention.k_lin.weight", "decoder.layers.6.self_attention.k_lin.bias", "decoder.layers.6.self_attention.v_lin.weight", "decoder.layers.6.self_attention.v_lin.bias", "decoder.layers.6.self_attention.out_lin.weight", "decoder.layers.6.self_attention.out_lin.bias", "decoder.layers.6.norm1.weight", "decoder.layers.6.norm1.bias", "decoder.layers.6.encoder_attention.q_lin.weight", "decoder.layers.6.encoder_attention.q_lin.bias", "decoder.layers.6.encoder_attention.k_lin.weight", "decoder.layers.6.encoder_attention.k_lin.bias", "decoder.layers.6.encoder_attention.v_lin.weight", "decoder.layers.6.encoder_attention.v_lin.bias", "decoder.layers.6.encoder_attention.out_lin.weight", "decoder.layers.6.encoder_attention.out_lin.bias", "decoder.layers.6.norm2.weight", "decoder.layers.6.norm2.bias", "decoder.layers.6.ffn.lin1.weight", "decoder.layers.6.ffn.lin1.bias", "decoder.layers.6.ffn.lin2.weight", "decoder.layers.6.ffn.lin2.bias", "decoder.layers.6.norm3.weight", "decoder.layers.6.norm3.bias", "decoder.layers.7.self_attention.q_lin.weight", "decoder.layers.7.self_attention.q_lin.bias", "decoder.layers.7.self_attention.k_lin.weight", "decoder.layers.7.self_attention.k_lin.bias", "decoder.layers.7.self_attention.v_lin.weight", "decoder.layers.7.self_attention.v_lin.bias", "decoder.layers.7.self_attention.out_lin.weight", "decoder.layers.7.self_attention.out_lin.bias", "decoder.layers.7.norm1.weight", "decoder.layers.7.norm1.bias", "decoder.layers.7.encoder_attention.q_lin.weight", "decoder.layers.7.encoder_attention.q_lin.bias", "decoder.layers.7.encoder_attention.k_lin.weight", "decoder.layers.7.encoder_attention.k_lin.bias", "decoder.layers.7.encoder_attention.v_lin.weight", "decoder.layers.7.encoder_attention.v_lin.bias", "decoder.layers.7.encoder_attention.out_lin.weight", "decoder.layers.7.encoder_attention.out_lin.bias", "decoder.layers.7.norm2.weight", "decoder.layers.7.norm2.bias", "decoder.layers.7.ffn.lin1.weight", "decoder.layers.7.ffn.lin1.bias", "decoder.layers.7.ffn.lin2.weight", "decoder.layers.7.ffn.lin2.bias", "decoder.layers.7.norm3.weight", "decoder.layers.7.norm3.bias". 
    size mismatch for embeddings.weight: copying a param with shape torch.Size([54946, 512]) from checkpoint, the shape in current model is torch.Size([54946, 300]).
    size mismatch for encoder.embeddings.weight: copying a param with shape torch.Size([54946, 512]) from checkpoint, the shape in current model is torch.Size([54946, 300]).
    size mismatch for encoder.position_embeddings.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([1024, 300]).
    size mismatch for encoder.layers.0.attention.q_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.0.attention.q_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.0.attention.k_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.0.attention.k_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.0.attention.v_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.0.attention.v_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.0.attention.out_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.0.attention.out_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.0.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.0.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.0.ffn.lin1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.0.ffn.lin1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.0.ffn.lin2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.0.ffn.lin2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.0.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.0.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.1.attention.q_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.1.attention.q_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.1.attention.k_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.1.attention.k_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.1.attention.v_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.1.attention.v_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.1.attention.out_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.1.attention.out_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.1.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.1.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.1.ffn.lin1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.1.ffn.lin1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.1.ffn.lin2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for encoder.layers.1.ffn.lin2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.1.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.layers.1.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for encoder.image_encoder.0.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([300, 2048]).
    size mismatch for encoder.image_encoder.0.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.embeddings.weight: copying a param with shape torch.Size([54946, 512]) from checkpoint, the shape in current model is torch.Size([54946, 300]).
    size mismatch for decoder.position_embeddings.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([1024, 300]).
    size mismatch for decoder.layers.0.self_attention.q_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.0.self_attention.q_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.self_attention.k_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.0.self_attention.k_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.self_attention.v_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.0.self_attention.v_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.self_attention.out_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.0.self_attention.out_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.encoder_attention.q_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.0.encoder_attention.q_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.encoder_attention.k_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.0.encoder_attention.k_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.encoder_attention.v_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.0.encoder_attention.v_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.encoder_attention.out_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.0.encoder_attention.out_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.ffn.lin1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.0.ffn.lin1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.ffn.lin2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.0.ffn.lin2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.norm3.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.0.norm3.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.self_attention.q_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.1.self_attention.q_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.self_attention.k_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.1.self_attention.k_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.self_attention.v_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.1.self_attention.v_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.self_attention.out_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.1.self_attention.out_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.norm1.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.norm1.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.encoder_attention.q_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.1.encoder_attention.q_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.encoder_attention.k_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.1.encoder_attention.k_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.encoder_attention.v_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.1.encoder_attention.v_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.encoder_attention.out_lin.weight: copying a param with shape torch.Size([512, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.1.encoder_attention.out_lin.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.norm2.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.norm2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.ffn.lin1.weight: copying a param with shape torch.Size([2048, 512]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.1.ffn.lin1.bias: copying a param with shape torch.Size([2048]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.ffn.lin2.weight: copying a param with shape torch.Size([512, 2048]) from checkpoint, the shape in current model is torch.Size([300, 300]).
    size mismatch for decoder.layers.1.ffn.lin2.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.norm3.weight: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
    size mismatch for decoder.layers.1.norm3.bias: copying a param with shape torch.Size([512]) from checkpoint, the shape in current model is torch.Size([300]).
stephenroller commented 3 years ago

Unfortunately, we have a lot of options that aren't automatically inferred, which you'll need to specify when fine-tuning.

IIRC, you can grab most of them from https://parl.ai/projects/recipes/ under the 90M description.
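For reference, a fine-tuning call that passes those options explicitly might look roughly like the sketch below. This is only a sketch: the flags are copied from the "opts that differ" warning in the log above, and the exact values should be double-checked against the recipes page or the checkpoint's saved options. Without them, the agent is built with ParlAI's defaults (300-dim embeddings, 2 layers), which is why the state_dict shapes do not match the 512-dim, 8-layer checkpoint.

from parlai.scripts.multiprocessing_train import MultiProcessTrain

MultiProcessTrain.main(
    task='wizard_of_wikipedia:GeneratorTeacher',
    datatype='train',
    datapath='/home/saraharas/datapath',
    model='projects.dialogue_unlikelihood.agents:RepetitionUnlikelihoodAgent',
    init_model='zoo:dialogue_unlikelihood/rep_wiki_label/model',
    dict_file='zoo:dialogue_unlikelihood/rep_wiki_label/model.dict',
    # architecture flags taken from the "opts that differ" warning, so the new
    # agent matches the 512-dim / 8-layer transformer in the checkpoint
    embedding_size=512,
    n_layers=8,
    ffn_size=2048,
    n_heads=16,
    dropout=0.1,
    n_positions=512,
    learn_positional_embeddings=True,
    variant='xlm',
    activation='gelu',
    # dictionary settings the checkpoint was built with
    dict_tokenizer='bpe',
    dict_lower=True,
    # teacher / truncation settings from the same warning
    prepend_gold_knowledge=True,
    text_truncate=512,
    label_truncate=128,
    skip_generation=False,
    batchsize=64,
)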

Aloriosa commented 3 years ago

Ok. Thank you for your answer.