RowitZou / topic-dialog-summ

AAAI-2021 paper: Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling.
MIT License
77 stars · 9 forks

Has this dimension-alignment problem ever shown up in your code? Could you take a look at this bug when you have time? Thanks. #5

Closed PYMAQ closed 2 years ago

PYMAQ commented 3 years ago

(pytorch17) D:\00CC_code\topic-dialog-summ-main\topic-dialog-summ-main>python ./src/train.py -data_path bert_data/ali -bert_dir bert/chinese_bert -log_file logs/pipeline.topic.train.log -sep_optim -topic_model -split_noise -pretrain -model_path models/pipeline_topic
[2021-05-16 04:35:28,785 INFO] Namespace(accum_count=2, agent=True, alpha=0.6, batch_ex_size=4, batch_size=2000, beam_size=3, bert_dir='bert/chinese_bert', beta1=0.9, beta2=0.999, block_trigram=True, copy_attn=False, copy_attn_force=False, copy_loss_by_seqlength=False, coverage=False, cust=True, data_path='bert_data/ali', dec_dropout=0.2, dec_ff_size=2048, dec_heads=8, dec_hidden_size=768, dec_layers=3, decoder='transformer', enc_dropout=0.2, enc_ff_size=2048, enc_heads=8, enc_hidden_size=768, enc_layers=3, encoder='bert', ex_max_token_num=500, finetune_bert=True, freeze_step=500, generator_shard_size=32, gpu_ranks=[0], hier_dropout=0.2, hier_ff_size=2048, hier_heads=8, hier_hidden_size=768, hier_layers=2, idf_info_path='bert_data/idf_info.pt', label_smoothing=0.1, log_file='logs/pipeline.topic.train.log', loss_lambda=0.001, lr=0.001, lr_bert=0.001, lr_other=0.01, lr_topic=0.0001, mask_token_prob=0.15, max_grad_norm=0, max_length=100, max_pos=512, max_tgt_len=100, max_word_count=6000, min_length=10, min_word_count=5, mode='train', model_path='models/pipeline_topic', noise_rate=0.5, optim='adam', pn_dropout=0.2, pn_ff_size=2048, pn_heads=8, pn_hidden_size=768, pn_layers=2, pretrain=True, pretrain_steps=80000, report_every=5, result_path='results/ali', save_checkpoint_steps=2000, seed=666, select_sent_prob=0.9, sent_dropout=0.2, sent_ff_size=2048, sent_heads=8, sent_hidden_size=768, sent_layers=3, sep_optim=True, share_emb=True, split_noise=True, src_data_mode='utt', test_all=False, test_batch_ex_size=50, test_batch_size=20000, test_from='', test_mode='abs', test_start_from=-1, tokenize=True, topic_model=True, topic_num=50, train_from='', train_from_ignore_optim=False, train_steps=80000, use_idf=False, visible_gpus='0', warmup=True, warmup_steps=5000, warmup_steps_bert=5000, warmup_steps_other=5000, word_emb_mode='word2vec', word_emb_path='pretrain_emb/word2vec', word_emb_size=100, world_size=1)
[2021-05-16 04:35:28,786 INFO] Device ID 0
[2021-05-16 04:35:28,786 INFO] Device cuda
[2021-05-16 04:35:28,808 INFO] Model name 'bert/chinese_bert' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc). Assuming 'bert/chinese_bert' is a path or url to a directory containing tokenizer files.
[2021-05-16 04:35:28,808 INFO] Didn't find file bert/chinese_bert\added_tokens.json. We won't load it.
[2021-05-16 04:35:28,808 INFO] Didn't find file bert/chinese_bert\special_tokens_map.json. We won't load it.
[2021-05-16 04:35:28,808 INFO] loading file bert/chinese_bert\vocab.txt
[2021-05-16 04:35:28,808 INFO] loading file None
[2021-05-16 04:35:28,808 INFO] loading file None
[2021-05-16 04:35:28,907 INFO] loading configuration file bert/chinese_bert\config.json
[2021-05-16 04:35:28,907 INFO] Model config { "attention_probs_dropout_prob": 0.1, "directionality": "bidi", "finetuning_task": null, "hidden_act": "gelu", "hidden_dropout_prob": 0.1, "hidden_size": 1024, "initializer_range": 0.02, "intermediate_size": 4096, "layer_norm_eps": 1e-12, "max_position_embeddings": 512, "num_attention_heads": 16, "num_hidden_layers": 24, "num_labels": 2, "output_attentions": false, "output_hidden_states": false, "pooler_fc_size": 768, "pooler_num_attention_heads": 12, "pooler_num_fc_layers": 3, "pooler_size_per_head": 128, "pooler_type": "first_token_transform", "torchscript": false, "type_vocab_size": 2, "vocab_size": 21128 }

[2021-05-16 04:35:28,908 INFO] loading weights file bert/chinese_bert\pytorch_model.bin [2021-05-16 04:35:34,504 INFO] loading Word2VecKeyedVectors object from pretrain_emb/word2vec [2021-05-16 04:35:34,504 INFO] setting ignored attribute vectors_norm to None [2021-05-16 04:35:34,505 INFO] loaded pretrain_emb/word2vec [2021-05-16 04:35:37,689 INFO] Model( (embeddings): Embedding(21128, 1024, padding_idx=0) (encoder): Bert( (model): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(21128, 1024, padding_idx=0) (position_embeddings): Embedding(512, 1024) (token_type_embeddings): Embedding(2, 1024) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (1): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (2): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (3): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, 
out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (4): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (5): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (6): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (7): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (8): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, 
out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (9): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (10): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (11): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (12): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (13): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): 
Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (14): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (15): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (16): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (17): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (18): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): 
Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (19): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (20): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (21): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (22): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (23): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=1024, out_features=1024, bias=True) (key): Linear(in_features=1024, out_features=1024, bias=True) (value): Linear(in_features=1024, 
out_features=1024, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=1024, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=1024, out_features=4096, bias=True) ) (output): BertOutput( (dense): Linear(in_features=4096, out_features=1024, bias=True) (LayerNorm): BertLayerNorm() (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) (pooler): BertPooler( (dense): Linear(in_features=1024, out_features=1024, bias=True) (activation): Tanh() ) ) ) (sent_encoder): TransformerEncoder( (pos_emb): PositionalEncoding( (dropout): Dropout(p=0.2, inplace=False) ) (transformer): ModuleList( (0): TransformerEncoderLayer( (self_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=768, out_features=2048, bias=True) (w_2): Linear(in_features=2048, out_features=768, bias=True) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout_1): Dropout(p=0.2, inplace=False) (dropout_2): Dropout(p=0.2, inplace=False) ) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout): Dropout(p=0.2, inplace=False) ) (1): TransformerEncoderLayer( (self_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=768, out_features=2048, bias=True) (w_2): Linear(in_features=2048, out_features=768, bias=True) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout_1): Dropout(p=0.2, inplace=False) (dropout_2): Dropout(p=0.2, inplace=False) ) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout): Dropout(p=0.2, inplace=False) ) (2): TransformerEncoderLayer( (self_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=768, out_features=2048, bias=True) (w_2): Linear(in_features=2048, out_features=768, bias=True) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout_1): Dropout(p=0.2, inplace=False) (dropout_2): Dropout(p=0.2, inplace=False) ) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout): Dropout(p=0.2, inplace=False) ) ) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) ) (hier_encoder): TransformerEncoder( (pos_emb): PositionalEncoding( (dropout): Dropout(p=0.2, inplace=False) ) (transformer): ModuleList( (0): TransformerEncoderLayer( (self_attn): MultiHeadedAttention( 
(linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=768, out_features=2048, bias=True) (w_2): Linear(in_features=2048, out_features=768, bias=True) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout_1): Dropout(p=0.2, inplace=False) (dropout_2): Dropout(p=0.2, inplace=False) ) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout): Dropout(p=0.2, inplace=False) ) (1): TransformerEncoderLayer( (self_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=768, out_features=2048, bias=True) (w_2): Linear(in_features=2048, out_features=768, bias=True) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout_1): Dropout(p=0.2, inplace=False) (dropout_2): Dropout(p=0.2, inplace=False) ) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout): Dropout(p=0.2, inplace=False) ) ) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) ) (pn_decoder): TransformerDecoder( (transformer_layers): ModuleList( (0): TransformerDecoderLayer( (self_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) ) (context_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) (linear_topic_keys): Linear(in_features=768, out_features=768, bias=True) (linear_topic_vecs): Linear(in_features=300, out_features=768, bias=True) (linear_topic_w): Linear(in_features=2304, out_features=8, bias=True) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=768, out_features=2048, bias=True) (w_2): Linear(in_features=2048, out_features=768, bias=True) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout_1): Dropout(p=0.2, inplace=False) (dropout_2): Dropout(p=0.2, inplace=False) ) (layer_norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (layer_norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (drop): Dropout(p=0.2, inplace=False) ) (1): TransformerDecoderLayer( (self_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): 
Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) ) (context_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) (linear_topic_keys): Linear(in_features=768, out_features=768, bias=True) (linear_topic_vecs): Linear(in_features=300, out_features=768, bias=True) (linear_topic_w): Linear(in_features=2304, out_features=8, bias=True) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=768, out_features=2048, bias=True) (w_2): Linear(in_features=2048, out_features=768, bias=True) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout_1): Dropout(p=0.2, inplace=False) (dropout_2): Dropout(p=0.2, inplace=False) ) (layer_norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (layer_norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (drop): Dropout(p=0.2, inplace=False) ) ) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) ) (pn_generator): PointerNetGenerator( (linear_dec): Linear(in_features=768, out_features=768, bias=True) (linear_mem): Linear(in_features=768, out_features=768, bias=True) (score_linear): Linear(in_features=768, out_features=1, bias=True) (tanh): Tanh() (softmax): LogSoftmax(dim=-1) ) (decoder): TransformerDecoder( (embeddings): Embedding(21128, 1024, padding_idx=0) (pos_emb): PositionalEncoding( (dropout): Dropout(p=0.2, inplace=False) ) (transformer_layers): ModuleList( (0): TransformerDecoderLayer( (self_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) ) (context_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) (linear_topic_keys): Linear(in_features=768, out_features=768, bias=True) (linear_topic_vecs): Linear(in_features=300, out_features=768, bias=True) (linear_topic_w): Linear(in_features=2304, out_features=8, bias=True) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=768, out_features=2048, bias=True) (w_2): Linear(in_features=2048, out_features=768, bias=True) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout_1): Dropout(p=0.2, inplace=False) (dropout_2): Dropout(p=0.2, inplace=False) ) (layer_norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (layer_norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (drop): Dropout(p=0.2, inplace=False) ) (1): TransformerDecoderLayer( (self_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, 
bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) ) (context_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) (linear_topic_keys): Linear(in_features=768, out_features=768, bias=True) (linear_topic_vecs): Linear(in_features=300, out_features=768, bias=True) (linear_topic_w): Linear(in_features=2304, out_features=8, bias=True) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=768, out_features=2048, bias=True) (w_2): Linear(in_features=2048, out_features=768, bias=True) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout_1): Dropout(p=0.2, inplace=False) (dropout_2): Dropout(p=0.2, inplace=False) ) (layer_norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (layer_norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (drop): Dropout(p=0.2, inplace=False) ) (2): TransformerDecoderLayer( (self_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) ) (context_attn): MultiHeadedAttention( (linear_keys): Linear(in_features=768, out_features=768, bias=True) (linear_values): Linear(in_features=768, out_features=768, bias=True) (linear_query): Linear(in_features=768, out_features=768, bias=True) (softmax): Softmax(dim=-1) (dropout): Dropout(p=0.2, inplace=False) (final_linear): Linear(in_features=768, out_features=768, bias=True) (linear_topic_keys): Linear(in_features=768, out_features=768, bias=True) (linear_topic_vecs): Linear(in_features=300, out_features=768, bias=True) (linear_topic_w): Linear(in_features=2304, out_features=8, bias=True) ) (feed_forward): PositionwiseFeedForward( (w_1): Linear(in_features=768, out_features=2048, bias=True) (w_2): Linear(in_features=2048, out_features=768, bias=True) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (dropout_1): Dropout(p=0.2, inplace=False) (dropout_2): Dropout(p=0.2, inplace=False) ) (layer_norm_1): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (layer_norm_2): LayerNorm((768,), eps=1e-06, elementwise_affine=True) (drop): Dropout(p=0.2, inplace=False) ) ) (layer_norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True) ) (generator): Generator( (linear): Linear(in_features=768, out_features=21128, bias=True) (softmax): LogSoftmax(dim=-1) ) (topic_model): MultiTopicModel( (tm1): TopicModel( (mlp): Sequential( (0): Linear(in_features=9, out_features=200, bias=True) (1): Tanh() ) (mu_linear): Linear(in_features=200, out_features=100, bias=True) (sigma_linear): Linear(in_features=200, out_features=100, bias=True) (theta_linear): Linear(in_features=100, out_features=50, bias=True) ) (tm2): TopicModel( (mlp): Sequential( (0): Linear(in_features=9, out_features=200, bias=True) (1): Tanh() ) (mu_linear): Linear(in_features=200, out_features=100, bias=True) (sigma_linear): Linear(in_features=200, out_features=100, bias=True) 
(theta_linear): Linear(in_features=100, out_features=50, bias=True) ) (tm3): TopicModel( (mlp): Sequential( (0): Linear(in_features=9, out_features=200, bias=True) (1): Tanh() ) (mu_linear): Linear(in_features=200, out_features=100, bias=True) (sigma_linear): Linear(in_features=200, out_features=100, bias=True) (theta_linear): Linear(in_features=100, out_features=50, bias=True) ) ) (topic_gate_linear_summ): Linear(in_features=1068, out_features=300, bias=True) (topic_emb_linear_summ): Linear(in_features=768, out_features=300, bias=True) (topic_gate_linear_noise): Linear(in_features=1068, out_features=300, bias=True) (topic_emb_linear_noise): Linear(in_features=768, out_features=300, bias=True) )
gpu_rank 0
[2021-05-16 04:35:37,717 INFO] number of parameters: 420789075
[2021-05-16 04:35:37,718 INFO] Start training...
[2021-05-16 04:35:37,719 INFO] Loading train dataset from bert_data\ali.train.0.bert.pt, number of examples: 8
idf_info {'all': Counter({0: 32, 1: 32, 2: 32, 3: 32, 4: 32, 5: 32, 6: 32, 7: 32, 8: 32}), 'customer': Counter({6: 32, 7: 32, 8: 32}), 'agent': Counter({0: 32, 1: 32, 2: 32, 3: 32, 4: 32, 5: 32}), 'num': 32, 'voc_size': 9}
vocab_size 9
all_bow tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0., 0.]])
idf_info {'all': Counter({0: 32, 1: 32, 2: 32, 3: 32, 4: 32, 5: 32, 6: 32, 7: 32, 8: 32}), 'customer': Counter({6: 32, 7: 32, 8: 32}), 'agent': Counter({0: 32, 1: 32, 2: 32, 3: 32, 4: 32, 5: 32}), 'num': 32, 'voc_size': 9}
vocab_size 9
all_bow tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0., 0., 0., 0.]])
emb torch.Size([8, 11, 1024])
emb torch.Size([8, 11, 1024])
pos torch.Size([1, 11, 768])
Traceback (most recent call last):
  File "./src/train.py", line 163, in <module>
    train(args, device_id)
  File "D:\00CC_code\topic-dialog-summ-main\topic-dialog-summ-main\src\train_abstractive.py", line 359, in train
    train_single(args, device_id)
  File "D:\00CC_code\topic-dialog-summ-main\topic-dialog-summ-main\src\train_abstractive.py", line 423, in train_single
    trainer.train(train_iter_fct, args.pretrain_steps)
  File "D:\00CC_code\topic-dialog-summ-main\topic-dialog-summ-main\src\models\rl_model_trainer.py", line 172, in train
    report_stats, step)
  File "D:\00CC_code\topic-dialog-summ-main\topic-dialog-summ-main\src\models\rl_model_trainer.py", line 201, in _gradient_calculation
    pn_output, decode_output, topic_loss, _ = self.model.pretrain(batch)
  File "D:\00CC_code\topic-dialog-summ-main\topic-dialog-summ-main\src\models\rl_model.py", line 761, in pretrain
    sent_hid = self.sent_encoder(src_emb, ~mask_src)[:, 0, :]
  File "C:\Users\86173\.conda\envs\pytorch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\00CC_code\topic-dialog-summ-main\topic-dialog-summ-main\src\models\encoder.py", line 247, in forward
    x = self.pos_emb(top_vecs)
  File "C:\Users\86173\.conda\envs\pytorch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\00CC_code\topic-dialog-summ-main\topic-dialog-summ-main\src\models\encoder.py", line 74, in forward
    emb = emb + pos
RuntimeError: The size of tensor a (1024) must match the size of tensor b (768) at non-singleton dimension 2

RowitZou commented 3 years ago

In our setup the word embedding dimension was the same as the model's hidden size (both 768), so there was no problem. In your case the word embedding dimension is 1024, so you need to pass it through a linear layer that maps it to 768 before it goes into sent_encoder.

PYMAQ commented 3 years ago

In our setup the word embedding dimension was the same as the model's hidden size (both 768), so there was no problem. In your case the word embedding dimension is 1024, so you need to pass it through a linear layer that maps it to 768 before it goes into sent_encoder.

Is the word embedding dimension configurable? I didn't see it anywhere in train_emb.py. Where should the linear layer go, and which .py file should I modify? Thanks for your guidance!

PYMAQ commented 3 years ago

How do I change the word embedding dimension? I didn't modify the code; I just followed the README instructions and ran a small-sample Chinese dataset, and it got stuck here. Thanks for your advice.

RowitZou commented 3 years ago

You are probably using your own embedding file; in the original code the word embeddings are randomly initialized, which guarantees that the dimensions are aligned. The embeddings produced by train_emb.py are only used to train the topic model.

The concrete change: add a linear projection after every src_emb = self.embeddings(src) statement. The file is rl_model.py.
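For readers hitting the same mismatch, here is a minimal sketch of that projection, not code from the repository: the class name ProjectedEmbedding is made up for illustration, and the 21128/1024/768 sizes are simply the ones appearing in the log above.

import torch
import torch.nn as nn

class ProjectedEmbedding(nn.Module):
    """Word embedding followed by a linear map down to the model's hidden size."""
    def __init__(self, vocab_size=21128, emb_dim=1024, hidden_size=768):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # the extra linear layer suggested above: emb_dim -> hidden_size
        self.proj = nn.Linear(emb_dim, hidden_size)

    def forward(self, src):
        # equivalent to inserting "src_emb = self.proj(src_emb)" right after
        # each "src_emb = self.embeddings(src)" statement in rl_model.py
        return self.proj(self.embeddings(src))

# quick shape check with the batch/sequence sizes from the log above
emb = ProjectedEmbedding()
src = torch.zeros(8, 11, dtype=torch.long)
print(emb(src).shape)  # torch.Size([8, 11, 768]), which sent_encoder expects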

PYMAQ commented 3 years ago

Solved it. It turns out I was using the 1024-dim BERT; switching to the 768-dim base model works. Thanks for the pointers!!

PYMAQ commented 3 years ago

After training a simple model, running with mode=test produces the bug below. Is there a way around it? It seems to fail at self.load_state_dict(checkpoint['model'], strict=False). Do you have any ideas? Thanks!

[2021-05-16 20:00:56,165 INFO] setting ignored attribute vectors_norm to None
[2021-05-16 20:00:56,165 INFO] loaded pretrain_emb/word2vec
Traceback (most recent call last):
  File "./src/train.py", line 177, in <module>
    test_text(args, device_id, cp, step)
  File "D:\00CC_code\topic-dialog-summ-main\topic-dialog-summ-main\src\train_abstractive.py", line 345, in test_text
    model = Summarizer(args, device, tokenizer.vocab, checkpoint)
  File "D:\00CC_code\topic-dialog-summ-main\topic-dialog-summ-main\src\models\rl_model.py", line 111, in __init__
    self.load_state_dict(checkpoint['model'], strict=False)
  File "C:\Users\86173\.conda\envs\pytorch17\lib\site-packages\torch\nn\modules\module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model:
	size mismatch for encoder.model.embeddings.word_embeddings.weight: copying a param with shape torch.Size([21128, 768]) from checkpoint, the shape in current model is torch.Size([21128, 1024]).
	size mismatch for encoder.model.embeddings.position_embeddings.weight: copying a param with shape torch.Size([512, 768]) from checkpoint, the shape in current model is torch.Size([512, 1024]).
	size mismatch for encoder.model.embeddings.token_type_embeddings.weight: copying a param with shape torch.Size([2, 768]) from checkpoint, the shape in current model is torch.Size([2, 1024]).
	size mismatch for encoder.model.embeddings.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for encoder.model.embeddings.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for encoder.model.encoder.layer.0.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for encoder.model.encoder.layer.0.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for encoder.model.encoder.layer.0.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for encoder.model.encoder.layer.0.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for encoder.model.encoder.layer.0.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for encoder.model.encoder.layer.0.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for encoder.model.encoder.layer.0.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
	size mismatch for encoder.model.encoder.layer.0.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
	size mismatch for encoder.model.encoder.layer.0.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.0.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torc h.Size([1024]). size mismatch for encoder.model.encoder.layer.0.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torc h.Size([4096, 1024]). size mismatch for encoder.model.encoder.layer.0.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size( [4096]). size mismatch for encoder.model.encoder.layer.0.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size ([1024, 4096]). size mismatch for encoder.model.encoder.layer.0.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]) . size mismatch for encoder.model.encoder.layer.0.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([ 1024]). size mismatch for encoder.model.encoder.layer.0.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([10 24]). size mismatch for encoder.model.encoder.layer.1.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is tor ch.Size([1024, 1024]). size mismatch for encoder.model.encoder.layer.1.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size ([1024]). size mismatch for encoder.model.encoder.layer.1.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch .Size([1024, 1024]). size mismatch for encoder.model.encoder.layer.1.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([ 1024]). size mismatch for encoder.model.encoder.layer.1.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is tor ch.Size([1024, 1024]). size mismatch for encoder.model.encoder.layer.1.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size ([1024]). size mismatch for encoder.model.encoder.layer.1.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is t orch.Size([1024, 1024]). size mismatch for encoder.model.encoder.layer.1.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Si ze([1024]). size mismatch for encoder.model.encoder.layer.1.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is to rch.Size([1024]). size mismatch for encoder.model.encoder.layer.1.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torc h.Size([1024]). size mismatch for encoder.model.encoder.layer.1.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torc h.Size([4096, 1024]). 
size mismatch for encoder.model.encoder.layer.1.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size( [4096]). size mismatch for encoder.model.encoder.layer.1.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size ([1024, 4096]). size mismatch for encoder.model.encoder.layer.1.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]) . size mismatch for encoder.model.encoder.layer.1.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([ 1024]). size mismatch for encoder.model.encoder.layer.1.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([10 24]). size mismatch for encoder.model.encoder.layer.2.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is tor ch.Size([1024, 1024]). size mismatch for encoder.model.encoder.layer.2.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size ([1024]). size mismatch for encoder.model.encoder.layer.2.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch .Size([1024, 1024]). size mismatch for encoder.model.encoder.layer.2.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([ 1024]). size mismatch for encoder.model.encoder.layer.2.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is tor ch.Size([1024, 1024]). size mismatch for encoder.model.encoder.layer.2.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size ([1024]). size mismatch for encoder.model.encoder.layer.2.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is t orch.Size([1024, 1024]). size mismatch for encoder.model.encoder.layer.2.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Si ze([1024]). size mismatch for encoder.model.encoder.layer.2.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is to rch.Size([1024]). size mismatch for encoder.model.encoder.layer.2.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torc h.Size([1024]). size mismatch for encoder.model.encoder.layer.2.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torc h.Size([4096, 1024]). size mismatch for encoder.model.encoder.layer.2.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size( [4096]). size mismatch for encoder.model.encoder.layer.2.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size ([1024, 4096]). 
size mismatch for encoder.model.encoder.layer.2.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.2.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.2.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.3.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.3.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.3.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.3.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.3.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.3.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.3.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.3.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.3.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.3.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.3.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
size mismatch for encoder.model.encoder.layer.3.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
size mismatch for encoder.model.encoder.layer.3.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for encoder.model.encoder.layer.3.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.3.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.3.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.4.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.4.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.4.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.4.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.4.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.4.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.4.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.4.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.4.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.4.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.4.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
size mismatch for encoder.model.encoder.layer.4.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
size mismatch for encoder.model.encoder.layer.4.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for encoder.model.encoder.layer.4.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.4.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.4.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.5.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.5.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.5.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.5.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.5.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.5.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.5.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.5.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.5.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.5.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.5.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
size mismatch for encoder.model.encoder.layer.5.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
size mismatch for encoder.model.encoder.layer.5.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for encoder.model.encoder.layer.5.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.5.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.5.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.6.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.6.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.6.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.6.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.6.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.6.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.6.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.6.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.6.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.6.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.6.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
size mismatch for encoder.model.encoder.layer.6.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
size mismatch for encoder.model.encoder.layer.6.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for encoder.model.encoder.layer.6.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.6.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.6.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.7.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.7.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.7.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.7.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.7.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.7.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.7.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.7.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.7.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.7.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.7.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
size mismatch for encoder.model.encoder.layer.7.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
size mismatch for encoder.model.encoder.layer.7.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for encoder.model.encoder.layer.7.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.7.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.7.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.8.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.8.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.8.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.8.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.8.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.8.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.8.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.8.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.8.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.8.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.8.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
size mismatch for encoder.model.encoder.layer.8.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
size mismatch for encoder.model.encoder.layer.8.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for encoder.model.encoder.layer.8.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.8.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.8.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.9.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.9.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.9.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.9.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.9.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.9.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.9.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.9.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.9.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.9.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.9.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
size mismatch for encoder.model.encoder.layer.9.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
size mismatch for encoder.model.encoder.layer.9.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for encoder.model.encoder.layer.9.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.9.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.9.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.10.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.10.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.10.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.10.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.10.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.10.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.10.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.10.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.10.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.10.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.10.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
size mismatch for encoder.model.encoder.layer.10.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
size mismatch for encoder.model.encoder.layer.10.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for encoder.model.encoder.layer.10.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.10.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.10.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.11.attention.self.query.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.11.attention.self.query.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.11.attention.self.key.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.11.attention.self.key.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.11.attention.self.value.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.11.attention.self.value.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.11.attention.output.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.encoder.layer.11.attention.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.11.attention.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.11.attention.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.11.intermediate.dense.weight: copying a param with shape torch.Size([3072, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1024]).
size mismatch for encoder.model.encoder.layer.11.intermediate.dense.bias: copying a param with shape torch.Size([3072]) from checkpoint, the shape in current model is torch.Size([4096]).
size mismatch for encoder.model.encoder.layer.11.output.dense.weight: copying a param with shape torch.Size([768, 3072]) from checkpoint, the shape in current model is torch.Size([1024, 4096]).
size mismatch for encoder.model.encoder.layer.11.output.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.11.output.LayerNorm.weight: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.encoder.layer.11.output.LayerNorm.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for encoder.model.pooler.dense.weight: copying a param with shape torch.Size([768, 768]) from checkpoint, the shape in current model is torch.Size([1024, 1024]).
size mismatch for encoder.model.pooler.dense.bias: copying a param with shape torch.Size([768]) from checkpoint, the shape in current model is torch.Size([1024]).
size mismatch for decoder.embeddings.weight: copying a param with shape torch.Size([21128, 768]) from checkpoint, the shape in current model is torch.Size([21128, 1024]).
size mismatch for decoder.pos_emb.pe: copying a param with shape torch.Size([1, 5000, 768]) from checkpoint, the shape in current model is torch.Size([1, 5000, 1024]).
size mismatch for generator.linear.weight: copying a param with shape torch.Size([21128, 768]) from checkpoint, the shape in current model is torch.Size([21128, 1024]).
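Judging from the log, every mismatched parameter is 768-dimensional (3072 for the feed-forward layers) in the checkpoint being loaded, while the freshly built model expects 1024/4096, which matches the config.json printed above (hidden_size: 1024, intermediate_size: 4096, i.e. a BERT-large style configuration) being paired with BERT-base sized weights. Below is a minimal sketch to double-check which side is which; the paths and variable names are only illustrative (not something the repo defines) and assume the directory layout from the command above:

```python
import json
import torch

# Illustrative paths: the config the encoder is built from, and the state dict
# that train.py is trying to load into the model (adjust to your own setup).
config_path = "bert/chinese_bert/config.json"
ckpt_path = "bert/chinese_bert/pytorch_model.bin"

# Hidden size the freshly built model expects (1024 in the config above).
with open(config_path, encoding="utf-8") as f:
    config_hidden = json.load(f)["hidden_size"]

# Hidden size the saved weights actually have (768 would mean BERT-base weights).
state_dict = torch.load(ckpt_path, map_location="cpu")
state_dict = state_dict.get("model", state_dict)  # some checkpoints nest weights under "model"
emb_key = next(k for k in state_dict if k.endswith("word_embeddings.weight"))
ckpt_hidden = state_dict[emb_key].shape[-1]

print("config.json hidden_size :", config_hidden)
print("checkpoint hidden size  :", ckpt_hidden, "(from", emb_key, ")")
if config_hidden != ckpt_hidden:
    print("config and weights come from different BERT sizes -> size mismatch on load")
```

If the two numbers disagree, making both sides the same size, e.g. using a standard bert-base-chinese config and weights (hidden_size 768) in bert/chinese_bert, should let the checkpoint load without these mismatches.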