BertEmbedding has a different number of parameters in 0.5.5 and 0.6.0 when a custom vocab is used

The code is exactly the same in both cases, and both use a custom vocab.

import pickle

import pandas as pd
import torch
from fastNLP import DataSet
from fastNLP.embeddings import BertEmbedding

# transform() and MultiTaskModel are defined elsewhere (not shown in this issue).

# Load the custom vocab that was saved during training.
with open('checkpoints/vocab_2020-10-09-23-11-21.pickle', 'rb') as f:
    vocab = pickle.load(f)

tst_df = pd.read_csv('data/xxx.csv')
tst_data = DataSet(tst_df.to_dict(orient='list'))

# Build the input as: event1 tokens + [SEP] + event2 tokens.
tst_data.apply(lambda x: list(transform(x['event1'], f2h=False, fb=False)) + ['[SEP]'] +
                         list(transform(x['event2'], f2h=False, fb=False)),
               new_field_name='words', is_input=True)
tst_data.apply(lambda x: int(x['is_same']), new_field_name='target', is_target=True)

vocab.index_dataset(tst_data, field_name='words', new_field_name='words')

embed = BertEmbedding(vocab, model_dir_or_name='models/hfl/chinese-roberta-wwm-ext', include_cls_sep=True, requires_grad=True, auto_truncate=True)
model1 = MultiTaskModel(embed)

# Load the weights that were fine-tuned under 0.5.5.
model1.load_state_dict(torch.load('checkpoints/model_2020-10-09-23-11-21.pickle'))
model1.cuda()

However, after upgrading to 0.6.0, the BERT model fine-tuned under 0.5.5 can no longer be loaded; it fails with a dimension mismatch error.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-2cd663204d0b> in <module>
      2 model1 = MultiTaskModel(embed)
      3 
----> 4 model1.load_state_dict(torch.load('checkpoints/model_2020-10-09-23-11-21.pickle'))
      5 model1.cuda()

~/.pyenv/versions/3.6.8/envs/env-3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    843         if len(error_msgs) > 0:
    844             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 845                                self.__class__.__name__, "\n\t".join(error_msgs)))
    846         return _IncompatibleKeys(missing_keys, unexpected_keys)
    847 

RuntimeError: Error(s) in loading state_dict for MultiTaskModel:
    size mismatch for bert.model.encoder.embeddings.word_embeddings.weight: copying a param with shape torch.Size([4688, 768]) from checkpoint, the shape in current model is torch.Size([21128, 768]).
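
To list every mismatched tensor at once instead of reading them out of the error text, a small helper like the one below may be useful. It is only a sketch: `find_shape_mismatches` is a hypothetical name, and it works on plain name-to-shape mappings so that it does not depend on any particular fastNLP version. The shapes in the example are the ones from the error message above.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint_shapes: Dict[str, Shape],
    model_shapes: Dict[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for shared keys whose shapes differ."""
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Keys missing from the model are skipped; only shape conflicts are reported.
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((name, ckpt_shape, model_shape))
    return mismatches


# Shapes copied from the error message in this issue.
key = "bert.model.encoder.embeddings.word_embeddings.weight"
checkpoint_shapes = {key: (4688, 768)}
model_shapes = {key: (21128, 768)}

for name, ckpt_shape, model_shape in find_shape_mismatches(checkpoint_shapes, model_shapes):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

With PyTorch, the two mappings can be built as `{k: tuple(v.shape) for k, v in state_dict.items()}`, once from the dict returned by `torch.load(...)` and once from `model1.state_dict()`.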

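I checked the parameters after loading the model in both versions. In 0.6.0 the BertEmbedding word embedding is 21128 x 768, whereas the word embedding of the BERT model fine-tuned under 0.5.5 is (custom vocab size) x 768. In addition, 0.5.5 prints the following message when training starts, while 0.6.0 does not: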

Start to generate word pieces for word.
160 words are unsegmented. Among them, 160 added to the BPE vocab.
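
For illustration, here is a minimal sketch of how pruning the word-piece vocabulary down to the pieces a custom vocab actually needs would shrink the embedding matrix (for example from 21128 rows to 4688). This is not fastNLP's actual implementation: `prune_word_pieces`, the toy vocabulary, and the character-level tokenizer are all assumptions made for the example, and the step that adds unsegmented words to the vocab is left out.

```python
from typing import Callable, Dict, Iterable, List

SPECIAL_TOKENS = ("[PAD]", "[UNK]", "[CLS]", "[SEP]")


def prune_word_pieces(
    bert_vocab: List[str],
    words: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> Dict[int, int]:
    """Map old word-piece ids to compact new ids, keeping only the pieces that are needed."""
    piece_to_old_id = {piece: i for i, piece in enumerate(bert_vocab)}
    # Special tokens are always kept if the pretrained vocab has them.
    needed = {t for t in SPECIAL_TOKENS if t in piece_to_old_id}
    for word in words:
        for piece in tokenize(word):
            # Pieces unknown to the pretrained vocab are skipped in this sketch.
            if piece in piece_to_old_id:
                needed.add(piece)
    kept_old_ids = sorted(piece_to_old_id[p] for p in needed)
    return {old_id: new_id for new_id, old_id in enumerate(kept_old_ids)}


# Toy data: a 10-entry "pretrained" vocab and a custom vocab that uses 3 pieces.
toy_bert_vocab = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "事", "件", "相", "同", "天", "气"]
custom_words = ["事", "件", "相"]

id_map = prune_word_pieces(toy_bert_vocab, custom_words, tokenize=list)
print(len(toy_bert_vocab), "->", len(id_map))  # prints: 10 -> 7
```

A checkpoint saved from a model pruned this way only has rows for the kept pieces, so it cannot be copied into a model that keeps the full pretrained vocabulary, which is consistent with the size mismatch reported here.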

Version 0.6.0: (attachment not preserved)

Version 0.5.5: (attachment not preserved)

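I previously opened an issue, https://github.com/fastnlp/fastNLP/issues/280, where the answer was that "words that BERT contains but the vocab does not will be removed". Has that logic been changed since then? I'd appreciate it if you could take a look when you have time. Thanks!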

fastnlp/fastNLP, issue #340