fastnlp / fastNLP

fastNLP: A Modularized and Extensible NLP Framework. Currently still in incubation.
https://gitee.com/fastnlp/fastNLP
Apache License 2.0
3.05k stars 451 forks

BertEmbedding has a different parameter count in 0.5.5 and 0.6.0 when a custom vocab is used #340

Open onebula opened 3 years ago

onebula commented 3 years ago

The code is identical in both cases, and both use a custom vocab.

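import pickle

import pandas as pd
import torch
from fastNLP import DataSet
from fastNLP.embeddings import BertEmbedding

# transform and MultiTaskModel are my own helpers; their definitions are not shown here.

# Load the custom vocab that was saved together with the fine-tuned checkpoint.
with open('checkpoints/vocab_2020-10-09-23-11-21.pickle', 'rb') as f:
    vocab = pickle.load(f)

tst_df = pd.read_csv('data/xxx.csv')
tst_data = DataSet(tst_df.to_dict(orient='list'))

# Join the two events with a [SEP] token in between.
tst_data.apply(lambda x: list(transform(x['event1'], f2h=False, fb=False)) + ['[SEP]'] +
                         list(transform(x['event2'], f2h=False, fb=False)),
               new_field_name='words', is_input=True)
tst_data.apply(lambda x: int(x['is_same']), new_field_name='target', is_target=True)

# Convert the words to indices using the custom vocab.
vocab.index_dataset(tst_data, field_name='words', new_field_name='words')

embed = BertEmbedding(vocab, model_dir_or_name='models/hfl/chinese-roberta-wwm-ext', include_cls_sep=True, requires_grad=True, auto_truncate=True)
model1 = MultiTaskModel(embed)

# This is the call that fails under 0.6.0.
model1.load_state_dict(torch.load('checkpoints/model_2020-10-09-23-11-21.pickle'))
model1.cuda()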

However, after upgrading to 0.6.0, the BERT model that was fine-tuned under 0.5.5 can no longer be loaded. It fails with a size mismatch error.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-2cd663204d0b> in <module>
      2 model1 = MultiTaskModel(embed)
      3 
----> 4 model1.load_state_dict(torch.load('checkpoints/model_2020-10-09-23-11-21.pickle'))
      5 model1.cuda()

~/.pyenv/versions/3.6.8/envs/env-3.6.8/lib/python3.6/site-packages/torch/nn/modules/module.py in load_state_dict(self, state_dict, strict)
    843         if len(error_msgs) > 0:
    844             raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 845                                self.__class__.__name__, "\n\t".join(error_msgs)))
    846         return _IncompatibleKeys(missing_keys, unexpected_keys)
    847 

RuntimeError: Error(s) in loading state_dict for MultiTaskModel:
    size mismatch for bert.model.encoder.embeddings.word_embeddings.weight: copying a param with shape torch.Size([4688, 768]) from checkpoint, the shape in current model is torch.Size([21128, 768]).
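
To see every mismatched parameter at once instead of only the ones PyTorch reports on failure, the shapes in the checkpoint can be compared against the shapes in the freshly built model. The following is a minimal sketch, not part of fastNLP: `find_shape_mismatches` is a hypothetical helper, and it works on plain name-to-shape dictionaries so that it runs without PyTorch. The sample shapes are the ones from the error message above.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint_shapes: Dict[str, Shape],
    model_shapes: Dict[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for parameters whose shapes differ."""
    mismatches = []
    for name, ckpt_shape in checkpoint_shapes.items():
        model_shape = model_shapes.get(name)
        # Only parameters present on both sides can have a size mismatch.
        if model_shape is not None and model_shape != ckpt_shape:
            mismatches.append((name, ckpt_shape, model_shape))
    return mismatches


# With PyTorch, and assuming the checkpoint file holds a state dict (as in the
# post above), the two dictionaries could be built like this:
#   checkpoint_shapes = {k: tuple(v.shape) for k, v in torch.load(path).items()}
#   model_shapes = {k: tuple(v.shape) for k, v in model.state_dict().items()}

# Shapes copied from the error message in this thread.
key = "bert.model.encoder.embeddings.word_embeddings.weight"
checkpoint_shapes = {key: (4688, 768)}
model_shapes = {key: (21128, 768)}

for name, ckpt_shape, model_shape in find_shape_mismatches(checkpoint_shapes, model_shapes):
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

In this thread only the word embedding table differs; the hidden size of 768 is the same on both sides.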

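I inspected the parameters after loading the model in both versions. In 0.6.0, BertEmbedding has a word embedding of shape [21128, 768], whereas the BERT model fine-tuned under 0.5.5 has a word embedding of shape [custom vocab size, 768]. In addition, 0.5.5 prints the following message when training starts, while 0.6.0 does not: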

Start to generate word pieces for word.
160 words are unsegmented. Among them, 160 added to the BPE vocab.

Version 0.6.0: image

Version 0.5.5: image

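I opened an issue about this earlier: https://github.com/fastnlp/fastNLP/issues/280 . The answer at the time was that "words that BERT contains but the vocab does not will be removed". Has that logic been changed since then? I would appreciate it if you could take a look when you have time. Thanks.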

yhcc commented 3 years ago

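Yes, it was changed. The old logic could easily make it impossible to use the BPE from the pretrained BERT at actual inference time, so words that are not used are no longer removed.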
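
The difference between the two behaviors can be illustrated with a toy example. This is only a sketch of the idea described in this thread, not fastNLP's actual implementation: `prune_embedding` and `lookup` are hypothetical helpers, and the vocab and embedding rows are made-up data.

```python
from typing import Dict, List, Set, Tuple


def prune_embedding(
    bert_vocab: Dict[str, int],
    rows: List[List[float]],
    used_pieces: Set[str],
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Old-style behavior as described above: keep only the rows for used word pieces."""
    kept = [piece for piece in bert_vocab if piece in used_pieces]  # keeps original order
    new_vocab = {piece: i for i, piece in enumerate(kept)}
    new_rows = [rows[bert_vocab[piece]] for piece in kept]
    return new_vocab, new_rows


def lookup(vocab: Dict[str, int], piece: str) -> int:
    """Map a word piece to its index, falling back to [UNK] when it is missing."""
    return vocab.get(piece, vocab["[UNK]"])


# Toy pretrained vocab with one embedding row per word piece.
bert_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "我": 4, "你": 5, "他": 6}
rows = [[float(i), float(i)] for i in range(len(bert_vocab))]

# Word pieces that occur in the custom (training) vocab.
used_pieces = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "我"}

pruned_vocab, pruned_rows = prune_embedding(bert_vocab, rows, used_pieces)

print(len(pruned_rows), len(rows))   # the pruned table is smaller than the full one
print(lookup(pruned_vocab, "你"))     # falls back to [UNK] in the pruned table
print(lookup(bert_vocab, "你"))       # keeps its pretrained row in the full table
```

In this toy model, a checkpoint saved from the pruned table has fewer embedding rows than a model built with the full table, which matches the kind of mismatch reported above (4688 rows versus 21128 rows).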