bigdata-ustc / EduNLP

A library for advanced Natural Language Processing towards multi-modal educational items.

[Bug] BertModel `add_special_tokens` and `resize_token_embeddings` problem #112

Closed pingzhili closed 2 years ago

pingzhili commented 2 years ago

🐛 Description

@nnnyt @KenelmQLH After I added TALEduBERT to our project and ran some tests, I found that the current get_pretrained_i2v function returns a mismatched BertTokenizer and BertT2V (with respect to special tokens). More specifically:

  1. During the initialization of BertTokenizer, tokens such as [FIGURE] and [TAG] are added to self.tokenizer (the underlying huggingface tokenizer). In my case this increases the tokenizer's vocabulary size, since TALEduBERT does not contain these tokens, so they are tokenized to ids outside the range of the embedding layer (see the minimal sketch after this list).
  2. Usually we have to call model.resize_token_embeddings(len(tokenizer)) after tokenizer.add_special_tokens(), and indeed there is such a call in Vector/t2v.BertModel (if tokenizer: self.model.resize_token_embeddings(len(tokenizer.tokenizer))). However, as @KenelmQLH required, "T2V has to be separated from the tokenizer".
  3. Another point: even if I upload a resized TALEduBERT to the model-hub and it passes the test, the '[FIGURE]' etc. tokens are still meaningless to the model, because the new embeddings were never trained.
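
To make the mismatch concrete, it can be reproduced with plain huggingface transformers alone. This is a minimal sketch, assuming the TALEduBERT checkpoint has been unzipped into a local tal_edu_bert/ directory; none of it is EduNLP code:

    from transformers import BertModel, BertTokenizer

    # Load the TALEduBERT checkpoint (the local path is an assumption).
    tokenizer = BertTokenizer.from_pretrained("./tal_edu_bert")
    model = BertModel.from_pretrained("./tal_edu_bert")

    # EduNLP's BertTokenizer does the equivalent of this in its __init__,
    # which grows the vocab beyond the checkpoint's embedding matrix.
    tokenizer.add_special_tokens({"additional_special_tokens": ["[FIGURE]", "[TAG]"]})

    inputs = tokenizer("As shown in [FIGURE], solve the equation.", return_tensors="pt")

    # Without model.resize_token_embeddings(len(tokenizer)), the ids of the new
    # tokens fall outside the embedding layer and the forward pass raises
    # "IndexError: index out of range in self".
    outputs = model(**inputs)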

I've got two possible solutions: 1) simply treat these tokens as [UNK] in TALEduBERT, which would require some changes in BertTokenizer; 2) apply resize_token_embeddings to the original TALEduBERT, then save and upload the new checkpoint to the model-hub. 😢 But neither seems quite right. What do you think?

Error Message

    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
    IndexError: index out of range in self

To Reproduce

http://base.ustc.edu.cn/data/model_zoo/modelhub/bert_pub/1/tal_edu_bert.zip I haven't pushed my commits yet, so you can download the model and try it yourself :)

What have you tried to solve it?

I've stated it in the Description above.

Environment

This issue is not related to the environment.

nnnyt commented 2 years ago

Since TALEduBERT was pre-trained on educational data by others, it did not use a tokenizer with these special tokens during pre-training. It is probably not suitable to pass a sentence containing these specially designed tokens into it, or to save a new checkpoint whose special-token embeddings are untrained.

Here are my suggestions to solve this bug:

  1. I think it can be solved by simply adding a parameter add_special_tokens to BertTokenizer. When you want to use it directly with another pre-trained BERT-based model (e.g. TALEduBERT), set add_special_tokens to False. Besides, when the parameter is False, we need to modify the code in BertTokenizer so that items are not passed through our PureTextTokenizer (a rough sketch follows this list).
  2. After the modification I suggest in 1, maybe we need to remove the code that resizes the embedding layer (if tokenizer: self.model.resize_token_embeddings(len(tokenizer.tokenizer))) in Vector/t2v.BertModel. When we set the pretrained_model parameter of BertModel to our own pre-trained model (e.g. luna_bert), the embedding layer has already been resized during pre-training and saved, so we can load it directly without resizing it again. For other models (e.g. TALEduBERT), if we don't add these special tokens as suggested in 1, we don't need to resize it either.
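
To make suggestion 1 concrete, here is a rough sketch of how the flag might be wired in. The attribute names and the pure-text pre-processing stub are assumptions drawn from this thread, not the actual EduNLP implementation:

    from transformers import BertTokenizer as HFBertTokenizer

    class BertTokenizer(object):
        def __init__(self, pretrained_model="bert-base-chinese", add_special_tokens=True):
            self.tokenizer = HFBertTokenizer.from_pretrained(pretrained_model)
            self.add_special_tokens = add_special_tokens
            if add_special_tokens:
                # Only extend the vocab for models we pre-train ourselves
                # (e.g. luna_bert), not for external checkpoints like TALEduBERT.
                self.tokenizer.add_special_tokens(
                    {"additional_special_tokens": ["[FIGURE]", "[TAG]"]}
                )

        def _pure_text_tokenize(self, item):
            # Placeholder for the PureTextTokenizer pre-processing step.
            return item

        def __call__(self, item, **kwargs):
            if self.add_special_tokens:
                # Only route items through PureTextTokenizer when the special
                # tokens actually exist in the vocab.
                item = self._pure_text_tokenize(item)
            return self.tokenizer(item, **kwargs)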

I'd like to hear your opinions on these solutions. @pingzhiLi @KenelmQLH

pingzhili commented 2 years ago

I agree with you.

KenelmQLH commented 2 years ago

Yes, I agree with @nnnyt. What we need to note is that a specific pre-trained model should be consistent with its specific pre-trained tokenizer. Here is my advice, @pingzhiLi:

  1. Check whether TALEduBERT makes any special changes to huggingface's BertTokenizer.
  2. It is recommended to separate Pretrain/BertTokenizer and Vector/BertModel.
  3. In EduNLP's BertTokenizer, pretrain_model="bert-base-chinese" is used only when we pre-train a Bert model ourselves; when we use BertTokenizer with an already pre-trained model, pretrain_model (which includes the tokenizer config and vocab) is the one we trained.

    If the initialization params are too different for luna-bert and tal-bert, I think you can consider adding a class method like from_pretrained to EduNLP's BertTokenizer, which can work in I2V and is simpler for users of BertTokenizer (a sketch follows below).
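
A possible shape for that entry point, extending the BertTokenizer sketch above; the parameter names and config handling are illustrative, not EduNLP's real signature:

    class BertTokenizer(object):
        ...  # __init__ and __call__ as in the sketch above

        @classmethod
        def from_pretrained(cls, tokenizer_config_dir, add_special_tokens=False, **kwargs):
            # The directory holds the vocab and tokenizer config saved together
            # with the pre-trained model, so the tokenizer stays consistent
            # with that specific checkpoint.
            return cls(pretrained_model=tokenizer_config_dir,
                       add_special_tokens=add_special_tokens,
                       **kwargs)

Then I2V and end users could both call BertTokenizer.from_pretrained(path) without caring which initialization params each checkpoint needs.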

By the way, remember to update the examples and tests after you separate the Tokenizer and T2V, such as:

    >>> from EduNLP.Pretrain import BertTokenizer
    >>> from EduNLP.Vector.t2v import BertModel
    >>> tokenizer = BertTokenizer("bert-base-chinese")
    >>> model = BertModel("bert-base-chinese", tokenizer=tokenizer)
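
For reference, after the separation the example might end up looking something like this; it is purely illustrative, using the import path implied by Vector/t2v.BertModel and the add_special_tokens flag proposed above:

    >>> from EduNLP.Pretrain import BertTokenizer
    >>> from EduNLP.Vector.t2v import BertModel
    >>> tokenizer = BertTokenizer("bert-base-chinese", add_special_tokens=True)
    >>> model = BertModel("bert-base-chinese")  # no tokenizer argument, no resize inside T2V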