Because TALEduBERT is already pre-trained on educational data by others, it doesn't use a tokenizer with these special tokens during pre-training. It may not be suitable to pass a sentence with these specially-designed tokens into it, or to save a new model with our special tokens left untrained.
Here are my suggestions to solve this bug:

1. Add a parameter `add_special_tokens` to `BertTokenizer`. When you want to use it for other pre-trained BERT-based models (e.g. `TALEduBERT`) directly, set `add_special_tokens` to `False`. Besides, when the parameter is `False`, we need to modify the code in `BertTokenizer` to avoid passing the items to our `PureTextTokenizer` (see the sketch below).
2. Remove the resizing code (`if tokenizer: self.model.resize_token_embeddings(len(tokenizer.tokenizer))`) in `Vector/t2v.BertModel`. When we set the parameter `pretrained_model` of `BertModel` to our own pre-trained model (e.g. luna_bert), the embedding layer has already been resized during pre-training and saved, so we can load it directly without resizing it again. For other models (e.g. TALEduBERT), if we don't add these special tokens as I suggest in 1, we don't need to resize it either.

Hope to know your opinion about my solutions. @pingzhiLi @KenelmQLH
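A minimal sketch of how the `add_special_tokens` switch from suggestion 1 could behave, assuming EduNLP's `BertTokenizer` keeps the huggingface tokenizer in `self.tokenizer` (as the bug description below indicates); the class body, the helper method, and the token list are illustrative placeholders rather than the real EduNLP code:

```python
# Hypothetical sketch of suggestion 1, not the actual EduNLP implementation:
# an add_special_tokens switch that controls whether the EduNLP-specific tokens
# are registered in the underlying huggingface tokenizer and whether the
# PureTextTokenizer-style pre-processing is applied.
from transformers import AutoTokenizer

EDU_SPECIAL_TOKENS = ["[FIGURE]", "[TAG]"]  # the tokens discussed in this issue


class BertTokenizer(object):
    def __init__(self, pretrain_model="bert-base-chinese", add_special_tokens=True):
        self.tokenizer = AutoTokenizer.from_pretrained(pretrain_model)
        self.add_special_tokens = add_special_tokens
        if add_special_tokens:
            # only grow the vocab when the model's embeddings will be resized/trained
            self.tokenizer.add_special_tokens(
                {"additional_special_tokens": EDU_SPECIAL_TOKENS}
            )

    def __call__(self, item, **kwargs):
        if self.add_special_tokens:
            # placeholder for the PureTextTokenizer step that inserts [FIGURE]/[TAG]
            item = self._pure_text_tokenize(item)
        # with add_special_tokens=False (e.g. TALEduBERT) the raw text goes straight
        # to the huggingface tokenizer, so no out-of-range special-token ids appear
        return self.tokenizer(item, return_tensors="pt", **kwargs)

    @staticmethod
    def _pure_text_tokenize(item):
        return item  # stand-in only
```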
agree with you
Yes, I agree with @nnnyt . What we need to note is that a specific pre-trained model should be consistent with its specific pre-trained tokenizer. Here is my advice @pingzhiLi :

1. For `TALEduBERT`, add a special change for huggingface's `BertTokenizer` in `Pretrain/BertTokenizer` and `Vector/BertModel`.
2. In EduNLP's `BertTokenizer`, `pretrain_model="bert-base-chinese"` is used only when we pretrain a Bert model; but when we use `BertTokenizer` for a pretrained model, `pretrain_model` (which includes the tokenizer config and vocab) is what we trained.
3. If the initialization params are too different for luna-bert and tal-bert, I think you can consider adding a class function like `from_pretrained` for EduNLP's `BertTokenizer`, which can work in I2V and is simpler for users of `BertTokenizer` (see the sketch below).
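A rough sketch of such a `from_pretrained` class function, again assuming the huggingface tokenizer lives in `self.tokenizer`; the directory path and everything besides the `from_pretrained` idea are hypothetical, not existing EduNLP code:

```python
# Illustrative sketch only: a from_pretrained entry point so that I2V and end
# users do not need to know the different initialization params of luna-bert
# and tal-bert.
from transformers import AutoTokenizer


class BertTokenizer(object):
    def __init__(self, pretrain_model="bert-base-chinese", add_special_tokens=True):
        self.tokenizer = AutoTokenizer.from_pretrained(pretrain_model)
        self.add_special_tokens = add_special_tokens

    @classmethod
    def from_pretrained(cls, tokenizer_config_dir):
        # load the tokenizer config and vocab saved next to a pre-trained model,
        # without re-adding any special tokens on top of it
        return cls(pretrain_model=tokenizer_config_dir, add_special_tokens=False)


# hypothetical usage, e.g. inside I2V:
# tokenizer = BertTokenizer.from_pretrained("path/to/tal_edu_bert")
```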
By the way, remember to rectify the examples and tests after you separate `Tokenizer` and `T2V`, such as:
>>> from EduNLP.Pretrain import BertTokenizer
>>> from EduNLP.Vector import BertModel  # assumed import path, based on Vector/t2v.BertModel mentioned above
>>> tokenizer = BertTokenizer("bert-base-chinese")
>>> model = BertModel("bert-base-chinese", tokenizer=tokenizer)
🐛 Description
@nnnyt @KenelmQLH After I added TALEduBERT to our project and did some tests, I found that the current `get_pretrained_i2v` function will return an unmatched `BertTokenizer` and `BertT2V` (regarding special tokens). More specifically: `[FIGURE]` and `[TAG]` are added to `self.tokenizer` (which is a huggingface tokenizer). In my case this increases the size of the tokenizer, since these tokens did not exist in TALEduBERT, so they will be tokenized to ids outside the range of the embedding layer. The usual fix is to call `model.resize_token_embeddings(len(tokenizer))` after `tokenizer.add_special_tokens()`, and indeed there is one in `Vector/t2v.BertModel` (`if tokenizer: self.model.resize_token_embeddings(len(tokenizer.tokenizer))`).
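For illustration, a self-contained reproduction of this kind of mismatch with plain huggingface transformers (the checkpoint name and example text are stand-ins, not the actual EduNLP code path):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# stand-in checkpoint; the same pattern applies to TALEduBERT
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

# the EduNLP-style special tokens that enlarge the tokenizer
tokenizer.add_special_tokens({"additional_special_tokens": ["[FIGURE]", "[TAG]"]})
print(len(tokenizer), model.get_input_embeddings().num_embeddings)
# len(tokenizer) is now larger than the embedding matrix, so feeding the new
# token ids raises "IndexError: index out of range in self"

inputs = tokenizer("see [FIGURE] for this item", return_tensors="pt")
# model(**inputs)                                # would fail here
model.resize_token_embeddings(len(tokenizer))    # the fix that couples T2V to the tokenizer
with torch.no_grad():
    outputs = model(**inputs)                    # now runs without the IndexError
```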
However, as @KenelmQLH required, "T2V has to be separated from tokenizer".

I've got two solutions here:
1) simply treat these tokens as `[UNK]` in TALEduBERT, which may require some changes in `BertTokenizer`;
2) do `resize_token_embeddings` on the original TALEduBERT, then save and upload the new one to the model hub. 😢

But both seem not so proper, what do you think?

Error Message
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
To Reproduce
http://base.ustc.edu.cn/data/model_zoo/modelhub/bert_pub/1/tal_edu_bert.zip
I haven't pushed my commits yet; you may download the model and try it yourself :)
What have you tried to solve it?
I've stated it in the Description.
Environment
This is not related to the environment.