benywon / ChineseBert

This is a Chinese BERT model specifically for question answering

a question about self.vocab #3

Open tomtang110 opened 5 years ago

tomtang110 commented 5 years ago

[screenshot of the relevant code]

Could you explain why you add self.vocab_size between question id and answer id?

benywon commented 5 years ago

> Could you explain why you add self.vocab_size between question id and answer id?

The self.vocab_size is just a padding symbol to separate the question and the answer.
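To make the idea concrete, here is a minimal sketch (not the repository's exact code, and the variable names are illustrative): the id equal to `self.vocab_size` lies just outside the normal id range `[0, vocab_size)`, so it cannot collide with any real word and can serve purely as a boundary marker between question and answer; the embedding table then needs `vocab_size + 1` rows.

```python
# Minimal sketch of using vocab_size as a separator id between question and answer.
# All names here are illustrative, not taken from the repository.

vocab_size = 57777  # size of the word2id vocabulary (illustrative)

def build_input(question_ids, answer_ids, vocab_size):
    # [q1, q2, ..., SEP, a1, a2, ...] where SEP == vocab_size
    return question_ids + [vocab_size] + answer_ids

question_ids = [12, 845, 3031]   # hypothetical tokenized question
answer_ids = [77, 9120]          # hypothetical tokenized answer
print(build_input(question_ids, answer_ids, vocab_size))
# -> [12, 845, 3031, 57777, 77, 9120]
```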

tomtang110 commented 5 years ago

I'd like to ask: is the vocabulary you trained on tied to your word2id.obj file? If I build my own word2id, can I still use your model? Mainly, I noticed the vocabulary only has 57777 words, which feels a bit small.

benywon commented 5 years ago

Definitely, it is tied to that file. A different word2id would map the same word to a different id, so you should use my word2id.obj. BTW, 57777 words is not very small, since we use the sentencepiece tokenizer, so OOV is not a problem.
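For readers following along, a hedged sketch of reusing the released vocabulary, assuming `word2id.obj` is a pickled dict mapping sentencepiece tokens to integer ids (the actual format and unknown-token convention in this repo may differ):

```python
# Hedged sketch: load the released vocabulary and map tokens to ids.
# The file format and unk_id here are assumptions, not confirmed from the repo.
import pickle

with open('word2id.obj', 'rb') as f:
    word2id = pickle.load(f)  # assumed: dict of sentencepiece token -> int id

def encode(tokens, word2id, unk_id=0):
    # Map each sentencepiece token to its id; unseen pieces fall back to unk_id.
    return [word2id.get(tok, unk_id) for tok in tokens]
```

The key point of the reply is that the model's embeddings were trained against these specific ids, so swapping in a different word2id would scramble the word-to-embedding correspondence.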

tomtang110 commented 5 years ago

But I need more than 450000 words, and 57777 out of 450000 is far too few. It is quite disappointing. Therefore, for most companies, I think BERT is still difficult to train, or even to fine-tune.

benywon commented 5 years ago

Oh, that's too bad. If you need your own vocab, this application may not be suitable for you. Nevertheless, you can use my code to train your own BERT.

tomtang110 commented 5 years ago

Haha. But my company does not have such abundant hardware. My boss told me they would bring in cloud servers next year, but by then I will have finished my internship. Actually, I have already built a machine reading comprehension system using QANet, which is based on the Transformer, as the model, so I would like to try training BERT on the DuReader dataset. However, it seems too expensive to train.