Open tomtang110 opened 5 years ago
Could you explain why you add self.vocab_size between the question ids and the answer ids?
self.vocab_size is just a padding/separator symbol used to separate the question from the answer.
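For context, a minimal sketch of what this could look like (the names `question_ids`, `answer_ids`, and the concrete numbers are illustrative, not the repo's actual code):

```python
# Illustrative sketch, not the repo's exact code.
# The separator id is chosen as vocab_size, i.e. one past the largest real
# token id, so it can never collide with a real word id. The embedding table
# must therefore have vocab_size + 1 rows to cover the extra symbol.
vocab_size = 57777

question_ids = [12, 845, 3301]   # hypothetical token ids of the question
answer_ids = [77, 4520]          # hypothetical token ids of the answer

input_ids = question_ids + [vocab_size] + answer_ids
print(input_ids)  # [12, 845, 3301, 57777, 77, 4520]
```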
I'd like to ask: are the words you trained on only those in your word2id.obj file? If I build my own word2id, can I still use your model? Mainly, I noticed there are only 57777 words, which feels a bit small.
Definitely, the training is tied to my word2id.obj: a different word2id would project the same word to a different id, so you should use my word2id.obj. BTW, 57777 words is not very small, since we use the sentencepiece tokenizer, so OOV is not a problem.
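A toy illustration of the mismatch (these dicts are made up, not the contents of word2id.obj): when the same word maps to a different id, the pretrained embedding rows no longer line up with the words being fed in, which is why swapping in your own vocabulary breaks the pretrained weights.

```python
# Toy example (not the real word2id.obj): the same word gets a different id
# under a different vocabulary, so the pretrained embedding row learned for
# id 5 would be looked up for an entirely different word.
word2id_pretrained = {"<pad>": 0, "weather": 5, "beijing": 9}
word2id_custom     = {"<pad>": 0, "beijing": 5, "weather": 9}

print(word2id_pretrained["weather"], word2id_custom["weather"])  # 5 vs 9 -> embeddings misaligned
```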
But I need more than 450000 words, and 57777 out of 450000 is very few. That's disappointing. For most companies, I think BERT is still difficult to train, or even to fine-tune.
Oh, that's too bad. If you need your own vocab, this application may not be suitable for you. Nevertheless, you can use my code to train your own BERT.
Haha, but my company doesn't have such abundant hardware. My boss told me they will introduce cloud servers next year, but by then I will have finished my internship. Actually, I have built a machine reading comprehension system; I used QANet, which is based on the transformer, as the model, so I would like to try training BERT on the DuReader dataset. However, the cost of training it seems too high.