Closed: yyht closed this issue 6 years ago
Hi. That corpus is too small for the pretraining stage. I think you need millions of sentences, at least one million. It's easy to collect raw data for pretraining, as long as each line contains a document or one or more sentences.
It's also common sense to use a large corpus when training word embeddings, and the same applies to pretraining a language model.
Let me know the results after you pretrain the masked language model with more data.
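For reference, here is a minimal sketch (not code from this repo) of preparing a corpus file in the layout described above, i.e. one document or group of sentences per line; the file names are placeholders.

```python
# Sketch only: write a pretraining corpus with one document / sentence(s) per line.
# "raw_docs.txt" and "pretrain_corpus.txt" are hypothetical file names.

def build_pretrain_corpus(raw_path="raw_docs.txt", out_path="pretrain_corpus.txt"):
    kept = 0
    with open(raw_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for line in fin:
            text = line.strip()
            if not text:                 # drop blank lines
                continue
            fout.write(text + "\n")      # one document or sentence group per line
            kept += 1
    print("wrote %d lines for masked-LM pretraining" % kept)

if __name__ == "__main__":
    build_pretrain_corpus()
```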
Hi, I tried your bert_model instead of bert_cnn_model. bert_model reached about 75% F1 on the language model task, but when I used the pretrained bert_model to fine-tune on the classification task, it didn't work: the F1 score was still around 10% after several epochs. Is something wrong with bert_model?
Hi, I tried to use your bert_cnn_model to train on my corpus: 900,000 sentences with a 300,000-word vocabulary, and an average sentence length of 30 after tokenization. But the model seems to be stuck in a local minimum; the accuracy on the validation set just fluctuates after the first 5 epochs.