Closed RyanHuangNLP closed 4 years ago
Hi, I have run into the same problem. Have you solved it? If so, please help me! @RyanHuangNLP
fairseq is not very friendly for everyday usage... I am also wondering how to integrate other Chinese pretrained language models into the fairseq framework...
Here is my implementation for Chinese BERT; it may help you: https://github.com/CheungZeeCn/fairseq/blob/master/Sentence_prediction_Chinese_bert_demo.md @RyanHuangNLP @zhazhuanling
I ported Chinese-BERT-wwm/Chinese-RoBERTa-wwm to Fairseq. I made minimal changes to Fairseq so that it is compatible with (Chinese) BERT-based tasks, and added an option for the Chinese Whole Word Masking strategy: https://github.com/Once2gain/Chinese-RoBERTa-wwm-Fairseq It might help you.
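For context, here is a toy illustration of the Chinese whole-word-masking idea (this is not the code from the repo above): segment the text with a word segmenter such as jieba and mask all characters of a chosen word together, instead of masking characters independently.

```python
import random
import jieba

def whole_word_mask(text, mask_prob=0.15, mask_token="[MASK]"):
    """Mask whole words: every character of a selected word is masked together."""
    words = jieba.lcut(text)                        # e.g. ["语言", "模型", ...]
    tokens = [ch for word in words for ch in word]  # character-level tokens
    # Character span (start, end) covered by each word.
    spans, start = [], 0
    for word in words:
        spans.append((start, start + len(word)))
        start += len(word)

    budget = max(1, round(len(tokens) * mask_prob))
    masked = list(tokens)
    for begin, end in random.sample(spans, len(spans)):
        if budget <= 0:
            break
        for i in range(begin, end):
            masked[i] = mask_token
        budget -= end - begin
    return tokens, masked
```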
❓ Questions and Help
Before asking:
I want to train a Chinese RoBERTa. Following the pretraining tutorial, I was wondering how to generate these three files for Chinese:
https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt
I found that the vocab file is different from the one in Google's BERT.
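For what it's worth, encoder.json and vocab.bpe are specific to the English GPT-2 BPE; a common workaround for Chinese is to skip them, pre-tokenize the corpus with a BERT tokenizer, and only build dict.txt from an existing vocab. A minimal sketch, assuming you reuse Google's vocab.txt (file names here are illustrative):

```python
# Write Google's vocab.txt in fairseq's dict.txt format ("<token> <count>").
# A dummy count of 1 is fine when pretraining from scratch; fairseq adds its
# own special symbols on top of this file when it loads the dictionary.
with open("vocab.txt", encoding="utf-8") as fin, \
        open("dict.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        token = line.rstrip("\n")
        if token:
            fout.write(f"{token} 1\n")
```

The resulting dict.txt can then be passed to fairseq-preprocess via --srcdict.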
I have tried to change the `MultiprocessingEncoder` class in the script `multiprocessing_bpe_encoder.py` to use `BertTokenizer` (see the Code section below), and preprocessed the corpus with Google's vocab.txt. But when I start training the RoBERTa model, the vocab embedding becomes `[21133, 768]`, while Google's vocab.txt has only 21128 tokens. Does fairseq add 5 special tokens? When I convert the weights to the transformers format, I found that the token counts do not match.
Another question: I also wonder how to initialize the weights from another pretrained Chinese BERT, because the token embedding sizes are not equal (21133 != 21128) and the max position embedding sizes are also not equal (514 != 512).
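The 5 extra rows come from fairseq's own special symbols: the dictionary prepends <s>, <pad>, </s>, <unk> and the masked_lm task appends <mask> (21128 + 5 = 21133), and fairseq's learned positional embedding reserves two rows for the padding offset (512 + 2 = 514). Below is a rough sketch of copying a pretrained Chinese BERT's embeddings into the larger fairseq matrices; the state-dict key names and file paths are assumptions and vary with the fairseq/transformers versions (older fairseq releases use `decoder.sentence_encoder.*` instead of `encoder.sentence_encoder.*`).

```python
import torch

# Sketch: seed fairseq RoBERTa embeddings with a pretrained Chinese BERT.
bert = torch.load("bert-base-chinese/pytorch_model.bin", map_location="cpu")
ckpt = torch.load("checkpoints/checkpoint_last.pt", map_location="cpu")
model = ckpt["model"]

bert_word = bert["bert.embeddings.word_embeddings.weight"]         # [21128, 768]
bert_pos = bert["bert.embeddings.position_embeddings.weight"]      # [512, 768]
fs_word = model["encoder.sentence_encoder.embed_tokens.weight"]    # [21133, 768]
fs_pos = model["encoder.sentence_encoder.embed_positions.weight"]  # [514, 768]

with torch.no_grad():
    # If dict.txt kept vocab.txt's order, fairseq indices are BERT indices
    # shifted by the 4 prepended specials, with <mask> as the last row.
    fs_word[4:4 + bert_word.size(0)] = bert_word

    # Optionally map fairseq's specials onto BERT's (standard Google vocab
    # layout: [PAD]=0, [UNK]=100, [CLS]=101, [SEP]=102, [MASK]=103).
    fs_word[0] = bert_word[101]   # <s>    <- [CLS]
    fs_word[1] = bert_word[0]     # <pad>  <- [PAD]
    fs_word[2] = bert_word[102]   # </s>   <- [SEP]
    fs_word[3] = bert_word[100]   # <unk>  <- [UNK]
    fs_word[-1] = bert_word[103]  # <mask> <- [MASK]

    # The first two positional rows are the padding offset; the remaining
    # 512 rows line up with BERT's 512 positions.
    fs_pos[2:] = bert_pos

torch.save(ckpt, "checkpoints/checkpoint_init.pt")
```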
Code
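A minimal sketch of the kind of encoder change described above, assuming the Hugging Face transformers library (file names are illustrative): tokenize raw Chinese text with BertTokenizer and emit space-separated WordPiece tokens, which fairseq-preprocess can then binarize against dict.txt.

```python
from transformers import BertTokenizer

# Pre-tokenize raw Chinese text into space-separated WordPiece tokens so
# that fairseq-preprocess can binarize it against dict.txt.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

with open("corpus.raw", encoding="utf-8") as fin, \
        open("corpus.bert", "w", encoding="utf-8") as fout:
    for line in fin:
        line = line.strip()
        if not line:
            continue
        tokens = tokenizer.tokenize(line)  # mostly character-level for Chinese
        fout.write(" ".join(tokens) + "\n")
```

Something like fairseq-preprocess --only-source --srcdict dict.txt --trainpref corpus.bert --destdir data-bin/zh --workers 16 should then binarize the corpus, mirroring the RoBERTa pretraining tutorial.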
What have you tried?
What's your environment?
How you installed fairseq (pip, source):