facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

How to train a Chinese RoBERTa #2196

Closed RyanHuangNLP closed 4 years ago

RyanHuangNLP commented 4 years ago

❓ Questions and Help

Before asking:

  1. search the issues.
  2. search the docs.

I want to train a Chinese RoBERTa. Following the pretraining tutorial, I was wondering how to generate these three files for Chinese:

https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/encoder.json
https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/vocab.bpe
https://dl.fbaipublicfiles.com/fairseq/gpt2_bpe/dict.txt

I found that the vocab file is different from the one in Google's BERT.
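A minimal sketch of what I assume the conversion should look like (as far as I can tell, fairseq's dict.txt stores one "token count" pair per line, and the count is only used for optional frequency pruning, so a dummy value should be enough):

# assumed conversion: one "token count" line per wordpiece in Google's vocab.txt
with open("vocab.txt", encoding="utf-8") as fin, \
        open("dict.txt", "w", encoding="utf-8") as fout:
    for token in (line.rstrip("\n") for line in fin):
        if token:
            fout.write(f"{token} 1\n")  # dummy count; only needed for pruning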

I have tried changing the MultiprocessingEncoder class in the script multiprocessing_bpe_encoder.py to use BertTokenizer (the change is in the code below), and preprocessed the corpus with Google's vocab.txt.

But when I start training the RoBERTa model, the vocab embedding becomes [21133, 768], while Google's vocab.txt has only 21128 tokens. Does fairseq add 5 special tokens? When I convert the weights to the transformers format, I find that the token counts are not equal.
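A minimal sketch (using fairseq's Dictionary API) of where I think the extra 5 entries come from: Dictionary() always reserves <s>, <pad>, </s> and <unk>, and the masked_lm task adds <mask> on top of the loaded dict.txt:

from fairseq.data import Dictionary

d = Dictionary.load("dict.txt")   # dict.txt built from Google's 21128-token vocab.txt
print(len(d))                     # 21128 + 4 reserved specials = 21132
d.add_symbol("<mask>")            # the masked_lm task adds <mask>
print(len(d))                     # 21133 -> matches the [21133, 768] embedding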

Another question

I also wonder how to initialize the weights from another pretrained Chinese BERT, because the token embedding size is not equal (21133 != 21128) and the max position embedding is also not equal (514 != 512).
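A rough sketch of what I imagine the remapping would look like (the checkpoint paths are hypothetical and the actual state_dict keys may differ between versions):

import torch

bert = torch.load("chinese_bert.pt", map_location="cpu")        # hypothetical HF-style checkpoint
roberta = torch.load("fairseq_roberta.pt", map_location="cpu")  # hypothetical fairseq checkpoint

# Token embeddings: fairseq reserves <s>, <pad>, </s>, <unk> and appends <mask>,
# so the 21128 BERT rows would have to be re-indexed row by row following the
# new dict.txt order, not copied as one block into the 21133-row matrix.

# Position embeddings: fairseq's learned positions start after the padding index,
# so its 514-row table holds 512 usable positions beginning at row 2.
with torch.no_grad():
    pos_new = roberta["model"]["encoder.sentence_encoder.embed_positions.weight"]
    pos_old = bert["bert.embeddings.position_embeddings.weight"]  # shape [512, 768]
    pos_new[2:2 + pos_old.size(0)] = pos_old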

Code

from transformers import BertTokenizer


class MultiprocessingEncoder(object):

    def __init__(self, args):
        self.args = args

    def initializer(self):
        # Build the tokenizer once per worker; the global mirrors how
        # fairseq's multiprocessing_bpe_encoder.py shares its BPE object.
        global bpe
        bpe = BertTokenizer.from_pretrained('bert-base-chinese')

    def encode(self, line):
        global bpe
        # WordPiece-tokenize a raw line into subword strings.
        return bpe.tokenize(line)

    def decode(self, tokens):
        global bpe
        # Map a sequence of token ids back to a string.
        return bpe.decode(tokens)

    def encode_lines(self, lines):
        """
        Encode a set of lines. All lines will be encoded together.
        """
        enc_lines = []
        for line in lines:
            line = line.strip()
            if len(line) == 0 and not self.args.keep_empty:
                return ["EMPTY", None]
            tokens = self.encode(line)
            enc_lines.append(" ".join(tokens))
        return ["PASS", enc_lines]

    def decode_lines(self, lines):
        dec_lines = []
        for line in lines:
            tokens = map(int, line.strip().split())
            dec_lines.append(self.decode(tokens))
        return ["PASS", dec_lines]

What have you tried?

What's your environment?

zhazhuanling commented 4 years ago

Hi, I have run into this problem too. Have you solved it? If so, please help me! @RyanHuangNLP

CheungZeeCn commented 4 years ago

fairseq is not so friendly for common use cases... I am also wondering how to integrate other Chinese PLMs into the fairseq framework...

CheungZeeCn commented 3 years ago

Here is my implementation for Chinese BERT; it may help you: https://github.com/CheungZeeCn/fairseq/blob/master/Sentence_prediction_Chinese_bert_demo.md @RyanHuangNLP @zhazhuanling

Once2gain commented 11 months ago

I ported Chinese-BERT-wwm/Chinese-RoBERTa-wwm to Fairseq. I made minimal changes to Fairseq so that it is compatible with (Chinese) BERT-based tasks, and added an option for the Chinese Whole Word Masking strategy: https://github.com/Once2gain/Chinese-RoBERTa-wwm-Fairseq It could help you.