codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation
Apache License 2.0

When training the masked LM, are the unmasked words (which have label 0) trained together with the masked words? #29

Open coddinglxf opened 5 years ago

coddinglxf commented 5 years ago

According to the code

    def random_word(self, sentence):
        tokens = sentence.split()
        output_label = []

        for i, token in enumerate(tokens):
            prob = random.random()
            if prob < 0.15:
                # rescale prob to [0, 1) within the 15% of tokens that get replaced
                prob /= 0.15

                # 80% of the time, change token to the mask token
                if prob < 0.8:
                    tokens[i] = self.vocab.mask_index

                # 10% of the time, change token to a random token
                elif prob < 0.9:
                    tokens[i] = random.randrange(len(self.vocab))

                # 10% of the time, keep the current token
                else:
                    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)

                output_label.append(self.vocab.stoi.get(token, self.vocab.unk_index))

            else:
                tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
                output_label.append(0)

        return tokens, output_label

Do we need to exclude the unmasked words when training the LM?

codertimo commented 5 years ago

@coddinglxf I just solved that problem with nn.NLLLoss(ignore_index=0), where 0 is equal to the pad_index. Even if we target 0 (the unmasked value), it doesn't contribute to the loss during backpropagation.
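A minimal sketch (not from the repository) of how ignore_index skips the zero-labelled, unmasked positions; the shapes below are illustrative assumptions:

    import torch
    import torch.nn as nn

    # Hypothetical shapes: batch of 2 sequences of length 4, vocabulary of 10 tokens.
    log_probs = torch.randn(2, 4, 10).log_softmax(dim=-1)   # model output as log-probabilities
    labels = torch.tensor([[0, 5, 0, 0],
                           [0, 0, 7, 0]])                    # 0 marks unmasked/pad positions

    criterion = nn.NLLLoss(ignore_index=0)
    # NLLLoss expects (N, C, ...), so move the vocabulary dimension next to the batch dimension.
    loss = criterion(log_probs.transpose(1, 2), labels)
    # Only the positions labelled 5 and 7 contribute to the loss; the zero-labelled ones are skipped.

So the unmasked positions still pass through the output projection and softmax, but their loss terms are dropped before backpropagation.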

coddinglxf commented 5 years ago

Got it, thanks. However, when the batch size is large, the final LM output will be (batch_size, seq_len, vocab_size), and such a matrix is too big. Maybe we can record the indices of the masked words when preprocessing the text, and then use indexing to save memory.

codertimo commented 5 years ago

@coddinglxf That's what I thought at first, but I couldn't implement it as efficiently as the plain GPU computation. If you have any idea, please implement it and open a pull request :) It would be really cool to do it 👍

coddinglxf commented 5 years ago

@codertimo The output of the transformer block is (batch, seq, hidden).

Step 1: reshape the transformer output to (batch*seq, hidden).

Step 2: index the masked words. For the 0-th sentence, the i-th masked word has index 0 * seq + i; for the j-th sentence, the i-th masked word has index j * seq + i.

Combining step 1 and step 2, we can get the hidden output of only the masked words (using fancy indexing).

The problem is that we cannot do this with multiple GPUs.
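A rough single-GPU sketch of the reshape-and-gather idea described above; the sizes and the masked_positions tensor are made-up assumptions, not code from the repository:

    import torch

    batch, seq, hidden_dim, vocab_size = 2, 8, 16, 100        # illustrative sizes
    hidden = torch.randn(batch, seq, hidden_dim)               # transformer block output

    # Flat indices (j * seq + i) recorded during preprocessing, e.g.
    # sentence 0 has masked tokens at positions 1 and 5, sentence 1 at position 3.
    masked_positions = torch.tensor([1, 5, 1 * seq + 3])

    flat = hidden.reshape(batch * seq, hidden_dim)             # step 1: (batch*seq, hidden)
    masked_hidden = flat[masked_positions]                     # step 2: fancy indexing

    projection = torch.nn.Linear(hidden_dim, vocab_size)
    logits = projection(masked_hidden)                         # (num_masked, vocab_size)

The vocabulary projection is then applied only to the masked rows instead of all batch*seq rows, which is where the memory saving would come from.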

leon-cas commented 5 years ago

> @coddinglxf I just solved that problem with nn.NLLLoss(ignore_index=0), where 0 is equal to the pad_index. Even if we target 0 (the unmasked value), it doesn't contribute to the loss during backpropagation.

Why is output_label 0 here rather than tokens[i]? If we set output_label to 0, does that mean only 15% of the tokens are used to train the masked LM?

    tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
    output_label.append(0)

codertimo commented 5 years ago

@leon-cas Yes, see #36; it resolves your question.