Open coddinglxf opened 5 years ago
@coddinglxf I just solved that problem with nn.NLLLoss(ignore_index=0),
where 0 is equal to pad_index. Even if we target 0 (the unmasked value), it doesn't affect the loss propagation.
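A minimal sketch of that trick (not the repo's exact code; the shapes and the pad index of 0 are assumptions): with `ignore_index=0`, positions whose target is 0 contribute nothing to the loss or its gradient, which is the same as averaging the NLL over only the non-zero targets.

```python
import torch
import torch.nn as nn

criterion = nn.NLLLoss(ignore_index=0)  # 0 == assumed pad_index

log_probs = torch.log_softmax(torch.randn(4, 10), dim=-1)  # (positions, vocab)
targets = torch.tensor([3, 0, 7, 0])  # 0 marks unmasked/pad positions

loss_all = criterion(log_probs, targets)

# Equivalent by hand: average NLL over only the non-ignored targets
mask = targets != 0
loss_manual = -(log_probs[mask, targets[mask]]).mean()
assert torch.allclose(loss_all, loss_manual)
```

Note that with the default `reduction='mean'`, NLLLoss divides by the number of non-ignored targets, so the ignored positions do not even dilute the average.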
got it, thanks. However, when the batch size is large, the final LM output will be (batch_size, seq_len, vocab_size), and such a matrix is too big. Maybe we can record the indices of the masked words when preprocessing the text, and use indexing to save memory.
@coddinglxf that's what I thought at first, but I couldn't implement it as efficiently as the GPU computation allows. If you have any idea, please implement it and open a pull request :) It would be really cool to do it 👍
@codertimo the output of the transformer block will be: (batch, seq, hidden)
step 1: we can reshape the transformer output as: (batch*seq, hidden)
step 2: index the masked words; for the 0-th sentence, the i-th masked word has index 0*seq + i, and for the j-th sentence, the i-th masked word has index j*seq + i
combining step 1 and step 2, we can get the hidden output of the masked words (using fancy indexing)
the question is that we can not do this with multi-GPU
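The two steps above can be sketched as follows (a toy example with assumed shapes and hand-picked masked positions, not the repo's code): only the gathered masked vectors are projected to the vocabulary, so the full (batch, seq, vocab) tensor is never materialized.

```python
import torch

batch, seq, hidden, vocab = 2, 5, 8, 100
transformer_out = torch.randn(batch, seq, hidden)

# Suppose these (sentence j, position i) pairs were recorded at preprocessing time
masked_positions = torch.tensor([[0, 1], [0, 4], [1, 2]])  # (num_masked, 2)

# step 1: flatten (batch, seq, hidden) -> (batch*seq, hidden)
flat = transformer_out.reshape(batch * seq, hidden)

# step 2: flat index of the i-th masked word in sentence j is j*seq + i
flat_idx = masked_positions[:, 0] * seq + masked_positions[:, 1]

masked_hidden = flat[flat_idx]  # fancy indexing -> (num_masked, hidden)

# project only the masked positions to the vocabulary
logits = masked_hidden @ torch.randn(hidden, vocab)  # (num_masked, vocab)
print(masked_hidden.shape)  # torch.Size([3, 8])
```

The multi-GPU problem mentioned above comes from the fact that `num_masked` differs per sentence, so the gathered tensor has a data-dependent shape that is awkward to scatter across replicas.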
@coddinglxf I just solved that problem with
nn.NLLLoss(ignore_index=0)
where 0 is equal to pad_index. Even if we target 0 (the unmasked value), it doesn't affect the loss propagation.
Why is output_label here 0, and not "tokens[i]"? If we set output_label = 0, does that mean only 15% of the training data is used to train the MaskedLM?
tokens[i] = self.vocab.stoi.get(token, self.vocab.unk_index)
output_label.append(0)
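For context, the snippet above comes from a labeling loop along these lines (a hedged, simplified sketch; `stoi`, `unk_index`, `mask_index`, and the 15% probability are assumptions based on the discussion, and the 80/10/10 replacement rule is omitted): unmasked tokens get label 0, which NLLLoss(ignore_index=0) then skips.

```python
import random

def random_word(tokens, stoi, mask_index=4, unk_index=1, mask_prob=0.15):
    """Return (input_ids, output_labels); labels are 0 except at masked spots."""
    ids, labels = [], []
    for token in tokens:
        tid = stoi.get(token, unk_index)
        if random.random() < mask_prob:
            ids.append(mask_index)   # replace the input with [MASK] (simplified)
            labels.append(tid)       # label = original token id
        else:
            ids.append(tid)
            labels.append(0)         # 0 -> ignored by NLLLoss(ignore_index=0)
    return ids, labels
```

So the model still sees 100% of the tokens as input; only the loss is restricted to the ~15% masked positions, which is exactly what the BERT masked-LM objective prescribes.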
@leon-cas yes, see #36 — it's the same question and it's solved there
According to the code, do we need to exclude the unmasked words when training the LM?