Why output_label=0 in datasets generation for Masked LM

codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation

Apache License 2.0


Why output_label=0 in datasets generation for Masked LM #36

Closed leon-cas closed 5 years ago

leon-cas commented 5 years ago

In dataset.py, in the function 'random_word' (line 90), why is the output_label for the 85% of tokens that are not masked set to 0, i.e. output_label.append(0)?

codertimo commented 5 years ago

@leon-cas Because tokens that are not masked shouldn't be trained on by the optimizer, and the 85% rate comes from the paper. We mask out the 0 label, so those positions can't be trained through backpropagation.
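To illustrate how a label of 0 can keep a position out of training, here is a minimal sketch in plain Python. It is not the repository's code: the function name `masked_nll` and the toy numbers are hypothetical, and it assumes that label 0 is reserved to mean "no prediction target". PyTorch's `nn.NLLLoss` and `nn.CrossEntropyLoss` accept an `ignore_index` argument for the same purpose; whether this repository's trainer uses it is not stated in the thread.

```python
import math

IGNORE_LABEL = 0  # assumption: label 0 marks "no prediction target"


def masked_nll(log_probs, labels, ignore_label=IGNORE_LABEL):
    """Average negative log-likelihood over positions whose label is not ignored.

    log_probs: one list of log-probabilities over the vocabulary per position
    labels:    one target token id per position, ignore_label where unmasked
    """
    total, count = 0.0, 0
    for position_log_probs, label in zip(log_probs, labels):
        if label == ignore_label:
            continue  # unmasked position: adds no loss, so it yields no gradient
        total -= position_log_probs[label]
        count += 1
    return total / count if count else 0.0


# Toy vocabulary of 4 ids. Only position 1 was masked; its true id is 2.
uniform = [math.log(0.25)] * 4
confident = [math.log(0.1), math.log(0.1), math.log(0.7), math.log(0.1)]
labels = [0, 2, 0]

loss = masked_nll([uniform, confident, uniform], labels)

# Changing the predictions at the ignored positions leaves the loss unchanged.
skewed = [math.log(0.97), math.log(0.01), math.log(0.01), math.log(0.01)]
loss_other = masked_nll([skewed, confident, skewed], labels)
```

Only the masked position contributes to `loss`, which is why the predictions at the other two positions have no effect on it.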

leon-cas commented 5 years ago

So you mean only 15% of all tokens are used to train the MLM?

codertimo commented 5 years ago

Yes, and it's noted in the paper too.

jiqiujia commented 5 years ago

It means each word in a sentence is masked out with 15% probability and MLM is trained to predict the masked words. Please read the paper carefully.
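The per-token masking described here can be sketched as follows. This is an illustrative sketch, not the repository's 'random_word': the toy vocabulary, the reserved ids, and the name `random_word_sketch` are all hypothetical. The 80% / 10% / 10% split among selected tokens follows the recipe in the BERT paper and is not discussed in this thread.

```python
import random

# Hypothetical toy vocabulary; ids 0, 1, 2 are assumed reserved.
VOCAB = {"<pad>": 0, "<mask>": 1, "<unk>": 2, "the": 3, "cat": 4, "sat": 5, "down": 6}
PAD_ID, MASK_ID, UNK_ID = 0, 1, 2


def random_word_sketch(tokens, rng, mask_prob=0.15):
    """Return (input_ids, output_label) for one tokenized sentence."""
    input_ids, output_label = [], []
    for token in tokens:
        token_id = VOCAB.get(token, UNK_ID)
        if rng.random() < mask_prob:
            roll = rng.random()
            if roll < 0.8:
                input_ids.append(MASK_ID)  # 80%: replace with the mask token
            elif roll < 0.9:
                input_ids.append(rng.randrange(len(VOCAB)))  # 10%: random token id
            else:
                input_ids.append(token_id)  # 10%: keep the original token
            output_label.append(token_id)  # target is the original token id
        else:
            input_ids.append(token_id)  # not selected: input is unchanged
            output_label.append(PAD_ID)  # label 0: no prediction target
    return input_ids, output_label


rng = random.Random(0)  # seeded so the run is repeatable
tokens = ["the", "cat", "sat", "down"] * 2500
input_ids, output_label = random_word_sketch(tokens, rng)

# Fraction of positions that carry a real prediction target.
selected_fraction = sum(label != 0 for label in output_label) / len(output_label)
```

Over a long input, `selected_fraction` should come out close to 0.15. The remaining positions still pass through the model as context but carry label 0.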

leon-cas commented 5 years ago

@jiqiujia @codertimo thanks, guys.