codertimo / BERT-pytorch

Google AI 2018 BERT pytorch implementation

Why does language_model.py use different vectors? #91

Open zysNLP opened 3 years ago

zysNLP commented 3 years ago

In language_model.py, class NextSentencePrediction and class MaskedLanguageModel take different inputs in their forward functions. NextSentencePrediction uses x[:, 0] in "return self.softmax(self.linear(x[:, 0]))", but MaskedLanguageModel uses the full x in "return self.softmax(self.linear(x))". Is there something wrong here?
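
For reference, here is a minimal sketch of the two heads, paraphrasing the snippets quoted above (hidden size and constructor arguments are assumed, not a verbatim copy of language_model.py):

```python
import torch.nn as nn

class NextSentencePrediction(nn.Module):
    """Sentence-level classifier: is sentence B the real next sentence?"""
    def __init__(self, hidden):
        super().__init__()
        self.linear = nn.Linear(hidden, 2)           # two classes: is-next / not-next
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        # x: (batch_size, seq_len, hidden) -> keep only the first token's vector
        return self.softmax(self.linear(x[:, 0]))    # (batch_size, 2)

class MaskedLanguageModel(nn.Module):
    """Token-level classifier: predict the original word at every position."""
    def __init__(self, hidden, vocab_size):
        super().__init__()
        self.linear = nn.Linear(hidden, vocab_size)  # one prediction per token
        self.softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x):
        # x: (batch_size, seq_len, hidden) -> a distribution over the vocab per position
        return self.softmax(self.linear(x))          # (batch_size, seq_len, vocab_size)
```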

As I debugged this, both x's have shape (batch_size, seq_len, embedding_dim), e.g. (64, 50, 256). We know this means I have 64 sentences, each sentence has 50 words, and each word is a 256-dim vector. But x[:, 0] means taking only the first word of each of the 64 sentences, so x[:, 0] has shape (64, 256). I don't understand why the NextSentencePrediction task should use this kind of input. Can someone help me explain this?
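
A quick shape check with the example sizes above (64, 50, 256 are just the numbers from my debugging):

```python
import torch

x = torch.randn(64, 50, 256)   # (batch_size, seq_len, embedding_dim)

print(x.shape)        # torch.Size([64, 50, 256]) -> one 256-dim vector per token
print(x[:, 0].shape)  # torch.Size([64, 256])     -> only the first token of each sentence
```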

boykis82 commented 3 years ago

x[:, 0] already carries the semantics of the whole sequence x[:, 0:50] because of self-attention: the first token attends to every other position in the encoder. You can use x[:, 0], or x[:, 1], or sum(x, axis=1), or mean(x, axis=1)... whatever you want. But in my experience there is no performance difference; using only x[:, 0] is enough when you train a classification task.
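
A small sketch of those interchangeable pooling choices (sizes assumed from the example above), any of which gives a fixed-size sentence vector to feed into the classification layer:

```python
import torch

x = torch.randn(64, 50, 256)   # encoder output: (batch_size, seq_len, hidden)

first_token = x[:, 0]          # (64, 256) - what NextSentencePrediction uses
mean_pooled = x.mean(dim=1)    # (64, 256) - average over all positions
sum_pooled  = x.sum(dim=1)     # (64, 256) - sum over all positions
```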