Andras7 / word2vec-pytorch

Extremely simple and fast word2vec implementation with Negative Sampling + Sub-sampling

window size in Word2vecDataset(Dataset) #4

Closed: jackee777 closed this issue 4 years ago

jackee777 commented 4 years ago

Hi, @Andras7. Thank you for your contribution.

I have one question about class Word2vecDataset(Dataset). In __getitem__(self, idx), is the window size handled correctly?

return [(u, v, self.data.getNegatives(v, 5)) for i, u in enumerate(word_ids) for j, v in
                            enumerate(word_ids[max(i - boundary, 0):i + boundary]) if u != v]

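I think the slice word_ids[max(i - boundary, 0):i + boundary] returns the wrong window. The end index of a Python slice is exclusive, so it takes boundary words to the left of position i but only boundary - 1 words to the right. The following version, which uses word_ids[max(i - boundary, 0):i + boundary + 1], may be correct: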

return [(u, v, self.data.getNegatives(v, 5)) for i, u in enumerate(word_ids) for j, v in
                            enumerate(word_ids[max(i - boundary, 0):i + boundary+1]) if u != v]

If the original code is actually correct, I apologize.
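To make the difference between the two slices concrete, here is a minimal sketch on a toy list. The helper name context_window, the toy ids, and the boundary value are assumptions chosen for illustration; they are not taken from the repository.

```python
def context_window(word_ids, i, boundary, inclusive_right):
    """Return the slice of word_ids used as context for position i."""
    start = max(i - boundary, 0)
    # Python slices exclude the end index, so +1 is needed to reach i + boundary.
    stop = i + boundary + 1 if inclusive_right else i + boundary
    return word_ids[start:stop]


word_ids = [10, 11, 12, 13, 14, 15, 16]  # toy ids, one per position
i, boundary = 3, 2                       # center word is 13

original = context_window(word_ids, i, boundary, inclusive_right=False)
proposed = context_window(word_ids, i, boundary, inclusive_right=True)

print(original)  # [11, 12, 13, 14]: two words on the left, one on the right
print(proposed)  # [11, 12, 13, 14, 15]: two words on each side of the center
```

In both cases the center word itself is still inside the slice, which is why the comprehension needs a filter such as u != v.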

One more point, which may not be important and which I am not confident about: should if u != v be changed to if i != j?

Andras7 commented 4 years ago

Hi! You can add +1, you are right, but we should keep the u != v part.
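One observation on the i != j suggestion, offered as a sketch and not as a statement of the maintainer's reasoning: in the comprehension, j counts positions inside the slice, which starts at max(i - boundary, 0), so it cannot be compared with i directly. A position-based filter would have to add the slice offset. The two function names below are hypothetical and the sentence is a toy example.

```python
def pairs_by_value(word_ids, boundary):
    """Filter as in the issue: drop pairs whose ids are equal (u != v)."""
    return [(u, v)
            for i, u in enumerate(word_ids)
            for v in word_ids[max(i - boundary, 0):i + boundary + 1]
            if u != v]


def pairs_by_position(word_ids, boundary):
    """Hypothetical alternative: drop only the center position itself."""
    result = []
    for i, u in enumerate(word_ids):
        start = max(i - boundary, 0)
        for j, v in enumerate(word_ids[start:i + boundary + 1]):
            if start + j != i:  # j is relative to the slice, so add the offset
                result.append((u, v))
    return result


word_ids = [7, 8, 7]  # toy sentence in which id 7 occurs twice
boundary = 2

print(pairs_by_value(word_ids, boundary))
# [(7, 8), (8, 7), (8, 7), (7, 8)]
print(pairs_by_position(word_ids, boundary))
# [(7, 8), (7, 7), (8, 7), (8, 7), (7, 7), (7, 8)]
```

The value-based filter also drops pairs in which a repeated word appears in its own context, such as (7, 7), while the position-based one keeps them. Which behavior is preferable is a modeling choice; the thread settles on keeping u != v.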

jackee777 commented 4 years ago

Thank you for the response.

> You can add +1, you are right, but we should keep the u != v part.

Ok, I understand :)