PengFoo / word2vec-pytorch

A Skip-Gram model of Word2vec implemented in pytorch.

subsampling #1

Open Mayar2009 opened 5 years ago

Mayar2009 commented 5 years ago

In `gen_vocab(self)`, we keep only the vocabulary entries whose frequency is >= `self.min_count`, like this:

```python
vocab, word2id, id2word = {}, {}, {}
index = 0
for item_id, freq in vocab_freq_dict.items():
    if freq < self.min_count:
        continue
    vocab[item_id] = freq
    word2id[item_id] = index
    id2word[index] = item_id
    index += 1
return vocab, word2id, id2word, total_word_count, total_sent_count
```

Can you please clarify this function, `gen_subsample_table(self)`?

```python
def gen_subsample_table(self):
    """
    sub sampling rate, higher than that would be sub sampled using
    the word2vec paper using: p(w_i) = 1 - sqrt(sub_sampling / freq)
    the word2vec code using: p(w_i) = 1 - (sqrt(sub_sampling / freq) + sub_sampling / freq)
    we use word2vec code sub sampling method here.
    :return: {word_id: sample_score}
    """
    def sub_sampling(_freq):
        return (self.sub_sampling_t / 1.0 / _freq) ** 0.5 + self.sub_sampling_t / 1.0 / _freq
    # word freq count to word freq ratio
    sub_sample_tbl = {item: freq / 1.0 / self.total_word_count
                      for item, freq in self.vocab.items()
                      if freq / 1.0 / self.total_word_count > self.sub_sampling_t}
    # freq to score
    sub_sample_tbl = {item: sub_sampling(_freq) for item, _freq in sub_sample_tbl.items()}
    # word to id
    sub_sample_tbl = {self.word2id[i]: j for i, j in sub_sample_tbl.items() if j < 1}
    return sub_sample_tbl
```

The nested function `sub_sampling(_freq)` looks like it returns `sqrt(sub_sampling / freq) + sub_sampling / freq`, not `p(w_i) = 1 - (sqrt(sub_sampling / freq) + sub_sampling / freq)`, right?
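The two expressions being compared can be checked numerically. The sketch below is a standalone restatement of the formulas quoted in the docstring, not code from the repository: the threshold `t`, the sample frequency `f`, and the helper names `keep_score` and `discard_prob` are all assumptions made for illustration.

```python
import math

def keep_score(freq_ratio, t):
    """Value the nested sub_sampling() computes: sqrt(t / f) + t / f."""
    return math.sqrt(t / freq_ratio) + t / freq_ratio

def discard_prob(freq_ratio, t):
    """p(w_i) as written in the docstring: 1 - (sqrt(t / f) + t / f)."""
    return 1.0 - keep_score(freq_ratio, t)

t = 1e-3   # assumed threshold, for illustration only
f = 0.04   # assumed relative frequency of a very common word

print(keep_score(f, t))    # about 0.183
print(discard_prob(f, t))  # about 0.817
```

Under this reading, the value stored in the table would be the complement of the docstring's `p(w_i)`, i.e. something closer to a keep probability than a discard probability. Whether the repository intends it that way is exactly what the question asks.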

Why is this condition needed: `if freq / 1.0 / self.total_word_count > self.sub_sampling_t`, given that we already filtered with `if freq < self.min_count:` in `gen_vocab(self)` (the first part of this question)?
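One way to look at the two conditions: `min_count` is a floor on the absolute count, while the second condition compares the relative frequency against a threshold from above, so they select different ends of the vocabulary. A toy sketch follows; every count and parameter value in it is invented and does not come from the repository.

```python
# Toy counts; all values are invented for illustration.
vocab_freq = {"the": 500, "cat": 40, "sat": 8, "zyx": 1}
total_word_count = 1000   # assumed corpus size (larger than the counts listed)
min_count = 5             # assumed rare-word cutoff (absolute count)
sub_sampling_t = 0.01     # assumed sub-sampling threshold (relative frequency)

# Step 1: drop rare words by absolute count, mirroring the gen_vocab excerpt.
vocab = {w: c for w, c in vocab_freq.items() if c >= min_count}

# Step 2: among the survivors, select only the frequent words as
# sub-sampling candidates, mirroring the first dict in gen_subsample_table.
candidates = {w: c / total_word_count for w, c in vocab.items()
              if c / total_word_count > sub_sampling_t}

print(sorted(vocab))       # ['cat', 'sat', 'the']
print(sorted(candidates))  # ['cat', 'the']
```

Here "zyx" is removed by the first filter for being too rare, while "sat" stays in the vocabulary but is not frequent enough to be a sub-sampling candidate.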

What is the meaning of this line?

`sub_sample_tbl = {self.word2id[i]: j for i, j in sub_sample_tbl.items() if j < 1}`
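A numeric check of what the `j < 1` filter does may help here. The score `sqrt(t / f) + t / f` can reach or exceed 1 even when `f > t`. If the score is read as a keep probability (an assumption, not something the repository confirms), such words would never be dropped, so leaving them out of the table loses nothing. The frequencies, the threshold, and the `word2id` mapping below are invented for illustration.

```python
import math

t = 0.01  # assumed sub-sampling threshold
# Relative frequencies, all above t (invented values).
freq_ratio = {"the": 0.5, "cat": 0.04, "dog": 0.02}
word2id = {"the": 0, "cat": 1, "dog": 2}  # hypothetical id mapping

# Same score formula as the nested sub_sampling() in the question.
scores = {w: math.sqrt(t / f) + t / f for w, f in freq_ratio.items()}

# Keep only scores below 1 and re-key the table by word id.
table = {word2id[w]: s for w, s in scores.items() if s < 1}

print(scores["dog"] >= 1)  # True: sqrt(0.5) + 0.5 is about 1.207
print(sorted(table))       # [0, 1]
```

The comprehension therefore does two things at once: it converts word keys to ids through `word2id`, and it discards entries whose score is 1 or more.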

Thank you!

Mayar2009 commented 5 years ago

One more question, please: where is `sub_sampling_table` used? It does not seem to be used anywhere in the code, which is strange.