epfml / sent2vec

General purpose unsupervised sentence representations

How does sent2vec differ from fasttext-supervised #22

Closed thomasahle closed 6 years ago

thomasahle commented 6 years ago

I'm trying to understand the exact difference between sent2vec and running fasttext-supervised on lines like this:

some sentence with words __label__some __label__sentence __label__with __label__words

Reading your paper and code, I think you're holding out of the context on the lhs the word that you're trying to predict on the rhs. E.g. you run

train(sentence with words,  some)
train(some with words,  sentence)
train(some sentence words,  with)
train(some sentence with,  words)

whereas fasttext-supervised would include the rhs word in each of those four calls. Is this correct, or am I missing other differences between the two systems?
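
For concreteness, a minimal sketch (in Python, function name purely illustrative) of the approximation described above: one fasttext-supervised training line per held-out target word.

```python
# Minimal sketch of the approximation described above: for each sentence,
# emit one fasttext-supervised training line per target word, holding that
# word out of the context. Names here are illustrative only.
def heldout_lines(sentence):
    words = sentence.split()
    for i, target in enumerate(words):
        context = words[:i] + words[i + 1:]
        yield " ".join(context) + " __label__" + target

for line in heldout_lines("some sentence with words"):
    print(line)
# sentence with words __label__some
# some with words __label__sentence
# some sentence words __label__with
# some sentence with __label__words
```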

mpagli commented 6 years ago

That is mostly correct: you could approximate the unsupervised task by duplicating each sentence, each time selecting a different word as the label. One difference is that we apply (online) subsampling to the target words, so not every word gets selected as a label. We also apply dropout when using n-grams. Finally, in your example you have bigrams such as "some with" that shouldn't be there; sent2vec handles the n-gram generation properly.
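
A rough illustration of the first and last points above. This is not the actual sent2vec code: the subsampling formula, threshold, and function names are assumptions (word2vec-style subsampling is used here as a stand-in).

```python
import random

# Assumed word2vec-style subsampling: frequent targets are kept with lower
# probability, so not every word ends up being used as a label.
def keep_as_target(word, freq, t=1e-4):
    return random.random() < (t / freq[word]) ** 0.5

# Only bigrams of adjacent words in the original sentence are generated,
# so a pair like "some with" never appears as an n-gram.
def contiguous_bigrams(words):
    return [words[i] + "_" + words[i + 1] for i in range(len(words) - 1)]

words = "some sentence with words".split()
print(contiguous_bigrams(words))
# ['some_sentence', 'sentence_with', 'with_words']

freq = {"some": 0.02, "sentence": 0.001, "with": 0.05, "words": 0.003}
print([w for w in words if keep_as_target(w, freq)])
# a random subset of the words; frequent ones are dropped more often
```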