epfml / sent2vec

General purpose unsupervised sentence representations

How about considering subwords in sent2vec #23

Closed wugh closed 6 years ago

wugh commented 6 years ago

sent2vec is quite similar to supervised fastText. Have you tried adding subword information for each word, so the model can deal with OOV words?

martinjaggi commented 6 years ago

indeed, subwords could potentially also help in our unsupervised setting. as our code is very close to the fastText one, it should be possible to add that feature. it would be nice if someone contributed it!

wugh commented 6 years ago

@martinjaggi I tried adding sub-word information when building the sentence-level context, but the performance on my dataset decreased a little bit. (The training and test corpora are Chinese.)

performance without sub-words

pearson = 0.3519, spearman = 0.3645, auc = 0.7104

performance with sub-words

pearson = 0.3464, spearman = 0.3486, auc = 0.7013

I'm wondering whether it is actually good to add sub-word information. When the context gets long, the sub-word information becomes quite noisy, because I put all sub-words and words at the same level in one bag of words. And maybe Chinese does not carry much sub-word information. Do you have any ideas about this?

mpagli commented 6 years ago

Thanks for letting us know about your results :).

I'm not sure how you implemented the sub-word information, but I guess you applied the same method fastText uses to create word embeddings, only over the whole sentence. fastText sums the sub-word embeddings to get a word embedding; you might instead be averaging the word and sub-word embeddings over the whole sentence, which in itself is quite different.

Say you have a sentence of 2 words, where the first word has 9 sub-components and the second has 3. In the flat scheme, each component gets a weight of 1/14. I believe it makes sense to average word representations to get sentence embeddings, so each word should be given the same weight (1/2 in this case). That 1/2 should then be shared by the 9 sub-components plus the word embedding of the first word, and similarly the other 1/2 is shared by the elements making up the representation of the second word. Each constituent of the first word would then end up with a weight of 1/20, and each constituent of the second word with 1/8. First summing the sub-components of each word and then averaging the resulting word vectors might be better adapted.
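To make the difference between the two weighting schemes concrete, here is a minimal NumPy sketch of the 2-word example above. All vectors are random placeholders (the dimensions, counts, and names are illustrative assumptions, not sent2vec's actual code): the "flat" scheme averages all 14 components uniformly, while the "hierarchical" scheme averages within each word first and then across words.

```python
import numpy as np

# Toy setup mirroring the example: word 1 has 9 sub-components, word 2 has 3.
# All embeddings are random placeholders of dimension 4.
dim = 4
rng = np.random.default_rng(0)

word1 = rng.normal(size=dim)        # word embedding of word 1
subs1 = rng.normal(size=(9, dim))   # its 9 sub-word embeddings
word2 = rng.normal(size=dim)        # word embedding of word 2
subs2 = rng.normal(size=(3, dim))   # its 3 sub-word embeddings

# Flat bag-of-words: every word and sub-word embedding shares one level,
# so 2 words + 12 sub-words = 14 components, each weighted 1/14.
flat = np.vstack([word1[None], subs1, word2[None], subs2]).mean(axis=0)

# Hierarchical: average each word's constituents first, then average the
# two word vectors. Each of word 1's 10 constituents ends up with weight
# 1/2 * 1/10 = 1/20; each of word 2's 4 constituents with 1/2 * 1/4 = 1/8.
w1 = np.vstack([word1[None], subs1]).mean(axis=0)
w2 = np.vstack([word2[None], subs2]).mean(axis=0)
hierarchical = (w1 + w2) / 2
```

With the hierarchical scheme a word's influence on the sentence vector no longer depends on how many character n-grams it happens to decompose into, which is arguably the behavior one wants when averaging word representations.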

Also, sub-word information should mainly help on tasks involving OOV words; if your evaluation task does not contain many OOVs, you might not see any improvement.

wugh commented 6 years ago

Thank you @mpagli. You are right: my implementation gives every sub-component of the sentence-level bag of words the same weight. But I see that fastText's CBOW does this too, i.e. it shares the same weight across every component of the context.