epfml / sent2vec

General purpose unsupervised sentence representations

Comparison of Sent2Vec performance with FastText #19

Closed prerna135 closed 6 years ago

prerna135 commented 6 years ago

@martinjaggi @mpagli @guptaprkhr @menshikh-iv I've been working on a native implementation of Sent2Vec in Gensim. During benchmarking, I came across some unexpected results. While Sent2Vec clearly outperforms Doc2Vec, the average of the FastText word vectors in a sentence gives better results on various supervised and unsupervised tasks than the Sent2Vec vector for the same sentence. So, is there a problem with the hyperparameter values, or are these results to be expected?
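For concreteness, the comparison I'm running is essentially the following (a minimal sketch on toy data; the Sent2Vec model file and all hyperparameter values are placeholders):

```python
import numpy as np
from gensim.models import FastText
import sent2vec  # this repo's Python bindings

# Toy corpus standing in for the real training data.
corpus = [["the", "cat", "sat"], ["dogs", "bark", "loudly"]]
ft = FastText(sentences=corpus, vector_size=100, min_count=1, epochs=5)

def avg_fasttext(tokens):
    # FastText baseline: mean of the word vectors in a sentence.
    return np.mean([ft.wv[w] for w in tokens], axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sent2Vec side; the model file is a placeholder.
s2v = sent2vec.Sent2vecModel()
s2v.load_model("sent2vec_model.bin")

sent_a, sent_b = "the cat sat", "dogs bark loudly"
print(cosine(avg_fasttext(sent_a.split()), avg_fasttext(sent_b.split())))
print(cosine(np.ravel(s2v.embed_sentence(sent_a)),
             np.ravel(s2v.embed_sentence(sent_b))))
```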

martinjaggi commented 6 years ago

@prerna135 thanks a lot for sharing this. and it's very nice that gensim is working on a unified, native implementation of this important method family. here are some suggestions:

for a fair comparison, you really can't do unsupervised word representation learning on a small corpus of just 314 documents, and even less so sentence representation learning. one should train all models on at least the full toronto books corpus, or ideally an even larger one.

with fasttext, another important point is that its word vectors are character-based, which gives strong advantages on out-of-vocabulary words and compounds (as in german, for example), but unfortunately makes such word embeddings incompatible with currently learnt sentence embedding techniques such as sent2vec, doc2vec, skip-thought etc. just averaging them is expected to perform poorly on long sentences/documents, whereas sent2vec vectors are trained precisely so that averaging works well.
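to illustrate the character n-gram point, a quick gensim sketch (toy data; the compound is just an example): fasttext can compose a vector for a token it has never seen, which a plain word-level model cannot.

```python
from gensim.models import FastText, Word2Vec

toy = [["donau", "dampf", "schiff"], ["fahrt", "auf", "dem", "fluss"]]
ft = FastText(sentences=toy, vector_size=32, min_count=1, epochs=5)
w2v = Word2Vec(sentences=toy, vector_size=32, min_count=1, epochs=5)

oov = "donaudampfschiff"          # a compound never seen as a whole token
print(oov in ft.wv.key_to_index)  # False: not in the vocabulary
vec = ft.wv[oov]                  # still works: composed from character n-grams
# w2v.wv[oov]                     # would raise a KeyError: no subword information
```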

for evaluation, have you looked into senteval? could possibly save some of the evaluation coding effort.
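in case it helps, senteval only needs a batcher function that maps a batch of sentences to vectors. a minimal sketch (the task path is a placeholder, and the random baseline stands in for whatever model you're evaluating):

```python
import numpy as np
import senteval  # https://github.com/facebookresearch/SentEval

def prepare(params, samples):
    # build any vocabulary/statistics from the task's data; nothing needed here
    pass

def batcher(params, batch):
    # batch is a list of tokenized sentences; return one vector per sentence.
    # replace this random baseline with your model's sentence embedding.
    return np.vstack([np.random.rand(100) for _ in batch])

params = {"task_path": "SentEval/data", "usepytorch": False, "kfold": 5}
se = senteval.engine.SE(params, batcher, prepare)
print(se.eval(["STS14", "SICKRelatedness"]))
```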

thanks and keep us updated!

mpagli commented 6 years ago

@prerna135 thanks for the great work!

To complete @martinjaggi's answer on fasttext using character n-grams: for a given corpus, the character n-gram overlap will likely be much larger than the word overlap. It is similar to having two text classifiers, one using stemmed tokens and the other using raw tokens: for small datasets, the increased overlap of stemmed tokens will be a huge advantage for the first classifier; only with enough data will you see an advantage from using raw tokens.
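To make the overlap point concrete, a toy sketch (the padding and n-gram length mirror the usual fasttext convention):

```python
def char_ngrams(word, n=3):
    padded = f"<{word}>"  # fasttext-style boundary markers
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

a, b = "walking talked".split(), "walks talks".split()
print(set(a) & set(b))  # set(): zero word overlap

ngrams_a = set().union(*(char_ngrams(w) for w in a))
ngrams_b = set().union(*(char_ngrams(w) for w in b))
print(ngrams_a & ngrams_b)  # shared pieces such as '<wa', 'alk', '<ta'
```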

This, to me, explains why fasttext can perform better in your benchmarks. That being said, these unsupervised methods are meant to take advantage of large volumes of data, and I would suggest using a large corpus such as wikipedia or toronto books in your benchmark.

I hope this is helpful. Let us know if we can assist you in any way.

prerna135 commented 6 years ago

@martinjaggi @mpagli Thanks for the prompt response. SentEval does seem like a very attractive option. I'll try evaluating the models on the Toronto Books Corpus. Also, two different Sent2Vec models are mentioned in the paper: Sent2Vec unigrams and Sent2Vec uni. + bigrams. Is the major difference between the two just the value of the wordNgrams hyperparameter (1 and 2, respectively)?

mpagli commented 6 years ago

The hyperparameters we used to train our models on the toronto corpus are available in the appendix (Table 5). The models using bigrams not only have wordNgrams set to 2 but also use dropout and hashing buckets. Here are some insights on how the hyperparameters behave:

When facing a new training task, I would start by tuning only the learning rate (around 0.2), dropout, number of epochs, and subsampling hyperparameters, keeping the others at standard values. Use the minimum-count params to restrict your vocabulary to the top few hundred thousand most frequent tokens.
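As a concrete starting point, a training call along these lines (the corpus path is a placeholder, and the values are illustrative defaults to tune from, not our exact Table 5 settings):

```python
import subprocess

# Illustrative starting values, not the paper's exact settings (see Table 5).
subprocess.run([
    "./fasttext", "sent2vec",
    "-input", "tokenized_sentences.txt",  # one tokenized sentence per line
    "-output", "my_model",
    "-lr", "0.2",            # start around 0.2 and tune
    "-epoch", "9",           # tune together with the learning rate
    "-t", "0.00001",         # subsampling threshold, also worth tuning
    "-dropoutK", "4",        # n-gram dropout, used by the bigram models
    "-wordNgrams", "2",      # 2 for uni+bigram models, 1 for unigram-only
    "-bucket", "2000000",    # hashing buckets for the n-grams
    "-minCount", "8",        # keeps only reasonably frequent tokens
    "-dim", "700",
], check=True)
```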

One last thing: we tokenized the toronto and wikipedia corpora using stanfordNLP. In both corpora you may have some cleaning to do, as they can contain very long sentences (300+ tokens) which are just crawling errors, e.g. lists of names obtained from wikipedia.
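Something as simple as a length filter already helps (a sketch; the file names are placeholders):

```python
MAX_TOKENS = 300  # very long "sentences" are usually crawling errors

with open("tokenized_sentences.txt") as src, open("cleaned.txt", "w") as dst:
    for line in src:
        if 0 < len(line.split()) <= MAX_TOKENS:
            dst.write(line)
```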

martinjaggi commented 6 years ago

@prerna135 i'm closing this for the moment, but it would be nice if you keep us updated!

BramVanroy commented 4 years ago

@martinjaggi

This is quite a big bump, but would you recommend more epochs for a dataset that is smaller than the Wiki corpus? I fear overfitting, but I am not sure how big a problem that is. Considering there is no sure-fire way to evaluate semantic vectors consistently, I'm at a loss how to decide when to stop training without overfitting. Is there a specific loss to aim for, or a specific number of sentences the model should see during training? If you think it is better, I can open a new issue for this.