facebookresearch / SentEval

A Python tool for evaluating the quality of sentence embeddings.

Resolve inconsistent comparison on STS Benchmark #62

Closed · sidak closed this 4 years ago

sidak commented 5 years ago

The numbers on the STS Benchmark leaderboard seem to be computed with plain cosine similarity between sentence embeddings (see also Section 8 of the referenced paper).

But several papers report STS Benchmark results after training a classifier on top of the embeddings, as is done via SentEval. This leads to unfair comparisons across methods.
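
For reference, the unsupervised evaluation I mean is roughly the following. This is only a minimal sketch with averaged GloVe vectors; the file path, tokenization, and averaging choices are my own illustration, not SentEval's code:

```python
import io
import numpy as np
from scipy.stats import pearsonr, spearmanr

def load_glove(path):
    # Load GloVe vectors into a dict {word: np.ndarray}
    vectors = {}
    with io.open(path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def sentence_vector(sentence, vectors, dim=300):
    # Bag-of-words embedding: average the vectors of in-vocabulary tokens
    words = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(words, axis=0) if words else np.zeros(dim, dtype=np.float32)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def unsupervised_sts(pairs, gold, vectors):
    # pairs: list of (sentence1, sentence2); gold: human similarity scores in [0, 5]
    sims = [cosine(sentence_vector(s1, vectors), sentence_vector(s2, vectors))
            for s1, s2 in pairs]
    return pearsonr(sims, gold)[0], spearmanr(sims, gold)[0]

# glove = load_glove('glove.840B.300d.txt')  # Common Crawl vectors; path is a placeholder
# pearson, spearman = unsupervised_sts(test_pairs, test_gold, glove)
```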

To give a concrete example, I ran GloVe (Common Crawl) on STSBenchmark via SentEval and got a score of 64.74 on the test set. (This is corroborated by another paper which reports the same score for GloVe via SentEval.)

In contrast, evaluating with cosine similarity alone gives a test-set score of 41.5 (with the Common Crawl vectors) and 40.8 (with the Wikipedia + Gigaword 6B vectors mentioned in the SemEval paper), which matches the scores on the benchmark website above. The same discrepancy would apply to other methods evaluated this way.
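
For comparison, the SentEval route looks roughly like this. It is a sketch based on the usage pattern in SentEval's README, reusing the helpers from the snippet above; the data directory and vector file are placeholders:

```python
import numpy as np
import senteval

PATH_TO_DATA = 'data'                        # placeholder: SentEval data directory
glove = load_glove('glove.840B.300d.txt')    # placeholder path; see the sketch above

def prepare(params, samples):
    return  # nothing to precompute for averaged word vectors

def batcher(params, batch):
    # batch is a list of tokenized sentences; return one embedding per sentence
    return np.vstack([sentence_vector(' '.join(sent), glove) for sent in batch])

params_senteval = {'task_path': PATH_TO_DATA, 'usepytorch': True, 'kfold': 10}
params_senteval['classifier'] = {'nhid': 0, 'optim': 'adam', 'batch_size': 64,
                                 'tenacity': 5, 'epoch_size': 4}

se = senteval.engine.SE(params_senteval, batcher, prepare)
results = se.eval(['STSBenchmark'])  # trains a model on top of the embeddings
print(results['STSBenchmark'])
```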

I agree that SentEval clearly specifies train=1 in the README, but since SentEval is currently the most common evaluation toolkit, I think it would be good to explicitly allow unsupervised evaluation on STS Benchmark, which is what this pull request provides.
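
Purely for illustration, the unsupervised option could then be selected like any other task; the task name below is hypothetical and does not reflect the actual interface added by this pull request:

```python
# Hypothetical task name, for illustration only: an unsupervised variant that
# scores cosine(u, v) against the gold labels instead of training on top.
results = se.eval(['STSBenchmark-unsup'])
print(results)
```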

I think it would be really useful for the community to be aware of this, as it would help keep comparisons consistent. Thanks! 😄

facebook-github-bot commented 5 years ago

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours has expired.

Before we can review or merge your code, we need you to email cla@fb.com with your details so we can update your status.

facebook-github-bot commented 4 years ago

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!