UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Performance of the pretrained model #49

Closed. Kyubyong closed this issue 4 years ago.

Kyubyong commented 5 years ago

I ran the following command:

python examples/evaluation_stsbenchmark.py

And I got the following results:

2019-11-06 09:47:12 - Cosine-Similarity :        Pearson: 0.7415 Spearman: 0.7698
2019-11-06 09:47:12 - Manhattan-Distance:        Pearson: 0.7730 Spearman: 0.7712
2019-11-06 09:47:12 - Euclidean-Distance:        Pearson: 0.7713 Spearman: 0.7707
2019-11-06 09:47:12 - Dot-Product-Similarity:    Pearson: 0.7273 Spearman: 0.7270

I'm confused because you reported the best performance as 77.12 (cosine similarity, Spearman). According to the results above, it's 76.98. Please correct me if I'm wrong.
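
For reference, the evaluation that the script performs can be reproduced with the library's EmbeddingSimilarityEvaluator; the sketch below uses a few placeholder sentence pairs and gold scores, whereas evaluation_stsbenchmark.py loads the full STS benchmark test split.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

# Placeholder pairs and gold similarity ratings; the script uses the
# full STS benchmark test split instead.
sentences1 = ["A man is playing a guitar.", "A plane is taking off.", "A woman is eating."]
sentences2 = ["A person plays an instrument.", "An air plane departs.", "Someone is cooking."]
gold_scores = [4.0, 4.8, 1.5]

model = SentenceTransformer("bert-base-nli-mean-tokens")
evaluator = EmbeddingSimilarityEvaluator(sentences1, sentences2, gold_scores)

# Logs Pearson and Spearman correlations for cosine, Manhattan, Euclidean
# and dot-product similarity, as in the output above.
model.evaluate(evaluator)
```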

nreimers commented 5 years ago

Hi @Kyubyong, I started to report the maximum over the cosine / Manhattan / Euclidean / dot-product scores, so the reported 77.12 corresponds to the Manhattan Spearman score of 0.7712 in your run. I'm sorry if it is still mentioned somewhere that the reported scores in the README are from cosine similarity.

Cosine / Manhattan / Euclidean / dot-product are computationally quite comparable, i.e., for an unsupervised task like semantic search it does not really matter, in terms of computational overhead, whether I use cosine similarity, Manhattan distance, Euclidean distance, or dot product. The computation is comparable (sometimes equivalent), and efficient index structures can be created for each metric.
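
A minimal sketch of this point with faiss (assuming it is installed; the embeddings here are random placeholders): dot-product and Euclidean search each map directly onto a flat index, and cosine similarity reduces to a dot product after length-normalizing the embeddings.

```python
import numpy as np
import faiss  # assumed available; any ANN library with IP / L2 metrics works similarly

dim = 768
corpus_emb = np.random.rand(10000, dim).astype("float32")  # placeholder corpus embeddings
query_emb = np.random.rand(5, dim).astype("float32")       # placeholder query embeddings

# Dot-product search: inner-product index.
ip_index = faiss.IndexFlatIP(dim)
ip_index.add(corpus_emb)

# Euclidean search: L2 index.
l2_index = faiss.IndexFlatL2(dim)
l2_index.add(corpus_emb)

# Cosine search: length-normalize, then inner product equals cosine similarity.
faiss.normalize_L2(corpus_emb)
faiss.normalize_L2(query_emb)
cos_index = faiss.IndexFlatIP(dim)
cos_index.add(corpus_emb)

distances, ids = cos_index.search(query_emb, 10)  # top-10 most similar corpus entries per query
```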

For most sentence embedding methods, the choice of cosine / Manhattan / Euclidean / dot-product makes no large difference and the scores are comparable. But for some sentence embedding methods, it makes a big difference.

For example, when I used the XLNet-based models, I got quite bad scores with cosine similarity, about 20 percentage points lower than with Manhattan distance.

In order to eliminate the impact of the distance function, I think it is better to test with several functions (cosine / Manhattan / Euclidean) and to see what works for the selected sentence embedding method. In most cases, the differences are not that large, but in some cases it can play a big role.
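
A minimal sketch of such a comparison, assuming scikit-learn and scipy are available (the sentence pairs and gold scores are placeholders for the full STS benchmark test split):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics.pairwise import (
    paired_cosine_distances,
    paired_euclidean_distances,
    paired_manhattan_distances,
)
from sentence_transformers import SentenceTransformer

# Placeholder data; a real comparison would use the STS benchmark test split.
sentences1 = ["A man is playing a guitar.", "A plane is taking off.", "A woman is eating."]
sentences2 = ["A person plays an instrument.", "An air plane departs.", "Someone is cooking."]
gold_scores = [4.0, 4.8, 1.5]

model = SentenceTransformer("bert-base-nli-mean-tokens")
emb1 = np.asarray(model.encode(sentences1))
emb2 = np.asarray(model.encode(sentences2))

# Turn distances into similarities so that higher always means "more similar".
predicted = {
    "Cosine":    1 - paired_cosine_distances(emb1, emb2),
    "Manhattan": -paired_manhattan_distances(emb1, emb2),
    "Euclidean": -paired_euclidean_distances(emb1, emb2),
    "Dot":       np.sum(emb1 * emb2, axis=1),
}

for name, scores in predicted.items():
    print(f"{name}: Pearson {pearsonr(gold_scores, scores)[0]:.4f}, "
          f"Spearman {spearmanr(gold_scores, scores)[0]:.4f}")
```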

Best regards, Nils Reimers