UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

performance numbers on the model overview #2475

Open njjry opened 8 months ago

njjry commented 8 months ago

Hello,

I am reading this webpage https://sbert.net/docs/pretrained_models.html about model comparisons. I am a little confused by the "Performance Sentence Embeddings" and "Performance Semantic Search" results in the table. How are these two metrics measured? What do the numbers mean? Are they the result of comparing the similarity scores to some gold standard and seeing what percentage of the scores match?

Thanks, Lisa

tomaarsen commented 8 months ago

Hello!

First of all, "Performance Sentence Embeddings" refers to the performance of the model across a variety of tasks. Admittedly, I don't know exactly which ones were used, but I suspect it includes classification, clustering, semantic search, etc. "Performance Semantic Search" refers only to semantic search benchmarks, i.e. given a question or search query, how well the model can find relevant text passages through embedding similarity.
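To make "finding relevant passages through embedding similarity" concrete, here is a minimal sketch using the library's `util.semantic_search` helper. The corpus, query, and model name are placeholders for illustration, not the actual benchmark setup:

```python
from sentence_transformers import SentenceTransformer, util

# Placeholder corpus and query; real benchmarks use much larger collections
corpus = [
    "A man is eating food.",
    "A man is riding a horse.",
    "The girl is carrying a baby.",
]
query = "What is the man eating?"

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank corpus passages by cosine similarity to the query embedding
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit["corpus_id"]], hit["score"])
```

A semantic search benchmark then checks whether the truly relevant passages end up at the top of this ranking.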

Secondly, the reported scores are most likely Spearman rank correlations based on cosine similarity, at least for the Semantic Search performance. In a nutshell, this measures how well the similarity of a pair of embeddings tracks the gold standard similarity score for those sentences. The correlation is higher when the predicted similarities and the gold scores agree more closely in their ranking, maxing out at 1 (or 100) if the two values are perfectly monotonically related. In other words, it measures how well a higher predicted similarity indeed corresponds to a higher gold label, giving some confidence that the model's similarity scores are correct/useful.
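Here is a minimal sketch of that computation; the sentence pairs, gold scores (on a hypothetical 0-5 scale), and model name are made up for illustration:

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

# Made-up sentence pairs with gold similarity labels on a 0-5 scale
sentences1 = ["A plane is taking off.", "A man is playing a flute.", "A woman is slicing an onion."]
sentences2 = ["An air plane is taking off.", "A man is playing a guitar.", "Someone is cutting a vegetable."]
gold_scores = [5.0, 1.9, 3.8]

model = SentenceTransformer("all-MiniLM-L6-v2")
emb1 = model.encode(sentences1, convert_to_tensor=True)
emb2 = model.encode(sentences2, convert_to_tensor=True)

# Cosine similarity of each pair, then rank correlation against the gold labels
cos_scores = util.cos_sim(emb1, emb2).diagonal().tolist()
correlation, _ = spearmanr(cos_scores, gold_scores)
print(correlation)  # 1.0 (reported as 100) if the two rankings agree perfectly
```

If you want to run this kind of evaluation yourself, the library's `EmbeddingSimilarityEvaluator` computes essentially this over a full evaluation set.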

For the "Performance Sentence Embeddings", I think it's very possible that the score is just an average of various different scores, even if they have different kinds of measurements. For example, just the average between an accuracy on a classification task, Spearman correlation for a semantic search task, Validity Measure for a clustering task, Normalized Discounted Cumulative Gain @ k for Retrieval, etc.