embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Add Similarity Metric Used to leaderboard #766

Closed aminst closed 3 months ago

aminst commented 3 months ago

Hi, thanks for this awesome benchmark.
Would it be possible to add the similarity metric used by each model to the benchmark? As I understand it, the similarity metric a model was trained with determines which metric people should use when storing the generated embeddings in a vector database for later similarity searches. Listing it would help people easily choose the right metric when storing embeddings. I can help add this if it's valuable. Thanks!

KennethEnevoldsen commented 3 months ago

@tomaarsen what are your thoughts on adding this to the leaderboard? My guess is that almost all models use cosine similarity, in which case it wouldn't add much information.

tomaarsen commented 3 months ago

@KennethEnevoldsen I do think it makes sense to show this in the leaderboard for all tasks - I think we currently only say it for STS:

Metric: Spearman correlation based on cosine similarity

But the other tasks primarily (exclusively?) use cosine similarity too. There are some models/tasks that perform a bit better with the (non-normalized) dot product, since it favours longer passages, but they're few and far between and not high on the leaderboard.
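The difference between the two metrics is easy to see with a toy example. A minimal NumPy sketch (the vectors are made up for illustration): cosine similarity normalizes away vector magnitude, while the raw dot product rewards larger-norm vectors, which is why non-normalized dot can favour longer passages.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity: dot product of L2-normalized vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical query and two document embeddings; the second has a
# larger norm, as embeddings of longer passages sometimes do.
query = np.array([1.0, 0.0])
short_doc = np.array([0.9, 0.1])   # well aligned, small norm
long_doc = np.array([3.0, 2.0])    # less aligned, large norm

# Cosine normalizes away magnitude, so alignment wins:
print(cosine_sim(query, short_doc))  # ~0.994
print(cosine_sim(query, long_doc))   # ~0.832

# The raw dot product instead rewards the larger-norm vector:
print(np.dot(query, short_doc))      # 0.9
print(np.dot(query, long_doc))       # 3.0
```

So the same pair of documents can rank in opposite orders depending on the metric, which is exactly why knowing the model's intended metric matters.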

KennethEnevoldsen commented 3 months ago

From my understanding, @aminst is referring to the intended distance metric of the model itself (@aminst, do correct me if I am wrong), not that of the task?

However, I do agree that a model might have been trained with a different metric in mind, and assuming a distance metric seems problematic. Ideally I would let the model supply its distance metric and then simply report the score (e.g. Spearman correlation) for whichever metric the model selects.
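The "report the same score for whatever metric the model selects" idea could look roughly like this. A sketch with hypothetical names, using NumPy and a hand-rolled tie-free Spearman so it stays self-contained; it is not MTEB's actual evaluation code:

```python
import numpy as np

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (no ties)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]

def evaluate_sts(emb1, emb2, gold_scores, similarity_fn):
    """Score an STS-style task with whatever similarity the model declares."""
    sims = np.array([similarity_fn(a, b) for a, b in zip(emb1, emb2)])
    return spearman(sims, gold_scores)

# Hypothetical model-declared similarity functions.
cosine = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = np.dot

rng = np.random.default_rng(0)
emb1 = rng.normal(size=(8, 4))
emb2 = rng.normal(size=(8, 4))
gold = rng.uniform(0, 5, size=8)

# Same reported metric (Spearman), different model-chosen similarity.
print(evaluate_sts(emb1, emb2, gold, cosine))
print(evaluate_sts(emb1, emb2, gold, dot))
```

The leaderboard column would stay comparable (it's always a Spearman correlation); only the similarity function underneath would be the model's own choice.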

aminst commented 3 months ago

@KennethEnevoldsen Yes, that is exactly what I meant. It would be great if the leaderboard also shows the distance metric the model used during training. It would also help people to not misuse the embeddings with a different metric. The use case I have in mind is the following, does it make sense?

  1. Somebody wants to convert their data into vector embeddings and store them in a vector database for later retrieval and semantic search.
  2. The person uses the leaderboard to find the model to use.
  3. Currently, they have to search manually for the right distance metric, information the leaderboard itself could offer.
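Until that information is surfaced, one common workaround when the intended metric is cosine is to L2-normalize embeddings before storing them: cosine similarity of unit vectors equals the plain dot product that many vector databases default to. A quick sketch (nothing MTEB-specific, embeddings are random placeholders):

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit length so dot product == cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(5, 8))  # hypothetical model output
unit = l2_normalize(embeddings)

# Cosine similarity on the raw vectors...
cosine = np.dot(embeddings[0], embeddings[1]) / (
    np.linalg.norm(embeddings[0]) * np.linalg.norm(embeddings[1]))
# ...equals the dot product on the normalized ones.
print(np.dot(unit[0], unit[1]), cosine)
```

This only helps when cosine is the right metric in the first place, which is exactly why surfacing the model's intended metric would be useful.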
tomaarsen commented 3 months ago

Ohh, I see! Yes, that would indeed be optimal. I realised something similar with Sentence Transformers, so in Sentence Transformers v3 it will be possible to configure the similarity function in the model configuration. This will then be used when calling the new SentenceTransformer.similarity or SentenceTransformer.similarity_pairwise methods.

Additionally, ST models will start reporting their similarity function in the model card automatically, e.g. here.

That should help, at least with ST-based models.
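The idea of carrying the similarity function in the model configuration can be mimicked in a few lines, to show what it buys downstream users. A self-contained NumPy sketch of the concept, not the actual Sentence Transformers implementation (the class and function names here are made up):

```python
import numpy as np

# Batched similarity functions a model could declare.
SIMILARITY_FUNCTIONS = {
    "cosine": lambda a, b: (a @ b.T)
        / (np.linalg.norm(a, axis=1, keepdims=True)
           * np.linalg.norm(b, axis=1, keepdims=True).T),
    "dot": lambda a, b: a @ b.T,
}

class ToyModel:
    """Toy stand-in for a model that knows its own similarity metric."""

    def __init__(self, similarity_fn_name="cosine"):
        # The point: the metric travels with the model configuration,
        # so callers never have to guess it.
        self.similarity_fn_name = similarity_fn_name

    def similarity(self, emb1, emb2):
        """Pairwise similarity matrix using the configured function."""
        return SIMILARITY_FUNCTIONS[self.similarity_fn_name](emb1, emb2)

rng = np.random.default_rng(2)
a = rng.normal(size=(2, 4))
b = rng.normal(size=(3, 4))

model = ToyModel("cosine")
print(model.similarity(a, b).shape)  # (2, 3)
```

A benchmark (or a vector-database user) can then call `model.similarity(...)` and automatically get the metric the model was trained for.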

KennethEnevoldsen commented 3 months ago

It sounds like this is something we might consider adding after the additions in ST3. I will leave the issue open, but at the moment we probably won't add it.

KennethEnevoldsen commented 3 months ago

I have added an issue related to using a custom similarity function within the benchmark, but as for the model's own similarity metric, we will probably leave that to the model card.

edit: will close for now, but feel free to re-open the discussion if you believe that there is more to add.