violenil opened this issue 3 months ago
Thanks! For the live arena we're actually using the GCP index, which normalizes first and then takes the dot product, i.e. cosine similarity: https://github.com/embeddings-benchmark/arena/blob/64a8780d596018912905523406621eed62a9a417/retrieval/gcp_index.py#L160
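For readers skimming the thread, the equivalence being relied on can be sketched like this (a minimal NumPy illustration, not the arena's actual index code):

```python
import numpy as np

# After L2-normalizing both vectors, a plain dot product
# equals cosine similarity.
rng = np.random.default_rng(0)
q = rng.normal(size=8)  # query embedding (toy data)
d = rng.normal(size=8)  # document embedding (toy data)

# Cosine similarity computed directly.
cosine = np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d))

# Normalize first, then dot product.
qn = q / np.linalg.norm(q)
dn = d / np.linalg.norm(d)
dot_normalized = np.dot(qn, dn)

assert np.isclose(cosine, dot_normalized)
```

This is why an index that normalizes at ingestion time can use cheap dot-product scoring and still return cosine-similarity rankings.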
We should definitely adapt this for the local index too, but I think it should be done in the models folder, i.e. here: https://github.com/embeddings-benchmark/mteb/tree/main/mteb/models
I think we should probably add Jina to this file: https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/sentence_transformers_models.py and then enable normalization there, so normalization is always applied when the model is loaded via `mteb.get_model(...)`.
- cc @KennethEnevoldsen
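One hedged sketch of what "enable normalization there" could look like, as a wrapper that forces L2-normalized outputs on every `encode` call. The class names here are illustrative, not mteb's actual registry code, and `_DummyEncoder` stands in for a real SentenceTransformer-style model:

```python
import numpy as np

class _DummyEncoder:
    # Stand-in for a SentenceTransformer-style model (hypothetical;
    # the real entry would live in mteb/models/sentence_transformers_models.py).
    def encode(self, sentences, **kwargs):
        rng = np.random.default_rng(42)
        return rng.normal(size=(len(sentences), 4))

class NormalizingWrapper:
    """Wraps a model so every encode() output is L2-normalized."""

    def __init__(self, model):
        self.model = model

    def encode(self, sentences, **kwargs):
        emb = self.model.encode(sentences, **kwargs)
        norms = np.linalg.norm(emb, axis=1, keepdims=True)
        # Clip guards against division by zero for all-zero rows.
        return emb / np.clip(norms, 1e-12, None)

wrapped = NormalizingWrapper(_DummyEncoder())
vecs = wrapped.encode(["hello", "world"])
# Every row now has unit norm, so dot-product scoring equals cosine.
assert np.allclose(np.linalg.norm(vecs, axis=1), 1.0)
```

With a wrapper like this registered for a model, downstream code can score with dot products without caring whether the base model normalizes.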
Yep, that is totally correct: the https://github.com/embeddings-benchmark/mteb/blob/main/mteb/models/ folder is the gold-standard reference for evaluated models.
Hi! Loving the Arena for quick inspection of models :)
I noticed that the retrieval scores are computed as dot products rather than cosine similarity, even though the embeddings are not normalized. I manually added normalization in a local deployment and got significantly different results, at least for the `jinaai/jina-embeddings-v2-base-en` model. Do you think we could add an optional parameter to `model_meta.yml` to normalize embeddings during the `model.encode` call? I'm happy to make a PR.
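A hypothetical shape for such a flag, assuming a per-model YAML entry (the `normalize_embeddings` field name is a proposal here, not an existing mteb option):

```yaml
# model_meta.yml (sketch; normalize_embeddings is a proposed field)
model_name: jinaai/jina-embeddings-v2-base-en
normalize_embeddings: true
```

The loader would then read this flag and L2-normalize outputs inside the encode call, so dot-product scoring matches cosine similarity.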