Model which we need to implement

embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

https://arxiv.org/abs/2210.07316

Apache License 2.0


Model which we need to implement #879

Closed KennethEnevoldsen closed 2 months ago

KennethEnevoldsen commented 6 months ago

Models which we need to implement before the full benchmark can be run.

models derived from #705

Works with sentence transformers:

[x] all-MiniLM-*-v2
[x] paraphrase-multilingual-*
[x] all-mpnet-*
[x] LaBSE
[x] LASER

Needs custom loader:

[x] multilingual-e5-* (added in #807 and #876 as test cases)
[x] multilingual-e5-large-instruct (#888)
[x] e5-mistral-7b-instruct (as it was the first to use Mistral+instructions for embedding) (#888)
[x] GritLM-7b (@Muennighoff would anyone from the paper like to submit an implementation for these?)
[x] LLM2Vec supervised and unsupervised (@vaibhavad I imagine your group would like to add these)
[x] BM25 (see comment below)

APIs

[x] voyage-large-2-instruct (as it is first on the MTEB leaderboard right now) (#887)
[x] text-embedding-3-large (due to its relevance in the industry) (#353)

@x-tabdeveloping I know you reached out regarding tokens at some point?

x-tabdeveloping commented 6 months ago

Yes, if we can get an estimate on the number of tokens we need, Tengyu agreed to provide them for us.

Muennighoff commented 5 months ago

Maybe also BM25? Can be done via e.g. BM25Okapi

KennethEnevoldsen commented 5 months ago

I have added it to the list

vaibhavad commented 5 months ago

I will be able to add LLM2Vec supervised and unsupervised.

xhluca commented 5 months ago

BM25Okapi will be slow. It is also not an exact Okapi implementation per Robertson et al.; I believe it differs in the IDF component, in its use of averaging to avoid division by 0 or negative IDF.
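
As a rough illustration of the IDF point above, here is a minimal sketch contrasting the classic Robertson/Spärck Jones IDF with one averaging-based way of avoiding negative values. The function names, the `epsilon` default, and the toy document frequencies are assumptions made for this example; it is not a copy of any library's source.

```python
import math

def robertson_idf(n_docs: int, doc_freq: int) -> float:
    """Classic IDF; goes negative once a term appears in more than half the documents."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def averaged_floor_idf(n_docs: int, doc_freqs: dict, epsilon: float = 0.25) -> dict:
    """Hypothetical variant: replace each negative IDF with epsilon * (mean IDF over the vocabulary).

    Note: if the mean itself is negative, the replacement value is still negative.
    """
    idf = {term: robertson_idf(n_docs, df) for term, df in doc_freqs.items()}
    mean_idf = sum(idf.values()) / len(idf)
    return {term: (value if value >= 0 else epsilon * mean_idf) for term, value in idf.items()}

# Toy vocabulary: "the" occurs in 9 of 10 documents, so its classic IDF is negative.
doc_freqs = {"the": 9, "embedding": 2, "okapi": 1}
adjusted = averaged_floor_idf(10, doc_freqs)
print(adjusted)
```

The rare terms keep their classic IDF; only the very common term is changed, which is why scores from such a variant can differ from a textbook Okapi implementation.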

I have a new framework, written in Python (np, scipy, optionally Jax/pystemmer), that will be 100x faster. I can release the technical report now if we plan to add it.

KennethEnevoldsen commented 5 months ago

Sounds great @xhluca. Can I ask you to create the implementation for that one as well? Then we will have all of the desired models implemented (we might add some later on).

xhluca commented 5 months ago

Sure!

xhluca commented 5 months ago

Btw bm25s is out: https://github.com/xhluca/bm25s

Let me know if I should use a reference implementation from some other baseline (e.g. sentence-transformers) so I can easily add bm25s numbers.

I can also post a technical report now if there's a need to cite it.

KennethEnevoldsen commented 5 months ago

@xhluca an implementation of bm25 would be great. You can find a reference implementation in mteb/models.

KennethEnevoldsen commented 5 months ago

@xhluca will just ping you here

xhluca commented 5 months ago

> @xhluca will just ping you here

Sure! I haven't had the chance to look into the reference implementation yet, will do today.

xhluca commented 5 months ago

Actually, I just saw #990; the gist looks pretty decent (I have not attempted to run it). Unfortunately, it might be a bit hard with the encode_queries/encode_corpus approach, since BM25 does not output a representation per se, so using something like a dot product is not really straightforward (there are implementations that do that, but I found that they tend to run out of memory on 30GB systems).
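
To make the mismatch concrete: one way to force BM25 into an encode-then-dot-product shape is to give every document a vocabulary-sized vector of BM25 term weights and treat the query as a bag of words. Below is a minimal pure-Python sketch, assuming whitespace tokenization and a Lucene-style non-negative IDF; the function names are made up for this example and are not the mteb or bm25s API.

```python
import math
from collections import Counter

def bm25_doc_vectors(corpus, k1=1.5, b=0.75):
    """Return one sparse {term: BM25 weight} vector per tokenized document."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    vectors = []
    for doc in corpus:
        term_freq = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        vectors.append({
            # Non-negative IDF variant (assumption), times the saturated term frequency.
            term: math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            * freq * (k1 + 1) / (freq + length_norm)
            for term, freq in term_freq.items()
        })
    return vectors

def dot_score(query_tokens, doc_vector):
    """Dot product of a bag-of-words query with a sparse document vector."""
    return sum(doc_vector.get(term, 0.0) for term in query_tokens)

corpus = [text.split() for text in [
    "a cat sat on the mat",
    "a dog chased the cat",
    "embeddings are dense vectors",
]]
vectors = bm25_doc_vectors(corpus)
scores = [dot_score("cat mat".split(), vec) for vec in vectors]
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

Kept sparse like this, the vectors stay small. Stored densely, they would form an (n_docs x vocab_size) matrix, which is presumably where the memory goes on large corpora.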

KennethEnevoldsen commented 2 months ago

This PR seems resolved - will close it

Muennighoff commented 2 months ago

👍 ; also checked the other ones in your original message (BM25 etc) as they've all been added now 🚀