embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Model which we need to implement #879

Closed KennethEnevoldsen closed 2 months ago

KennethEnevoldsen commented 6 months ago

Models which we need to implement before the full benchmark can be run.

models derived from #705

Works with sentence transformers:

Needs custom loader:

APIs

@x-tabdeveloping I know you reached out regarding tokens at some point?

x-tabdeveloping commented 6 months ago

Yes, if we can get an estimate on the number of tokens we need, Tengyu agreed to provide them for us.

Muennighoff commented 5 months ago

Maybe also BM25? Can be done via e.g. BM25Okapi

KennethEnevoldsen commented 5 months ago

I have added it to the list

vaibhavad commented 5 months ago

I will be able to add LLM2Vec supervised and unsupervised.

xhluca commented 5 months ago

BM25Okapi will be slow. It is also not an exact Okapi implementation per Robertson et al.; I believe it differs in the IDF component, where it uses averaging to avoid division by 0 or negative IDF values.
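To make the IDF point concrete, here is a rough sketch of the difference. It assumes the variant replaces negative IDF values with `epsilon` times the average IDF; the function names, the `epsilon=0.25` default, and the toy document frequencies are illustrative assumptions, not code taken from any library.

```python
import math


def robertson_idf(n_docs, doc_freq):
    """Classic Robertson/Sparck Jones IDF.

    Goes negative once a term appears in more than half of the documents.
    """
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def floored_idf(n_docs, doc_freqs, epsilon=0.25):
    """Assumed variant: negative IDFs are replaced by epsilon * average IDF."""
    raw = {term: robertson_idf(n_docs, df) for term, df in doc_freqs.items()}
    average = sum(raw.values()) / len(raw)
    return {
        term: (value if value >= 0 else epsilon * average)
        for term, value in raw.items()
    }


# Toy collection of 10 documents; "the" occurs in 9 of them.
doc_freqs = {"the": 9, "retrieval": 2, "okapi": 1}
raw = {term: robertson_idf(10, df) for term, df in doc_freqs.items()}
adjusted = floored_idf(10, doc_freqs)

for term in doc_freqs:
    print(f"{term:10s} raw={raw[term]:+.3f} adjusted={adjusted[term]:+.3f}")
```

Only the very frequent term changes: its raw IDF is negative, while the adjusted value is a small positive number that depends on the rest of the vocabulary, which is why the scores are not identical to the original formulation.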

I have a new framework, written in Python (NumPy, SciPy, optionally JAX/PyStemmer), that will be 100x faster. I can release the technical report now if we plan to add it.

KennethEnevoldsen commented 5 months ago

Sounds great @xhluca. Can I ask you to create the implementation for that one as well? Then we will have all of the desired models implemented (we might add some later on).

xhluca commented 5 months ago

Sure!

xhluca commented 5 months ago

Btw bm25s is out: https://github.com/xhluca/bm25s

Let me know if I should use a reference implementation from some other baseline (e.g. sentence-transformers) so I can easily add bm25s numbers.

I can also post a technical report now if there's a need to cite it.

KennethEnevoldsen commented 5 months ago

@xhluca an implementation of bm25 would be great. You can find a reference implementation in mteb/models.

KennethEnevoldsen commented 5 months ago

@xhluca will just ping you here

xhluca commented 5 months ago

Sure! I haven't had the chance to look into the reference implementation yet, will do today.

xhluca commented 5 months ago

Actually, I just saw #990, and the gist looks pretty decent (I have not attempted to run it). Unfortunately, it might be a bit hard to fit into the encode_queries/encode_corpus approach, since BM25 does not output a representation per se, so using something like a dot product is not really straightforward (there are implementations that do that, but I found that they tend to run out of memory on 30GB systems).
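As a rough illustration of why this does not map cleanly onto an encode-then-dot-product interface, here is a minimal BM25 scorer that goes straight from a tokenized query and corpus to one score per document. It is only a sketch: the `bm25_scores` name, the `k1`/`b` defaults, and the Lucene-style non-negative IDF are my own choices for the example, not the code from #990 or from bm25s.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every pre-tokenized document in `corpus` against `query`.

    No per-document vector is produced; the score needs corpus-level
    statistics (document frequencies, average length) at query time.
    """
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            # Lucene-style IDF (assumption): always non-negative.
            idf = math.log(
                1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5)
            )
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


corpus = [
    ["bm25", "is", "lexical"],
    ["dense", "embeddings", "are", "vectors"],
    ["bm25", "scores", "query", "terms"],
]
scores = bm25_scores(["bm25", "scores"], corpus)
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

Expressing the same thing as a dot product would mean materializing a vocabulary-sized vector per document and per query, which is where the memory pressure mentioned above comes from unless the vectors are kept sparse.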

KennethEnevoldsen commented 2 months ago

This PR seems resolved - will close it

Muennighoff commented 2 months ago

👍 ; also checked the other ones in your original message (BM25 etc) as they've all been added now 🚀