KennethEnevoldsen closed 2 months ago
Yes, if we can get an estimate on the number of tokens we need, Tengyu agreed to provide them for us.
Maybe also BM25? Can be done via e.g. BM25Okapi
I have added it to the list
I will be able to add LLM2Vec supervised and unsupervised.
BM25Okapi will be slow. It is also not an exact Okapi implementation per Robertson et al.; I believe the implementation differs in the IDF component in its use of averaging to avoid division by 0 or negative IDF.
I have a new framework, written in Python (np, scipy, optionally Jax/pystemmer), that will be 100x faster. I can release the technical report now if we plan to add it.
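To make the IDF point concrete, here is a minimal sketch comparing the classic Robertson/Spärck Jones IDF with a floored variant. The flooring rule (replace negative values with `epsilon` times the average IDF) and the default `epsilon=0.25` are my assumptions about how such an averaging fix might look, and the function names and toy document frequencies are made up for illustration; this is not copied from any library.

```python
import math


def robertson_idf(n_docs, doc_freq):
    """Classic IDF: negative once a term appears in more than half the docs."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))


def floored_idf(n_docs, doc_freqs, epsilon=0.25):
    """Assumed variant: negative IDFs are replaced by epsilon * average IDF."""
    raw = {term: robertson_idf(n_docs, df) for term, df in doc_freqs.items()}
    average = sum(raw.values()) / len(raw)
    return {term: (value if value >= 0 else epsilon * average)
            for term, value in raw.items()}


# Toy document frequencies (hypothetical): "the" occurs in 9 of 10 documents.
n_docs = 10
doc_freqs = {"the": 9, "retrieval": 2, "okapi": 1}

raw = {term: robertson_idf(n_docs, df) for term, df in doc_freqs.items()}
floored = floored_idf(n_docs, doc_freqs)

for term in doc_freqs:
    print(f"{term:10s} raw={raw[term]:+.3f} floored={floored[term]:+.3f}")
```

With these toy counts the very frequent term gets a negative raw IDF and a small positive floored one, so the two variants can rank documents differently whenever such terms appear in a query.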
Sounds great @xhluca, can I ask you to create the implementation for that one as well? Then we will have all of the desired models implemented (we might add some later on).
Sure!
Btw bm25s is out: https://github.com/xhluca/bm25s
Let me know if I should use a reference implementation from some other baseline (e.g. sentence-transformers) so I can easily add bm25s numbers.
I can also post a technical report now if there's a need to cite it.
@xhluca an implementation of bm25 would be great. You can find a reference implementation in mteb/models.
@xhluca will just ping you here
Sure! I haven't had the chance to look into the reference implementation yet, will do today.
Actually just saw #990, it seems the gist looks pretty decent (have not attempted to run it). Unfortunately it might be a bit hard with the encode_queries/encode_corpus approach since BM25 does not output a representation per se, so using something like dot product is not really straightforward (there are implementations that do that, but I found that they tend to run out of memory on 30GB systems).
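For context, a rough sketch of what the dot-product formulation could look like: each document becomes a sparse vector of precomputed BM25 term weights, the query becomes a bag-of-words count vector, and the score is their dot product. Everything here (function names, the non-negative IDF variant, `k1`/`b` defaults, the toy corpus) is my own illustration and not the gist from #990 or the mteb code. I use dicts as sparse vectors; a dense version needs `n_docs * vocab_size` floats, which is presumably where the memory blow-up comes from.

```python
import math
from collections import Counter


def bm25_doc_vectors(corpus_tokens, k1=1.5, b=0.75):
    """Precompute one sparse {term: BM25 weight} vector per document."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs
    doc_freq = Counter(term for doc in corpus_tokens for term in set(doc))
    # Non-negative IDF variant, chosen here only to keep the sketch simple.
    idf = {term: math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
           for term, n in doc_freq.items()}
    vectors = []
    for doc in corpus_tokens:
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        vectors.append({term: idf[term] * tf * (k1 + 1) / (tf + length_norm)
                        for term, tf in Counter(doc).items()})
    return vectors


def dot_score(query_tokens, doc_vector):
    """Dot product between a query count vector and a sparse document vector."""
    return sum(count * doc_vector.get(term, 0.0)
               for term, count in Counter(query_tokens).items())


# Toy, pre-tokenized corpus (hypothetical).
corpus = [
    ["bm25", "is", "a", "ranking", "function"],
    ["dense", "embeddings", "use", "dot", "product"],
    ["bm25", "scores", "sparse", "terms"],
]
doc_vectors = bm25_doc_vectors(corpus)
scores = [dot_score(["dense", "product"], vec) for vec in doc_vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

Note the asymmetry: all the BM25 weighting lives on the document side, so "encoding" a query is just counting tokens, which does not map cleanly onto a symmetric embedding interface.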
This PR seems resolved - will close it
👍 ; also checked the other ones in your original message (BM25 etc) as they've all been added now 🚀
Models which we need to implement before the full benchmark can be run.
models derived from #705
Works with sentence transformers:
Needs custom loader:
APIs
@x-tabdeveloping I know you reached out regarding tokens at some point?