embeddings-benchmark / arena

Code for the MTEB Arena
https://hf.co/spaces/mteb/arena

Support for Retrieval-only models (ColBERT?) #32

Open bclavie opened 1 month ago

bclavie commented 1 month ago

Hey!

Great job on the Arena! In the era of saturated benchmarks, having an actual large-scale, vibes-based evaluation is very important.

I was wondering, would you entertain adding models that are only available for one of the three categories? I think retrieval is by far the most popular use for embeddings nowadays, so I could see it making sense, but I can also understand if not.

If so, I'd be happy to contribute a ColBERT implementation, as we're working on potential English proofs-of-concept with the ColBERTv2.5 recipe, which I think could be very interesting to try out in this benchmark!

With compression etc., the indexes should also be within the same order of magnitude as those built from 1024-dim dense vectors, so it shouldn't be too much of a storage nightmare.
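For a rough sense of the storage math, here's a back-of-envelope sketch; every number in it is an assumption I'm plugging in for illustration, not a measurement:

```python
# Back-of-envelope index-size math (all numbers are illustrative assumptions):
# one 1024-dim float32 vector per doc vs. ColBERT token vectors under
# PLAID-style 2-bit residual compression.

DENSE_BYTES = 1024 * 4  # 1024-dim float32 -> 4096 B per document

COLBERT_DIM = 128        # typical ColBERT projection dim
TOKENS_PER_DOC = 180     # assumed average passage length in tokens
BITS_PER_DIM = 2         # assumed residual compression setting
BYTES_PER_TOKEN = COLBERT_DIM * BITS_PER_DIM // 8 + 4  # +4 B/token rough overhead
colbert_bytes = TOKENS_PER_DOC * BYTES_PER_TOKEN

print(f"dense:   {DENSE_BYTES} B/doc")
print(f"colbert: {colbert_bytes} B/doc (~{colbert_bytes / DENSE_BYTES:.1f}x dense)")
```

Under those assumptions it comes out to roughly 1.6x the dense index, i.e. comfortably the same order of magnitude.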

Muennighoff commented 1 month ago

We are definitely willing to add models for just one of the categories; we already have BM25 for Retrieval only. Such a contribution would be amazing, looking forward to it! 🚀

bclavie commented 1 month ago

Great, I'll try to have the next version of RAGatouille done in a way that'll make it very friendly to this task!

A quick question in terms of logistics: given that the easiest way to do large-scale retrieval w/ ColBERT is PLAID, would you want a PLAID index built as part of the leaderboard, or would you rather I expose an API endpoint the arena could use to query the collection, as a "demo" to trial the integration first?
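For context, the PLAID-index route would look roughly like the sketch below via RAGatouille; the checkpoint, corpus, and index name are placeholders, and the exact signatures should be treated as assumptions about the current API:

```python
# Rough sketch of building and querying a PLAID index with RAGatouille.
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")  # placeholder checkpoint

docs = ["First passage text ...", "Second passage text ..."]  # placeholder corpus
RAG.index(collection=docs, index_name="arena-demo")  # builds a PLAID index on disk

results = RAG.search(query="what is late interaction?", k=10)
for hit in results:
    print(hit["rank"], round(hit["score"], 2), hit["content"][:80])
```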

Muennighoff commented 1 month ago

The API endpoint is probably easier, since we won't have to keep the index in memory, but otherwise we can also try the PLAID index. In any case, it'd be great to have the code for creating the index in the repo so others can also run it locally if they want.
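For the endpoint option, something like the sketch below is roughly what we'd expect; the route, payload shape, header name, and index path are placeholders rather than an agreed spec:

```python
# Hypothetical shape of a hosted ColBERT search endpoint (FastAPI sketch).
import os

from fastapi import FastAPI, Header, HTTPException
from ragatouille import RAGPretrainedModel

app = FastAPI()
# Load a previously built PLAID index from disk (path is illustrative).
RAG = RAGPretrainedModel.from_index(".ragatouille/colbert/indexes/arena-demo")

@app.post("/search")
def search(query: str, k: int = 10, x_api_key: str = Header(default="")):
    # Shared-secret check so the endpoint doesn't need aggressive rate limiting.
    if x_api_key != os.environ.get("COLBERT_API_KEY", ""):
        raise HTTPException(status_code=403, detail="invalid API key")
    hits = RAG.search(query=query, k=k)
    return [{"rank": h["rank"], "score": h["score"], "text": h["content"]} for h in hits]
```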

bclavie commented 1 month ago

Sounds great! I'll try to get things indexed in the next few days, w/ public indexing code & the indices themselves uploaded to HF. A new release should land around the same time, & I'm very curious whether the vibes eval will match the benchmarks.

I'm a bit new to HF Spaces -- I'm assuming you can set environment variables there? If so, & you don't mind adding it, I'll send over an API key to avoid having to seriously rate-limit the endpoint!

Muennighoff commented 1 month ago

Yes we can set env variables & keep them private (same as done for OpenAI models etc) 👍
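On our side, the call would then look something like this sketch, with the key stored as a Space secret; the env var name, URL, and header are placeholders:

```python
# Sketch of the arena-side request, reading the shared key from an env var.
import os

import requests

resp = requests.post(
    "https://example.com/search",  # placeholder endpoint URL
    params={"query": "what is late interaction?", "k": 10},
    headers={"X-API-Key": os.environ["COLBERT_API_KEY"]},  # placeholder secret name
    timeout=30,
)
resp.raise_for_status()
hits = resp.json()
```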

bclavie commented 4 weeks ago

Quick update on this: I think I've got most of the parts working and will be updating the PR soon. However, I'm of two minds as to whether we want to use this 33M-param model to demo ColBERT, since it'll be the first intro to it for a lot of people 🤔. I'm increasingly finding that it's very strong at k={5, 10, 10+}, but the tradeoff for its size seems to be pretty poor r@1 and mediocre r@3. I know multiple people are currently training ~100M-param variants, so it might be wiser to wait until then?
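For anyone following along, by r@k I mean plain recall@k; here's a quick sketch of the metric (my own helper for illustration, not taken from any eval suite):

```python
# recall@k, macro-averaged over queries: the average fraction of each
# query's relevant docs that appear in the top-k of its ranking.
def recall_at_k(ranked_ids: list[list[str]], relevant_ids: list[set[str]], k: int) -> float:
    per_query = [
        len(set(ranked[:k]) & gold) / len(gold)
        for ranked, gold in zip(ranked_ids, relevant_ids)
        if gold
    ]
    return sum(per_query) / len(per_query) if per_query else 0.0

# A small model can look great at k=10 (the right passage is almost always
# retrieved somewhere in the top 10) while still having weak r@1.
```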

Muennighoff commented 4 weeks ago

What about this one: https://huggingface.co/colbert-ir/colbertv2.0? It seems to be the most downloaded ColBERT model on HF, so probably of most interest to the community, no?