MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316

End-to-end integration / testing with leaderboard #932

Open Muennighoff opened 2 weeks ago

Muennighoff commented 2 weeks ago

There are a few things that could be improved re: the leaderboard & codebase integration πŸ€”

cc @orionw who had some great ideas on this & is a GitHub actions wizard πŸͺ„

KennethEnevoldsen commented 2 weeks ago

Completely agree! I would love to merge the two repositories!

orionw commented 2 weeks ago

+1 to this issue @Muennighoff!

I hope to take a look at the latter three of these bullet points at the end of next week: making it easier to add results, mirroring to GitHub, and calculating the leaderboard automatically without refreshes.

We currently cannot automatically test the effect of changes made here on the leaderboard.

I was wondering about this myself - I think adding tests is a great starting place. It is a little tricky because the solution to the latter three involves setting the leaderboard up as a mirror on GitHub and doing automatic pushes, so it would draw from the main branch of wherever we store the results (mteb?). That means anyone working on a branch of mteb won't be able to see the failure until it's already committed.

One potential solution to this is to add another test to mteb that runs some part of the leaderboard processing code. I think this could work, although it is not the cleanest solution (how do we sync that file so that updates to the leaderboard processing code are reflected in that test in mteb, and vice versa?). If others have suggestions I'd be very interested in hearing them!
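
As a rough illustration, such a smoke test might look something like the sketch below. All the names here are hypothetical placeholders, not the actual API: `leaderboard_processing.build_leaderboard_table` stands in for whatever entry point the leaderboard code ends up exposing, and `tests/toy_results` for a small checked-in folder of toy results.

```python
# tests/test_leaderboard_smoke.py -- hypothetical sketch, not the real API.
# Assumes the leaderboard processing code exposes a
# `build_leaderboard_table(results_dir)` entry point, and that a small
# checked-in folder of toy results lives at tests/toy_results.
from pathlib import Path

from leaderboard_processing import build_leaderboard_table  # hypothetical module

TOY_RESULTS = Path(__file__).parent / "toy_results"


def test_leaderboard_processing_runs_without_error():
    # The goal is only to catch breakage early: the processing code should
    # run end to end on the toy results without raising.
    table = build_leaderboard_table(TOY_RESULTS)
    assert len(table) > 0
```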

Muennighoff commented 2 weeks ago

That's amazing! πŸš€

I think ideally there'd be two tests, something like:

  Test 1:

  1. Run a fast model on some datasets to get a result
  2. Add the result to a toy results folder
  3. Check if the leaderboard can fetch from that result folder

  Test 2:

  1. Run a fast model on some datasets to get a result
  2. Turn the results into metadata
  3. Check if the leaderboard can fetch from the metadata

I think we only need to make sure the leaderboard code runs without erroring out, which could likely be done by parametrizing it a bit so that we can feed in the results folder & metadata as parameters for the tests. Anyway, I think the best solution here will become clearer as we advance on the other issues πŸ€”
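
For instance, if the leaderboard entry point is parametrized over its input source, both tests collapse into one parametrized smoke test. This is only a sketch under assumed names: `render_leaderboard` and both fixture paths are placeholders, not the actual API.

```python
# Hypothetical sketch of the two tests above as one parametrized smoke test;
# `render_leaderboard` and both fixture paths are made-up placeholders.
from pathlib import Path

import pytest

from leaderboard_processing import render_leaderboard  # hypothetical

FIXTURES = Path(__file__).parent / "fixtures"


@pytest.mark.parametrize(
    "source",
    [
        FIXTURES / "toy_results",        # test 1: raw result files from a fast model run
        FIXTURES / "toy_metadata.json",  # test 2: the same results turned into metadata
    ],
)
def test_leaderboard_runs_without_erroring(source):
    # Per the plan above, we only check that the code runs without erroring out.
    render_leaderboard(source)
```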

KennethEnevoldsen commented 2 weeks ago

Wherever we store the results (mteb?)

I would add the results to mteb.

To avoid influencing the existing leaderboard too much it might be ideal to keep the existing one as is for now and create a new leaderboard for development.

orionw commented 2 days ago

To avoid influencing the existing leaderboard too much it might be ideal to keep the existing one as is for now and create a new leaderboard for development.

Agreed, as I will likely break it a few times before it's fixed haha. I've created mteb/leaderboard-in-progress, which we can rename once it's synced correctly.

orionw commented 15 hours ago

I've created a mteb/leaderboard GitHub repository which recalculates the leaderboard results daily via GitHub Actions (a full refresh) and syncs them to the Hugging Face Space mteb/leaderboard-in-progress. The one-hour refresh of all models can happen in the background at night while the Space runs virtually instantaneously using those cached files!
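
For anyone curious, the sync step of such a job can be as small as the sketch below; this is not the actual workflow, the `cache` folder name is made up, and the token is assumed to come from a repo secret or cached login.

```python
# sync_space.py -- hedged sketch of the nightly sync step run by the
# GitHub Actions job. Assumes the full refresh has already written the
# recomputed leaderboard tables into ./cache (a made-up folder name).
from huggingface_hub import HfApi

api = HfApi()  # token picked up from a repo secret / cached login
api.upload_folder(
    folder_path="cache",                     # hypothetical local cache dir
    repo_id="mteb/leaderboard-in-progress",  # the development Space
    repo_type="space",
    commit_message="Daily leaderboard refresh",
)
```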

It'd be nice to monitor it for a day or two before making it work on the main Space. For those couple of days, is it okay to pause any new commits to the leaderboard Space? I had to make a large number of refactors and it would be a pain to resolve new conflicts.

Does Saturday/Sunday work for the switchover @Muennighoff? Trying to find a time that will cause the least impact if it goes down for a few hours during the transition and I'm not sure when the most active usage of the space is.

NOTE: this doesn't use the new mteb/results GitHub repo -- what is the status of that? Is that just MMTEB results?

Muennighoff commented 14 hours ago

That's amazing! Your suggestion sounds good to me & we can hold off committing anything for a few days (also cc @tomaarsen). I'm not sure it would even go down, but any date is fine I think.

For the mteb/results GitHub - I think we can start using it? We just need to move over all result files from the mteb/results HF repo and then sync it I think?
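
If it helps, the move itself could be as simple as the sketch below, followed by an ordinary git commit & push in the GitHub clone; the `repo_type` and local path here are assumptions on my end.

```python
# migrate_results.py -- hedged sketch of the one-time move: pull everything
# from the mteb/results repo on the Hugging Face Hub into a local clone of
# the mteb/results GitHub repo, then commit & push with ordinary git.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mteb/results",
    repo_type="dataset",       # assumption: the HF results repo is a dataset repo
    local_dir="results-repo",  # hypothetical path to the local GitHub clone
)
```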

KennethEnevoldsen commented 8 hours ago

Yea, I do think we should start using mteb/results. Instead of having multiple switches, might it not be ideal to also add the interface update to MTEB as well?

Should we also add any updates on how to add models etc.?

tomaarsen commented 8 hours ago

Seems good! I'll abstain from commits on the HF Leaderboard Space for the next few days.

orionw commented 2 hours ago

For the mteb/results GitHub - I think we can start using it? We just need to move over all result files from the mteb/results HF repo and then sync it I think?

Perfect, this seems like an easy thing to update then. I'll sync those also.

Yea I do think we should start using mteb/results. Instead of having multiple switches might it not be ideal to also add the interface update to MTEB as well?

Not sure what you mean by "interface update" - do you mean UI @KennethEnevoldsen? I was planning to just update the end-to-end stuff for now.

Should we also add any updates on how to add models etc.?

I think once everything is switched over this weekend we should turn off PRs to the Spaces and only enable PRs to the GitHub repos (as they will propagate changes). I think it will also make things slightly easier to manage, since the PR features on GitHub are more mature.