MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316

End-to-end integration / testing with leaderboard #932

Open Muennighoff opened 2 weeks ago

Muennighoff commented 2 weeks ago

There are a few things that could be improved re: the leaderboard & codebase integration πŸ€”

cc @orionw who had some great ideas on this & is a GitHub actions wizard πŸͺ„

KennethEnevoldsen commented 2 weeks ago

Completely agree! I would love to merge the two repositories!

orionw commented 2 weeks ago

+1 to this issue @Muennighoff!

I hope to take a look at the latter three of these bullet points at the end of next week: making it easier to add results, mirroring to GitHub, and calculating the leaderboard automatically without refreshes.

We currently cannot automatically test the effect of changes made here on the leaderboard.

I was wondering about this myself - I think adding tests is a great starting place. It is a little tricky because the solution to the latter three involves setting the leaderboard up as a mirror on GitHub and doing automatic pushes, so it would draw from the main branch of wherever we store the results (mteb?). That means anyone working on a branch of mteb won't be able to see the failure until it's already committed.

One potential solution to this is to add another test to mteb that runs some part of the leaderboard processing code. I think this could work, although it is not the cleanest solution (how do we sync that file so that updates to the leaderboard processing code are reflected in that test in mteb, and vice versa?). If others have suggestions I'd be very interested in hearing them!
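
As a rough illustration, such a smoke test might look something like the sketch below. All the names here are hypothetical placeholders, not the actual API: `leaderboard_processing.build_leaderboard_table` stands in for whatever entry point the leaderboard code ends up exposing, and `tests/toy_results` for a small checked-in folder of toy results.

```python
# tests/test_leaderboard_smoke.py -- hypothetical sketch, not the real API.
# Assumes the leaderboard processing code exposes a
# `build_leaderboard_table(results_dir)` entry point, and that a small
# checked-in folder of toy results lives at tests/toy_results.
from pathlib import Path

from leaderboard_processing import build_leaderboard_table  # hypothetical module

TOY_RESULTS = Path(__file__).parent / "toy_results"


def test_leaderboard_processing_runs_without_error():
    # The goal is only to catch breakage early: the processing code should
    # run end to end on the toy results without raising.
    table = build_leaderboard_table(TOY_RESULTS)
    assert len(table) > 0
```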

Muennighoff commented 2 weeks ago

That's amazing! πŸš€

I think ideally there'd be two tests, something like:

  Test 1:

  1. Run a fast model on some datasets to get a result
  2. Add the result to a toy results folder
  3. Check if the leaderboard can fetch from that result folder

  Test 2:

  1. Run a fast model on some datasets to get a result
  2. Turn the results into metadata
  3. Check if the leaderboard can fetch from the metadata

I think we only need to make sure the leaderboard code runs without erroring out, which could likely be done by parametrizing it a bit so that we can feed in the results folder & metadata as parameters for the tests. Anyway, I think the best solution here will become clearer as we advance on the other issues πŸ€”
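
For instance, if the leaderboard entry point is parametrized over its input source, both tests collapse into one parametrized smoke test. This is only a sketch under assumed names: `render_leaderboard` and both fixture paths are placeholders, not the actual API.

```python
# Hypothetical sketch of the two tests above as one parametrized smoke test;
# `render_leaderboard` and both fixture paths are made-up placeholders.
from pathlib import Path

import pytest

from leaderboard_processing import render_leaderboard  # hypothetical

FIXTURES = Path(__file__).parent / "fixtures"


@pytest.mark.parametrize(
    "source",
    [
        FIXTURES / "toy_results",        # test 1: raw result files from a fast model run
        FIXTURES / "toy_metadata.json",  # test 2: the same results turned into metadata
    ],
)
def test_leaderboard_runs_without_erroring(source):
    # Per the plan above, we only check that the code runs without erroring out.
    render_leaderboard(source)
```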

KennethEnevoldsen commented 2 weeks ago

Wherever we store the results (mteb?)

I would add the results to mteb.

To avoid influencing the existing leaderboard too much it might be ideal to keep the existing one as is for now and create a new leaderboard for development.

orionw commented 2 days ago

To avoid influencing the existing leaderboard too much it might be ideal to keep the existing one as is for now and create a new leaderboard for development.

Agreed, as I will likely break it a few times before it's fixed haha. I've created mteb/leaderboard-in-progress, which we can rename once it's synced correctly.

orionw commented 15 hours ago

I've created a mteb/leaderboard GitHub repository which recalculates the leaderboard results daily via GitHub Actions (a full refresh) and syncs them to the Hugging Face Space mteb/leaderboard-in-progress. The one-hour refresh of all models can happen in the background at night while the Space runs virtually instantaneously using those cached files!
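
For anyone curious, the sync step of such a job can be as small as the sketch below; this is not the actual workflow, the `cache` folder name is made up, and the token is assumed to come from a repo secret or cached login.

```python
# sync_space.py -- hedged sketch of the nightly sync step run by the
# GitHub Actions job. Assumes the full refresh has already written the
# recomputed leaderboard tables into ./cache (a made-up folder name).
from huggingface_hub import HfApi

api = HfApi()  # token picked up from a repo secret / cached login
api.upload_folder(
    folder_path="cache",                     # hypothetical local cache dir
    repo_id="mteb/leaderboard-in-progress",  # the development Space
    repo_type="space",
    commit_message="Daily leaderboard refresh",
)
```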

It'd be nice to monitor it for a day or two before making it work on the main Space. For those couple of days, is it okay to pause any new commits to the leaderboard Space? I had to make a large number of refactors and it would be a pain to resolve new conflicts.

Does Saturday/Sunday work for the switchover @Muennighoff? Trying to find a time that will cause the least impact if it goes down for a few hours during the transition and I'm not sure when the most active usage of the space is.

NOTE: this doesn't use the new mteb/results GitHub repo -- what is the status of that? Is that just MMTEB results?

Muennighoff commented 14 hours ago

That's amazing! Your suggestion sounds good to me & we can hold off committing anything for a few days (also cc @tomaarsen). I'm not sure it would even go down, but any date is fine I think.

For the mteb/results GitHub - I think we can start using it? We just need to move over all result files from the mteb/results HF repo and then sync it I think?
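
If it helps, the move itself could be as simple as the sketch below, followed by an ordinary git commit & push in the GitHub clone; the `repo_type` and local path here are assumptions on my end.

```python
# migrate_results.py -- hedged sketch of the one-time move: pull everything
# from the mteb/results repo on the Hugging Face Hub into a local clone of
# the mteb/results GitHub repo, then commit & push with ordinary git.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="mteb/results",
    repo_type="dataset",       # assumption: the HF results repo is a dataset repo
    local_dir="results-repo",  # hypothetical path to the local GitHub clone
)
```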

KennethEnevoldsen commented 8 hours ago

Yea, I do think we should start using mteb/results. Instead of having multiple switches, might it not be ideal to also add the interface update to MTEB as well?

Should we also add any updates on how to add models etc.?

tomaarsen commented 8 hours ago

Seems good! I'll abstain from commits on the HF Leaderboard Space for the next few days.

orionw commented 2 hours ago

For the mteb/results GitHub - I think we can start using it? We just need to move over all result files from the mteb/results HF repo and then sync it I think?

Perfect, this seems like an easy thing to update then. I'll sync those also.

Yea I do think we should start using mteb/results. Instead of having multiple switches might it not be ideal to also add the interface update to MTEB as well?

Not sure what you mean by "interface update" - do you mean UI @KennethEnevoldsen? I was planning to just update the end-to-end stuff for now.

Should we also add any updates on how to add models etc.?

I think once everything is switched over this weekend we should turn off PRs to the Spaces and only enable PRs to the GitHub repos (as they will propagate changes). I think it will also make things slightly easier to manage, since the PR features on GitHub are more mature.