embeddings-benchmark / leaderboard

Code for the MTEB leaderboard
https://hf.co/spaces/mteb/leaderboard

Change the leaderboard to better measure OOD performance #41

Open orionw opened 4 days ago

orionw commented 4 days ago

@x-tabdeveloping is working on the new leaderboard here with awesome progress towards making it customizable (e.g. "select your own benchmark").

Along with this, a common theme I heard at SIGIR, on Twitter, and in conversations with others is the complaint that BEIR (and MTEB in general) was supposed to be zero-shot, but now most SOTA models train on all the training sets and use the dev/test sets of BEIR as validation data. This of course makes it trivial to overfit (as also shown by the MTEB Arena).

One way we could better measure OOD performance is to tag certain models as having trained only on "approved" in-domain data while the test data is purely out of domain. This could be something like: training on MS MARCO is allowed, and the evaluation is done on BEIR (minus MS MARCO). The exact specifics of allowed data would need to be worked out (what about synthetic generation, NQ, etc.).
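To make this concrete, here is a rough sketch of what such an eligibility check could look like; the approved list, the task names, and the function below are hypothetical illustrations, not a worked-out policy:

```python
# Hypothetical OOD-eligibility check for the leaderboard. The approved list
# and task names are illustrative assumptions, not an agreed-upon policy.

APPROVED_TRAINING_DATA = {"MSMARCO"}  # what else counts (NQ, synthetic data, ...) is still open

BEIR_TASKS = {
    "MSMARCO", "NQ", "HotpotQA", "DBPedia", "FEVER", "Quora",
    "FiQA2018", "SciFact", "ArguAna", "TRECCOVID", "NFCorpus",
}

# The OOD tab would only score tasks whose training data is not in the approved set.
OOD_EVAL_TASKS = BEIR_TASKS - APPROVED_TRAINING_DATA

def is_ood_eligible(reported_training_sets: set[str]) -> bool:
    """A model qualifies only if everything it reports training or
    validating on is inside the approved list."""
    return set(reported_training_sets) <= APPROVED_TRAINING_DATA

# Example: MS MARCO-only training qualifies; adding HotpotQA does not.
print(is_ood_eligible({"MSMARCO"}))              # True
print(is_ood_eligible({"MSMARCO", "HotpotQA"}))  # False
print(sorted(OOD_EVAL_TASKS))                    # BEIR minus MS MARCO
```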

I do not work for any company that offers embeddings as an API, and for those groups I can see and understand the reasoning behind training on all available good training sets. However, for good science and evaluation, I think we should encourage a distinct split where datasets are not used for validation/training, in order to measure true OOD performance. Otherwise it is getting hard to tell which results are actual improvements and which models are simply better at not filtering the test data out of their training (or at overfitting to the test data by using mini-versions of the test sets for validation).

I believe that the MTEB leaderboard could be a driving force behind this change, if we want it to be. One way would be to make this OOD leaderboard the default, showing only models with approved training data, while of course still keeping a tab where all data is fair game.

However, as someone who doesn't work at these companies I likely have a biased perspective and would love to hear from others.

bwanglzu commented 4 days ago

Could not agree more. Let's take CMTEB as an example: I highly suspect that not only the training sets but also the test sets are being used.

I think some basic descriptive statistics could already help: take the avg/median per task, mark models with suspiciously high scores in the UI, and only clear the flag once the authors disclose their training data and resolve the concern.
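Something like the following could do the flagging; this is just a rough sketch with made-up model names and an arbitrary threshold, not an existing leaderboard feature:

```python
# Flag per-task scores that sit far above the robust center of the field,
# using median and MAD. Threshold and example values are made up.
import statistics

def suspicious_models(scores_by_model: dict[str, float], n_mads: float = 3.0) -> list[str]:
    """Return models whose score exceeds median + n_mads * MAD on a task."""
    values = list(scores_by_model.values())
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values) or 1e-9  # avoid division by zero
    return [m for m, s in scores_by_model.items() if (s - med) / mad > n_mads]

# Toy example: one model is well above the rest on a single task.
task_scores = {"model_a": 41.2, "model_b": 43.0, "model_c": 42.1, "model_d": 58.9}
print(suspicious_models(task_scores))  # ['model_d']
```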

Liuhong99 commented 4 days ago

Agreed! For retrieval, it seems a lot of models use the MS MARCO, NQ, HotpotQA, DBPedia, FEVER, Quora, FiQA, and SciFact training sets, as indicated in their papers or reports. For classification, especially EmotionClassification, I eyeballed the dataset and also asked GPT to label some samples. My rough estimate is that at least 20% of the test-set labels are noisy. It's generally mysterious how models get above 80% accuracy on this task.
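To spell out the arithmetic (my 20% figure is only a rough estimate, not a measurement): a noisy label disagrees with the true label by definition, so even a model that always predicts the true label is capped well below 100%:

```python
# Back-of-the-envelope accuracy ceiling under label noise; the noise rate is
# the rough estimate from above, not a measured value.
noise_rate = 0.20  # assumed fraction of wrong labels in the EmotionClassification test set
ceiling = 1.0 - noise_rate

print(f"Max accuracy for a model that never matches a wrong label: {ceiling:.0%}")  # 80%
# Going above this ceiling requires agreeing with mislabeled examples, which
# suggests either errors correlated with the annotators or exposure to the test labels.
```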

Screenshot of the SFR embedding blog post: [image]

Screenshot of the NV-Embed paper (I think ArguAna only has a test set): [image]