embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark
https://arxiv.org/abs/2210.07316
Apache License 2.0

Some retrieval datasets don't calculate metadata correctly #964

Open orionw opened 1 week ago

orionw commented 1 week ago

When running task.calculate_metadata_metrics() for retrieval tasks, there are a handful that fail to run (most of the ~130 work though, which is great!)
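A rough sketch of the sweep that produces this list (it assumes get_tasks accepts a task_types filter and that each task exposes task.metadata.name; the actual script may differ):

from mteb import get_tasks

# Run the metadata calculation for every retrieval task and record failures.
failures = {}
for task in get_tasks(task_types=["Retrieval"]):
    try:
        task.calculate_metadata_metrics()
    except Exception as exc:
        failures[task.metadata.name] = repr(exc)

print(f"{len(failures)} retrieval tasks failed:")
for name, err in sorted(failures.items()):
    print(f"  {name}: {err}")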

They are:

Almost all the errors are due to things like:

ValueError: BuilderConfig 'corpus' not found. Available: ['default']

At some point we should resolve this, either by changing the calculate_metadata_metrics function to use a parameter passed in, or by changing the tasks' __init__ functions to define the needed parts.
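To illustrate the first option, the corpus loader could take the config name as a parameter instead of hard-coding "corpus" (a sketch only; the corpus_config parameter is invented here and does not exist in mteb):

from datasets import load_dataset

def load_corpus(hf_repo: str, corpus_config: str | None = "corpus", **load_kwargs):
    # Repos that only ship a "default" config could pass corpus_config=None
    # instead of failing with "BuilderConfig 'corpus' not found".
    return load_dataset(hf_repo, corpus_config, **load_kwargs)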

cc'ing @KennethEnevoldsen as an FYI. The RetrievalStats tab of this Google Sheet has the stats for the current tasks that did succeed. Around 30 will need to be reduced in size, plus potentially some of the above that I do not have stats for.

KennethEnevoldsen commented 1 week ago

Thanks for this, Orion - I think most of these are quite small (SNL, Twitterhjerne, TV2Nord, SweFaq, DanFEVER, NorQuad and Swedn)

but the others might be large enough to need reducing. I will set this to "help wanted" so that someone can grab it.

isaac-chung commented 1 week ago

Clues for whoever takes this: this is reproducible in Python using

from mteb import get_tasks

task = get_tasks(tasks=["NorQuadRetrieval"])[0]
task.calculate_metadata_metrics()

which then gives an error from HFDataLoader._load_corpus(self):

    122 def _load_corpus(self):
    123     if self.hf_repo:
--> 124         corpus_ds = load_dataset(
    125             self.hf_repo,
    126             "corpus",
    127             keep_in_memory=self.keep_in_memory,
    128             streaming=self.streaming,
    129         )

Seems like the method looks for a corpus subset that is not present. Compared to a dataset like NFCorpus, mteb/norquad_retrieval does not have that subset.
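A quick way to confirm which subsets a repo actually exposes is to ask the datasets library directly (the mteb/nfcorpus repo id and the first expected output are assumptions on my part):

from datasets import get_dataset_config_names

print(get_dataset_config_names("mteb/nfcorpus"))           # assumed to include "corpus" and "queries"
print(get_dataset_config_names("mteb/norquad_retrieval"))  # only "default", per the error above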

In fact, the following command is also broken, with the same error as above:

mteb run -t NorQuadRetrieval -m intfloat/multilingual-e5-base --model_revision d13f1b27baf31030b7fd040960d60d909913633f

KennethEnevoldsen commented 1 week ago

Seems like some of these are due to missing load_data() functions. I have added these in #953 to ensure that as much as possible runs for @Muennighoff.

This resolved the issue of the tasks being unable to load for some of the datasets (SNL, Twitterhjerne, TV2Nord, SweFaq, DanFEVER, NorQuad and Swedn), as well as solved the statistics calculation issue (though not for NorQuad). I have to run to another thing so I can't look more into it, but this at least solves about half.
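For whoever picks up the rest: a task-level load_data() that builds the corpus/queries/qrels views out of the single default config would look roughly like the sketch below. This is not the code in #953; the question/context column names, the id scheme, and the use of self.metadata.eval_splits are assumptions.

from datasets import load_dataset

def load_data(self, **kwargs):
    # Sketch of a load_data() override for a QA-style repo that only ships a
    # "default" config; builds corpus/queries/qrels per evaluation split.
    if self.data_loaded:
        return
    self.corpus, self.queries, self.relevant_docs = {}, {}, {}
    for split in self.metadata.eval_splits:
        ds = load_dataset("mteb/norquad_retrieval", split=split)
        corpus, queries, qrels = {}, {}, {}
        for i, row in enumerate(ds):
            doc_id, query_id = f"d{i}", f"q{i}"
            corpus[doc_id] = {"title": "", "text": row["context"]}
            queries[query_id] = row["question"]
            qrels[query_id] = {doc_id: 1}
        self.corpus[split] = corpus
        self.queries[split] = queries
        self.relevant_docs[split] = qrels
    self.data_loaded = True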

isaac-chung commented 1 week ago

I'll take a look at this.