allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0
321 stars, 43 forks

beir suite #58

Open seanmacavaney opened 3 years ago

seanmacavaney commented 3 years ago

Dataset Information:

BEIR is a suite of benchmarks intended for testing zero-shot transfer.

These would help extend the tool beyond primarily ad-hoc tasks.

Their benchmarks apply their own pre-processing. For directly comparable results, we should use their pre-processed versions (rather than aliasing to the versions we already have, where there's overlap). It should be easy to support the datasets they make available as downloads.

Links to Resources:

Dataset ID(s):

Supported Entities

Additional comments/concerns/ideas/etc.

Need to be sure to include both the original dataset citation and the BEIR citation in the dataset documentation.

Could having several versions of the same dataset cause confusion? The documentation should provide information to disambiguate.

seanmacavaney commented 3 years ago

A downside of adding these is that they only consist of test components. If folks wanted to train on some of these (at least, the ones that have training data), they'd be out of luck until somebody gets around to adding the full version of the datasets.

seanmacavaney commented 3 years ago

I stand corrected on the topic of metadata & other fields. Some metadata is provided for queries and docs under the metadata key.

seanmacavaney commented 3 years ago

@searchivarius reminds me that the BEIR doc objects ought to be better structured, especially regarding the metadata.

Most are either (doc_id, text) or (doc_id, title, text). A few have (undocumented?) metadata as a dictionary, but we should be able to properly structure these in a custom namedtuple for each such corpus.
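
A minimal sketch of what that structuring could look like. Everything here is illustrative rather than the actual ir_datasets implementation: the class names, the `parse_doc` helper, and the `url` and `pubmed_id` metadata fields are hypothetical, and the record keys (`_id`, `title`, `text`, `metadata`) are an assumption about how each JSON line of a corpus file is laid out.

```python
import json
from typing import NamedTuple


class GenericDoc(NamedTuple):
    """Corpora that only provide (doc_id, text)."""
    doc_id: str
    text: str


class TitledDoc(NamedTuple):
    """Corpora that provide (doc_id, title, text)."""
    doc_id: str
    title: str
    text: str


class ExampleMetadataDoc(NamedTuple):
    """Hypothetical corpus-specific type: the metadata dictionary is
    flattened into named fields (`url`, `pubmed_id` are made up here)."""
    doc_id: str
    title: str
    text: str
    url: str
    pubmed_id: str


def parse_doc(line):
    """Turn one JSON line into the most specific doc type that fits.

    Assumes keys `_id`, `text`, and optionally `title` and `metadata`.
    """
    record = json.loads(line)
    metadata = record.get("metadata") or {}
    if metadata:
        # Flatten the dictionary into explicit, documented fields.
        return ExampleMetadataDoc(
            doc_id=record["_id"],
            title=record.get("title", ""),
            text=record["text"],
            url=metadata.get("url", ""),
            pubmed_id=metadata.get("pubmed_id", ""),
        )
    if record.get("title"):
        return TitledDoc(record["_id"], record["title"], record["text"])
    return GenericDoc(record["_id"], record["text"])
```

In practice the doc type would be fixed per corpus rather than chosen per record; the per-record dispatch above is only there to show all three shapes side by side.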

searchivarius commented 3 years ago

Thank you. I am currently fine with it, but if you add more structure in the future, that would be great.