A downside of adding these is that they only consist of test components. If folks wanted to train on some of these (at least, the ones that have training data), they'd be out of luck until somebody gets around to adding the full version of the datasets.
I stand corrected on the topic of metadata & other fields. Some is provided for queries and docs under the `metadata` key.
@searchivarius reminds me that the BEIR doc objects ought to be better structured, especially re: the metadata. Most are either `(doc_id, text)` or `(doc_id, title, text)`. A few have (undocumented?) metadata as a dictionary, but we should be able to properly structure these in a custom namedtuple for each particular corpus, as sketched below.
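For illustration, a minimal sketch of what such a corpus-specific namedtuple could look like (the class name and the flattened metadata fields below are hypothetical; the real fields would come from whatever that corpus actually stores in its metadata dictionary):

```python
from typing import NamedTuple

# Hypothetical structured doc type for one BEIR corpus, flattening
# its metadata dictionary into typed, documented fields.
class BeirExampleDoc(NamedTuple):
    doc_id: str
    title: str
    text: str
    url: str        # assumed metadata field; actual fields vary per corpus
    pubmed_id: str  # assumed metadata field; actual fields vary per corpus
```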
Thank you. I am fine with it for now, but if you add more structure in the future, that would be great.
Dataset Information:
BEIR is a suite of benchmarks intended for testing zero-shot transfer.
These would help extend the tool beyond primarily ad-hoc tasks.
Their benchmarks perform their own pre-processing. For identical comparisons, we should use their pre-processing (rather than aliasing to the versions we already have, where there's overlap). It should be easy to support the datasets they make available as downloads; a sketch of their loader follows.
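For reference, this is roughly how their toolkit downloads and loads one of those pre-processed datasets (based on the `beir` package's standard quickstart; the dataset choice here is arbitrary):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Fetch one of BEIR's pre-processed datasets and unpack it locally.
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

# corpus:  doc_id -> {"title": ..., "text": ...}
# queries: query_id -> query text
# qrels:   query_id -> {doc_id: relevance}
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
```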
Links to Resources:
Dataset ID(s):
- `beir` (empty placeholder)
- `beir/msmarco` (docs, queries, qrels) --- the MS MARCO passage collection, dev subset; should correspond with `msmarco-passage/dev`, but there may be differences in pre-processing
- `beir/trec-covid` (docs, queries, qrels) --- the TREC COVID complete benchmark; should correspond with `trec-covid`, but there may be differences in pre-processing. Also, not all metadata is available in their experimental setting, and it only uses the natural-language questions
- `beir/nfcorpus` (docs, queries, qrels) --- NFCorpus; unclear which is the corresponding irds ID, but presumably some filtered portion of the test set?
- `beir/bioasq` (docs, queries, qrels) --- BioASQ; not available for download
- `beir/nq` (docs, queries, qrels) --- another version of the Natural Questions dev dataset; pre-processing differs from `natural-questions/dev` and `dpr-w100/natural-questions/dev`, as the document selection and the query filtering are different
- `beir/hotpot` (docs, queries, qrels) --- HotpotQA
- `beir/fiqa` (docs, queries, qrels) --- FiQA-2018
- `beir/signal1m` (docs, queries, qrels) --- Signal-1M (RT); not available for download
- `beir/trec-news` (docs, queries, qrels) --- TREC News Background Linking; not available for download
- `beir/arguana` (docs, queries, qrels) --- ArguAna counterargument retrieval
- `beir/webis-touche2020` (docs, queries, qrels) --- Touché-2020 conversational argument retrieval
- `beir/cqadupstack` (docs, queries, qrels) --- CQADupStack community question answering
- `beir/quora` (docs, queries, qrels) --- Quora duplicate question identification
- `beir/dbpedia-entity` (docs, queries, qrels) --- DBpedia entity retrieval
- `beir/scidocs` (docs, queries, qrels) --- SCIDOCS citation prediction
- `beir/fever` (docs, queries, qrels) --- FEVER fact verification
- `beir/climate-fever` (docs, queries, qrels) --- Climate-FEVER fact verification on climate topics
- `beir/scifact` (docs, queries, qrels) --- SciFact fact verification from scientific literature

Supported Entities:

- docs
- queries
- qrels
Additional comments/concerns/ideas/etc.
Need to be sure to include both the original dataset citation and the citation to BEIR in the dataset documentation.
Could having several versions of the same dataset cause confusion? The documentation should provide enough information to disambiguate them.
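For illustration, here is how one of the proposed IDs would be used through the standard `ir_datasets` API once the integration lands (the ID is taken from the proposal above and does not exist yet):

```python
import ir_datasets

# Proposed ID; only works once the BEIR integration is added.
dataset = ir_datasets.load("beir/scifact")

for doc in dataset.docs_iter():
    ...  # doc.doc_id, doc.text (and doc.title where available)

for query in dataset.queries_iter():
    ...  # query.query_id, query.text

for qrel in dataset.qrels_iter():
    ...  # qrel.query_id, qrel.doc_id, qrel.relevance
```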