A downside of adding these is that they only consist of test components. If folks wanted to train on some of these (at least, the ones that have training data), they'd be out of luck until somebody gets around to adding the full version of the datasets.
I stand corrected on the topic of metadata & other fields. Some is provided for queries and docs under the `metadata` key.
@searchivarius reminds me that the BEIR doc objects ought to be better structured, especially re: the metadata. Most are either `(doc_id, text)` or `(doc_id, title, text)`. A few have (undocumented?) metadata as a dictionary, but we should be able to properly structure these in a custom namedtuple for each particular corpus, as sketched below.
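For illustration, a minimal sketch of what such a corpus-specific namedtuple could look like (the class name and the flattened metadata fields below are hypothetical; the real fields would come from whatever that corpus actually stores in its metadata dictionary):

```python
from typing import NamedTuple

# Hypothetical structured doc type for one BEIR corpus, flattening
# its metadata dictionary into typed, documented fields.
class BeirExampleDoc(NamedTuple):
    doc_id: str
    title: str
    text: str
    url: str        # assumed metadata field; actual fields vary per corpus
    pubmed_id: str  # assumed metadata field; actual fields vary per corpus
```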
Thank you. I am fine with it for now, but if you add more structure in the future, that would be great.
Dataset Information:
BEIR is a suite of benchmarks intended for testing zero-shot transfer.
These would help extend the tool beyond primarily ad-hoc tasks.
Their benchmarks perform their own pre-processing. For identical comparisons, we should use their pre-processing (rather than aliasing to the versions we already have, where there's overlap). It should be easy to support the datasets they make available as downloads; a sketch of their loader follows.
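For reference, this is roughly how their toolkit downloads and loads one of those pre-processed datasets (based on the `beir` package's standard quickstart; the dataset choice here is arbitrary):

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

# Fetch one of BEIR's pre-processed datasets and unpack it locally.
dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

# corpus:  doc_id -> {"title": ..., "text": ...}
# queries: query_id -> query text
# qrels:   query_id -> {doc_id: relevance}
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
```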
Links to Resources:
Dataset ID(s):
- `beir` (empty placeholder)
- `beir/msmarco` (docs, queries, qrels) --- the MS MARCO passage collection, dev subset; should correspond with `msmarco-passage/dev`, but there may be differences in pre-processing
- `beir/trec-covid` (docs, queries, qrels) --- the TREC COVID complete benchmark; should correspond with `trec-covid`, but there may be differences in pre-processing. Also, not all metadata is available in their experimental setting, and it only uses the natural-language questions
- `beir/nfcorpus` (docs, queries, qrels) --- NFCorpus; unclear which is the corresponding irds ID, but presumably some filtered portion of the test set?
- `beir/bioasq` (docs, queries, qrels) --- BioASQ; not available for download
- `beir/nq` (docs, queries, qrels) --- another version of the Natural Questions dev dataset; pre-processing differs from `natural-questions/dev` and `dpr-w100/natural-questions/dev`, as the document selection and the query filtering are different
- `beir/hotpot` (docs, queries, qrels) --- HotpotQA
- `beir/fiqa` (docs, queries, qrels) --- FiQA-2018
- `beir/signal1m` (docs, queries, qrels) --- Signal-1M (RT); not available for download
- `beir/trec-news` (docs, queries, qrels) --- TREC News Background Linking; not available for download
- `beir/arguana` (docs, queries, qrels) --- ArguAna counterargument retrieval
- `beir/webis-touche2020` (docs, queries, qrels) --- Touché-2020 conversational argument retrieval
- `beir/cqadupstack` (docs, queries, qrels) --- CQADupStack community question answering
- `beir/quora` (docs, queries, qrels) --- Quora duplicate question identification
- `beir/dbpedia-entity` (docs, queries, qrels) --- DBpedia entity retrieval
- `beir/scidocs` (docs, queries, qrels) --- SCIDOCS citation prediction
- `beir/fever` (docs, queries, qrels) --- FEVER fact verification
- `beir/climate-fever` (docs, queries, qrels) --- Climate-FEVER fact verification on climate topics
- `beir/scifact` (docs, queries, qrels) --- SciFact fact verification from scientific literature

Supported Entities:

- docs
- queries
- qrels
Additional comments/concerns/ideas/etc.
Need to be sure to include both the original dataset citation and the citation to BEIR in the dataset documentation.
Could having several versions of the same dataset cause confusion? The documentation should provide enough information to disambiguate them.
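For illustration, here is how one of the proposed IDs would be used through the standard `ir_datasets` API once the integration lands (the ID is taken from the proposal above and does not exist yet):

```python
import ir_datasets

# Proposed ID; only works once the BEIR integration is added.
dataset = ir_datasets.load("beir/scifact")

for doc in dataset.docs_iter():
    ...  # doc.doc_id, doc.text (and doc.title where available)

for query in dataset.queries_iter():
    ...  # query.query_id, query.text

for qrel in dataset.qrels_iter():
    ...  # qrel.query_id, qrel.doc_id, qrel.relevance
```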