allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

TIPSTER corpus #241

Open breuert opened 1 year ago

breuert commented 1 year ago

Hi @seanmacavaney, I am considering reproducing an experiment that uses the TIPSTER corpus from TREC-3 and earlier tracks (https://catalog.ldc.upenn.edu/LDC93T3A). Apparently, the ir_datasets catalog does not feature TIPSTER or any of the earlier tracks. Did you already try to integrate them, and did it cause any problems? Or is there another reason that makes it impossible to integrate them?

As far as I know, the disks were distributed with different naming schemes. For instance, my copy of disks 4 and 5 has lower-cased file names, which differs from the format ir-datasets expects for "disks45/nocr/trec-robust-2004" (I copied the data as is from TREC's CD-ROMs). I remember that this issue was also discussed as part of OSIRRC back in 2019: https://github.com/osirrc/jig/issues/28

Similarly, my TIPSTER Vol. 1 - 3 copies are also lower-cased. Do you have any recommendations on which format to use if I try to add these datasets?
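
A possible workaround for the naming mismatch is to leave the original copy untouched and build a second directory tree of symlinks whose names are re-cased. The following is only a minimal sketch: it assumes the expected layout uses upper-case directory and file names (not verified against ir_datasets here), and the function name `mirror_with_upper_case` is hypothetical.

```python
import os
from pathlib import Path


def mirror_with_upper_case(src_root, dst_root):
    """Mirror src_root into dst_root as symlinks with upper-cased names.

    Assumption (not confirmed by ir_datasets): the expected layout is the
    same tree with every directory and file name in upper case.
    dst_root should live outside src_root so the walk does not visit it.
    """
    src_root = Path(src_root).resolve()
    dst_root = Path(dst_root)
    created = []
    for dirpath, _dirnames, filenames in os.walk(src_root):
        rel_dir = Path(dirpath).relative_to(src_root)
        # Upper-case every component of the relative directory path.
        upper_dir = dst_root.joinpath(*(part.upper() for part in rel_dir.parts))
        upper_dir.mkdir(parents=True, exist_ok=True)
        for name in sorted(filenames):
            link = upper_dir / name.upper()
            # lexists also detects dangling links left by earlier runs.
            if not os.path.lexists(link):
                link.symlink_to(Path(dirpath) / name)
            created.append(link)
    return created
```

Calling `mirror_with_upper_case("/data/tipster", "/data/tipster_upper")` (both paths are placeholders) would leave the original files as they were copied from the CD-ROMs and expose them under re-cased names. This only works on a case-sensitive file system, and creating symlinks may need extra privileges on some platforms.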

Many thanks, Timo

Dataset Information:

The TREC conferences emerged from the TIPSTER Text Program and this corpus is one of the first large-scale datasets that was curated for system evaluations. More information can be found here: https://www-nlpir.nist.gov/related_projects/tipster/trec.htm

Links to Resources:

https://trec.nist.gov/data/topics_eng/index.html
https://trec.nist.gov/data/qrels_eng/index.html
https://www-nlpir.nist.gov/related_projects/tipster/trec.htm
https://catalog.ldc.upenn.edu/LDC93T3A

Dataset ID(s) & supported entities:

(I think other iterations and tracks based on TIPSTER could be added in a similar fashion.)

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.