Closed nguyenvo09 closed 5 years ago
Hi @nguyenvo09,
We provide extract_docs_from_index.py to allow you to easily extract document content from an Anserini or Indri index. (You will need to have an index built for the initial document rankings anyway.) Please refer to Anserini documentation (here) or Indri documentation (here) for help building an index.
If you are using a different index format, we welcome contributions to extract_docs_from_index.py to support additional formats!
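If your documents live somewhere other than an Anserini or Indri index, a small script along the following lines may be enough to produce a tab-separated document file yourself. This is only a sketch: the three-column layout (the literal word doc, then the document ID, then the text) is an assumption that should be checked against the repository's README, and format_row, write_documents_tsv, and the sample IDs are hypothetical names used for illustration, not part of extract_docs_from_index.py.

```python
import io


def format_row(doc_id, text):
    """Build one TSV row; the 'doc<TAB>id<TAB>text' layout is an assumption."""
    # Collapse tabs, newlines, and repeated spaces so the text stays in one field.
    clean = " ".join(text.split())
    return f"doc\t{doc_id}\t{clean}"


def write_documents_tsv(docs, out):
    """Write (doc_id, text) pairs to a file-like object and return the row count."""
    count = 0
    for doc_id, text in docs:
        out.write(format_row(doc_id, text) + "\n")
        count += 1
    return count


if __name__ == "__main__":
    # Hypothetical documents; replace with content read from your own store.
    sample = [("DOC-1", "first  document\ttext"), ("DOC-2", "second\ndocument")]
    buffer = io.StringIO()
    write_documents_tsv(sample, buffer)
    print(buffer.getvalue(), end="")
```

Whitespace inside the text is collapsed so that each document occupies exactly one line and three fields, which keeps the file safe to split on tabs.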
Could you provide a link to download the document.tsv file so that I can quickly run experiments? Thank you very much.
Hi @nguyenvo09,
It is my understanding that I cannot distribute the datasets we used in the CEDR paper per the license agreements. You will need to go through proper channels to obtain them. For Robust04, you will need to sign agreements with NIST to get a copy. Information can be found here. There's a similar process for ClueWeb09 and ClueWeb12 (since these collections are so large, you'll need to pay for the drives and shipment of the drives).
There are also freely-available datasets that you could experiment with that were not in the paper. For instance, MS-MARCO, TREC CAR, and ANTIQUE.
I hope this helps!
When I run the command
I got error like: