JrJessyLuo closed this issue 5 years ago
Hi,
You will first need to get a copy of the document collections. For Robust04, you will need to sign agreements with NIST to get a copy. Information can be found here. There's a similar process for ClueWeb09 and ClueWeb12 (since these collections are so large, you'll need to pay for the drives and shipment of the drives).
Once you have the data, you can extract the document content into the data files however you like. However, since you will probably have built indices anyway, we provide a script that extracts the content from the indices (extract_docs_from_index.py). Easy guides for building Anserini indices from both datasets can be found here and here (be sure to use the -storeTransformedDocs flag).
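For anyone unsure what the indexing call might look like, here is a rough sketch that only assembles and prints an Anserini IndexCollection command line for a Robust04-style collection. The one flag taken from this thread is -storeTransformedDocs; the launcher path, the TrecCollection and JsoupGenerator class names, the other flags, and the example paths are assumptions based on Anserini documentation from around that time and may differ in your version, so check the linked guides before running anything.

```python
import shlex


def build_index_command(collection_dir, index_dir, threads=8):
    """Assemble a (hypothetical) Anserini IndexCollection call as an argument list."""
    return [
        "sh", "target/appassembler/bin/IndexCollection",  # assumed launcher path
        "-collection", "TrecCollection",   # assumed collection class for Robust04
        "-generator", "JsoupGenerator",    # assumed; the name varies across versions
        "-threads", str(threads),
        "-input", collection_dir,
        "-index", index_dir,
        "-storePositions",                 # assumed, commonly used option
        "-storeDocvectors",                # assumed, commonly used option
        "-storeTransformedDocs",           # the flag named in this thread
    ]


if __name__ == "__main__":
    # Placeholder paths; replace them with your own locations.
    cmd = build_index_command("/path/to/robust04", "indexes/robust04")
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

The function only builds the command so you can inspect it first; once the flags match your Anserini version, the printed line can be pasted into a shell from the Anserini checkout.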
OK, I got it. Thank you for your help.
The guideline notes say to be sure to use an index that has appropriate pre-processing, but I don't know how to build the index file. I don't have the Robust04 or ClueWeb09 document datasets either. I want to replicate the experiment, but I can't work it out because of this. Can anyone help me?