Georgetown-IR-Lab / cedr

Code for CEDR: Contextualized Embeddings for Document Ranking, accepted at SIGIR 2019.
MIT License

How to get a pre-processed index file for Lucene or Indri #9

Closed JrJessyLuo closed 5 years ago

JrJessyLuo commented 5 years ago

The guideline notes say to be sure to use an index that has appropriate pre-processing, but I don't know how to build the index. I don't have the Robust04 or ClueWeb09 document collections either. I want to replicate the experiments, but I can't because of this. Can anyone help me?

seanmacavaney commented 5 years ago

Hi,

You will first need to get a copy of the document collections. For Robust04, you will need to sign agreements with NIST to get a copy. Information can be found here. There's a similar process for ClueWeb09 and ClueWeb12 (since these collections are so large, you'll need to pay for the drives and shipment of the drives).

Once you have the data, you can extract the document content into the data files however you like. However, since you will probably have built indices anyway, we provide a script that extracts the content from an index (extract_docs_from_index.py). Easy guides for building Anserini indices for both datasets can be found here and here (be sure to use the -storeTransformedDocs flag).
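If you would rather extract the text yourself than go through an index, a rough sketch of writing a data file might look like the following. It assumes a tab-separated layout of `type`, `id`, `text` per line, with `type` being `query` or `doc`; please check the repository README for the exact format before relying on it. The function names (`normalize`, `format_record`, `write_datafile`) are hypothetical and not part of this repository.

```python
def normalize(text):
    # Collapse tabs, newlines and repeated spaces so each record stays on one line.
    return " ".join(text.split())


def format_record(kind, item_id, text):
    # ASSUMPTION: data files are TSV lines of the form "<type>\t<id>\t<text>",
    # where <type> is "query" or "doc". Verify against the README.
    if kind not in ("query", "doc"):
        raise ValueError(f"unexpected record type: {kind!r}")
    return f"{kind}\t{item_id}\t{normalize(text)}"


def write_datafile(records, path):
    # records: iterable of (kind, item_id, text) tuples.
    with open(path, "w", encoding="utf-8") as f:
        for kind, item_id, text in records:
            f.write(format_record(kind, item_id, text) + "\n")
```

For example, `write_datafile([("doc", "some-doc-id", raw_text)], "documents.tsv")` would write one line per document, where `raw_text` is whatever content you extracted from the collection.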

JrJessyLuo commented 5 years ago

OK, I got it. Thank you for your help.