Georgetown-IR-Lab / cedr

Code for CEDR: Contextualized Embeddings for Document Ranking, accepted at SIGIR 2019.
MIT License

Bug with extract_docs_from_index.py #11

Closed. nguyenvo09 closed this issue 4 years ago

nguyenvo09 commented 4 years ago

When I run the command

awk '{print $3}' data/robust/*.run | python extract_docs_from_index.py indri PATH_TO_INDRI_INDEX > data/robust/documents.tsv

I get the following error:

Traceback (most recent call last):
  File "extract_docs_from_index.py", line 60, in <module>
    main_cli()
  File "extract_docs_from_index.py", line 48, in main_cli
    doc_extractor = INDEX_MAP[args.index_type](args.index_path)
  File "extract_docs_from_index.py", line 8, in indri_doc_extractor
    index = pyndri.Index(path)
  File "/root/anaconda3/lib/python3.6/site-packages/pyndri/__init__.py", line 52, in __init__
    super(Index, self).__init__(*args, **kwargs)
OSError: ../src/Parameters.cpp(469): Couldn't open parameter file 'indri-5.14/manifest' for reading.
It seems I need to build an index first. How can I do that?

seanmacavaney commented 4 years ago

Hi @nguyenvo09,

We provide extract_docs_from_index.py so that you can easily extract document content from an Anserini or Indri index. (You will need an index built for the initial document rankings anyway.) Please refer to the Anserini documentation (here) or the Indri documentation (here) for help building an index.

If you are using a different index format, we welcome contributions to extract_docs_from_index.py to support additional formats!
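As a rough illustration of the point above, here is a minimal sketch of a pre-flight check that fails with a clearer message when the path passed to the script is not a built Indri index. It assumes, based only on the error message in the traceback, that a built Indri index directory contains a file named manifest at its top level. The helper names looks_like_indri_index and check_index_or_exit are hypothetical and are not part of extract_docs_from_index.py.

```python
import os
import sys


def looks_like_indri_index(index_path):
    """Return True if index_path contains a top-level 'manifest' file.

    Assumption: a built Indri index directory has a 'manifest' file,
    as suggested by the OSError in the traceback above.
    """
    return os.path.isfile(os.path.join(index_path, "manifest"))


def check_index_or_exit(index_path):
    """Hypothetical helper: stop early with a readable message."""
    if not looks_like_indri_index(index_path):
        sys.exit(
            f"No Indri index found at {index_path!r} (missing 'manifest'). "
            "Build the index first, then rerun the extraction."
        )
```

One way to use it would be to call check_index_or_exit(path) just before pyndri.Index(path), so that a placeholder such as PATH_TO_INDRI_INDEX or a wrong directory is reported before pyndri raises the OSError.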

nguyenvo09 commented 4 years ago

Could you provide a link to download the documents.tsv file so that I can quickly run experiments? Thank you very much.

seanmacavaney commented 4 years ago

Hi @nguyenvo09,

It is my understanding that the license agreements do not allow me to distribute the datasets we used in the CEDR paper. You will need to go through the proper channels to obtain them. For Robust04, you will need to sign agreements with NIST to get a copy. Information can be found here. There's a similar process for ClueWeb09 and ClueWeb12 (since these collections are so large, you'll also need to pay for the drives and their shipment).

There are also freely available datasets, not used in the paper, that you could experiment with, for instance MS-MARCO, TREC CAR, and ANTIQUE.

I hope this helps!