Georgetown-IR-Lab / cedr

Code for CEDR: Contextualized Embeddings for Document Ranking, accepted at SIGIR 2019.
MIT License

little question #46

Closed iconmzy closed 1 year ago

iconmzy commented 1 year ago

What does PATH_TO_INDRI_INDEX mean in my own project?

awk '{print $3}' data/robust/*.run | python extract_docs_from_index.py indri PATH_TO_INDRI_INDEX > data/robust/documents.tsv

seanmacavaney commented 1 year ago

That command builds a file that maps document identifiers to document text. One option is to extract the text from an Indri index, if you have one; PATH_TO_INDRI_INDEX is a placeholder for the path to that index on your machine.
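For reference, the awk step in the command above only pulls the third column out of each run file. A rough Python equivalent is sketched below. It assumes the run files are whitespace-separated with the document ID in the third column, as the awk command implies; `read_doc_ids` is a name made up for this illustration and is not part of this repository.

```python
def read_doc_ids(run_paths):
    """Collect the unique document IDs found in column 3 of each run file.

    Assumption: every line is whitespace-separated and the document ID is
    the third field, matching `awk '{print $3}'`.
    """
    doc_ids = set()
    for path in run_paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.split()
                if len(cols) >= 3:  # skip blank or malformed lines
                    doc_ids.add(cols[2])
    return doc_ids
```

Calling `read_doc_ids(glob.glob("data/robust/*.run"))` (after `import glob`) should give the set of IDs whose text needs to be looked up. Unlike the awk pipeline, this version de-duplicates the IDs.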

These days, you might be better off using a tool like ir_datasets:

ir_datasets export "disks45/nocr/trec-robust-2004" docs --fields doc_id body > data/robust/documents.tsv

Of course, every tool you use has different rules on how it processes the documents, so YMMV.
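If you would rather do the export from Python, something like the sketch below should work with the ir_datasets package. `write_docs_tsv`, `export_robust04`, and the `keep_ids` parameter are hypothetical names for this example. It writes the same two columns as the command above (doc_id and body), so adjust the layout if your loading code expects something different. As far as I know, the disks45 collection is not downloaded automatically, so ir_datasets will ask you to supply the TREC disks yourself.

```python
def write_docs_tsv(docs, out, keep_ids=None):
    """Write one `doc_id<TAB>body` line per document and return the count.

    docs:     iterable of objects with `doc_id` and `body` attributes
    out:      writable text file object
    keep_ids: optional set of document IDs; when given, other docs are skipped
    """
    count = 0
    for doc in docs:
        if keep_ids is not None and doc.doc_id not in keep_ids:
            continue
        # Collapse tabs and newlines so each document stays on one TSV line.
        body = " ".join(doc.body.split())
        out.write(f"{doc.doc_id}\t{body}\n")
        count += 1
    return count


def export_robust04(path="data/robust/documents.tsv", keep_ids=None):
    """Export TREC Robust 2004 documents via ir_datasets (pip install ir_datasets)."""
    import ir_datasets  # imported here so write_docs_tsv works without it

    dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
    with open(path, "w", encoding="utf-8") as out:
        return write_docs_tsv(dataset.docs_iter(), out, keep_ids=keep_ids)
```

Passing a set of document IDs as `keep_ids` (for example, the IDs that appear in your run files) keeps the output limited to the documents you actually rerank, which is roughly what the original awk pipeline did.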

Note that this repository is no longer maintained. It's simply a proof-of-concept of CEDR. You're better off using tools like PyTerrier these days.