Closed iconmzy closed 1 year ago
What it's doing is building a file that maps document identifiers to text. One option is to extract it from an Indri index, if you have one.
These days, you might be better off using a tool like ir-datasets.
ir_datasets export "disks45/nocr/trec-robust-2004" docs --fields doc_id body > data/robust/documents.tsv
Of course, every tool you use has different rules on how it processes the documents, so YMMV.
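If you'd rather do the export from Python than from the command line, something like the following is a rough sketch of the same idea. It assumes the `ir_datasets` Python package is installed and that you have access to the TREC disks 4 and 5 data it needs. The helper names `write_documents_tsv` and `export_robust04` are made up for illustration, and collapsing whitespace is my own choice, not necessarily what the original script or the CLI does.

```python
from typing import Iterable, TextIO


def write_documents_tsv(docs: Iterable, out: TextIO) -> int:
    """Write one `doc_id<TAB>text` line per document and return the count.

    Each item in `docs` only needs `doc_id` and `body` attributes, which
    matches the fields requested in the export command above.
    """
    count = 0
    for doc in docs:
        # Assumption: collapse tabs/newlines so each document stays on one line.
        text = " ".join(doc.body.split())
        out.write(f"{doc.doc_id}\t{text}\n")
        count += 1
    return count


def export_robust04(path: str) -> int:
    """Hypothetical wrapper: dump the Robust04 documents to `path`."""
    # Imported here so the helper above works without ir_datasets installed.
    import ir_datasets

    dataset = ir_datasets.load("disks45/nocr/trec-robust-2004")
    with open(path, "w", encoding="utf-8") as out:
        return write_documents_tsv(dataset.docs_iter(), out)
```

Calling `export_robust04("data/robust/documents.tsv")` should produce a file in the same doc_id/text layout, though the exact text may differ slightly from what the Indri route gives you.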
Note that this repository is no longer maintained. It's simply a proof-of-concept of CEDR. You're better off using tools like PyTerrier these days.
What does PATH_TO_INDRI_INDEX mean in my own project?
awk '{print $3}' data/robust/*.run | python extract_docs_from_index.py indri PATH_TO_INDRI_INDEX > data/robust/documents.tsv