little question - Githubissues

What it's doing is building a file that maps document identifiers to text. One option is to extract it from an Indri index, if you have one.

These days, you might be better off using a tool like ir-datasets.

```shell
ir_datasets export "disks45/nocr/trec-robust-2004" docs --fields doc_id body > data/robust/documents.tsv
```
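If you'd rather build the file from Python, here is a minimal sketch of the same idea. It assumes a two-column `doc_id<TAB>text` layout, as the export command above suggests; check that against what the code reading the file actually expects. `write_documents_tsv` and the `Doc` stand-in are hypothetical names for illustration, not part of CEDR or ir_datasets.

```python
import io
from collections import namedtuple


def write_documents_tsv(docs, out):
    """Write one `doc_id<TAB>text` line per document and return the count.

    `docs` is any iterable of objects with `doc_id` and `body` attributes
    (an assumption matching the fields named in the export command).
    """
    count = 0
    for doc in docs:
        # Collapse tabs, newlines and runs of whitespace so that each
        # document stays on a single line of the TSV file.
        text = " ".join(doc.body.split())
        out.write(f"{doc.doc_id}\t{text}\n")
        count += 1
    return count


# Hypothetical stand-in documents so the sketch runs without the corpus.
Doc = namedtuple("Doc", ["doc_id", "body"])
sample = [
    Doc("FBIS3-1", "First line\nsecond\tline"),
    Doc("LA010189-0001", "  Another   document "),
]

buf = io.StringIO()
n = write_documents_tsv(sample, buf)
```

With ir_datasets installed and the corpus available locally, you could pass `ir_datasets.load("disks45/nocr/trec-robust-2004").docs_iter()` as `docs` and an open file as `out`. The whitespace handling here is my own choice and may not match what the CLI export does.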

Of course, every tool you use has different rules on how it processes the documents, so YMMV.

Note that this repository is no longer maintained. It's simply a proof-of-concept of CEDR. You're better off using tools like PyTerrier these days.

Georgetown-IR-Lab / cedr