Georgetown-IR-Lab / cedr

Code for CEDR: Contextualized Embeddings for Document Ranking, accepted at SIGIR 2019.
MIT License

How to process Robust04 dataset? #36

Closed: yiyaxiaozhi closed this issue 3 years ago

yiyaxiaozhi commented 3 years ago

I obtained the Robust04 dataset as two archives: (1) TREC-Disk-4.tar.gz and (2) TREC-Disk-5.tar.gz. After unpacking them, I get many files with names like "LAL010289", each of which contains many documents marked up with HTML-like tags.

Could you give me some advice on what to do next? Should I move the document files into a single folder and install the Indri engine to index them? Thank you very much!

yiyaxiaozhi commented 3 years ago

I see the same question in https://github.com/Georgetown-IR-Lab/cedr/issues/9, so I am closing this one.
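For anyone landing here with the same unpacked files, here is a minimal sketch of how the tagged documents could be split into (id, text) pairs in Python. It assumes the usual TREC layout, where each document is wrapped in `<DOC>...</DOC>` and carries its identifier in `<DOCNO>...</DOCNO>`. The function name `iter_trec_docs` and the sample text are made up for illustration; this is not the preprocessing used by the CEDR authors, and it does not replace building an index.

```python
import re

# Assumed TREC-style markup: one <DOC> block per document, with a <DOCNO> id.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def iter_trec_docs(raw):
    """Yield (docno, plain_text) for each <DOC> block found in raw."""
    for match in DOC_RE.finditer(raw):
        block = match.group(1)
        docno = DOCNO_RE.search(block)
        if docno is None:
            continue  # skip blocks that carry no identifier
        # Drop the id element, strip the remaining tags, collapse whitespace.
        body = DOCNO_RE.sub(" ", block, count=1)
        text = " ".join(TAG_RE.sub(" ", body).split())
        yield docno.group(1), text


# Illustrative sample only; real files hold many such blocks.
sample = """<DOC>
<DOCNO> LA010289-0001 </DOCNO>
<HEADLINE><P>Example headline</P></HEADLINE>
<TEXT><P>First example document.</P></TEXT>
</DOC>
<DOC>
<DOCNO> LA010289-0002 </DOCNO>
<TEXT><P>Second example document.</P></TEXT>
</DOC>
"""

docs = list(iter_trec_docs(sample))
for docno, text in docs:
    print(docno, text, sep="\t")
```

To run this on a real file, read it with an encoding that tolerates non-ASCII bytes, for example `open(path, encoding="latin-1").read()`, and pass the result to `iter_trec_docs`. Whether to index the output with Indri or another engine is covered in the linked issue rather than here.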