Open wangxinzhe123 opened 2 years ago
Hi, the format description of these files is given here: https://github.com/Georgetown-IR-Lab/cedr#getting-started
In short, training pairs are sampled from lines like [query-id] [doc-id], and run files are the standard TREC run format: [query-id] 0 [doc-id] [rank] [score] [runtag]. The latter can be the output of various retrieval systems, and the former can just be sampled from run files (depending on what you want to train with).
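As a rough illustration of sampling pairs from a run file, here is a minimal Python sketch. It only relies on the two line formats described above; the function names (`read_trec_run`, `sample_pairs`, `format_pairs`), the uniform sampling strategy, and the tab separator in the pairs output are assumptions for illustration, not something CEDR prescribes.

```python
import random
from collections import defaultdict


def read_trec_run(lines):
    """Parse TREC run lines into {query_id: [doc_id, ...]}, ordered by rank."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _unused, docid, rank, _score, _runtag = parts
        run[qid].append((int(rank), docid))
    return {qid: [d for _, d in sorted(docs)] for qid, docs in run.items()}


def sample_pairs(run, per_query, seed=0):
    """Uniformly sample up to `per_query` documents per query (assumed strategy)."""
    rng = random.Random(seed)
    pairs = []
    for qid, docids in run.items():
        k = min(per_query, len(docids))
        pairs.extend((qid, docid) for docid in rng.sample(docids, k))
    return pairs


def format_pairs(pairs):
    """Render pairs as '[query-id]<TAB>[doc-id]' lines (separator is an assumption)."""
    return [f"{qid}\t{docid}" for qid, docid in pairs]
```

To use it on real files, pass an open file handle for the run file to `read_trec_run` and write the result of `format_pairs` out one line at a time. Which documents you sample (top-k only, all retrieved, etc.) depends on what you want to train with.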
Do the .run and .pair files need to be built manually, or can they be generated automatically by running some program?
There is also a PyTerrier integration plugin for CEDR - see https://github.com/terrierteam/pyterrier_bert#cedr-usage (though it's a little more dated than other PyTerrier plugins now)
@wangxinzhe123 -- ultimately how you construct these files depends on your experimental setup. The main questions are: 1) What results do you want CEDR to re-rank? 2) What data do you want CEDR to sample as training data?
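Whatever system produces the results you want re-ranked, its output needs to end up in the TREC run format mentioned in this thread. A minimal sketch of that conversion, assuming you already have per-query document scores in a plain dictionary (the `to_trec_run` name, the input shape, and the default run tag are hypothetical):

```python
def to_trec_run(results, runtag="my_run"):
    """Convert {query_id: {doc_id: score}} into TREC run lines.

    Each line is: [query-id] 0 [doc-id] [rank] [score] [runtag]
    Documents are ranked by descending score, starting at rank 1.
    """
    lines = []
    for qid, scores in results.items():
        ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
        for rank, (docid, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} 0 {docid} {rank} {score} {runtag}")
    return lines
```

Writing the returned lines to a file, one per line, should give a run file in the same layout as the format string above.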
Excuse me, can you provide the index file containing the IndriBuildIndex parameters?
That again depends on what experiment you're running -- especially since you mention that you're running it with different datasets.
Since you brought up Indri, here's documentation on it: https://sourceforge.net/p/lemur/wiki/IndriBuildIndex%20Parameters/
I'm not very familiar with Indri, however. I'm happy to help out using PyTerrier though -- especially if you provide some details on what you're trying to do. Here's the documentation on indexing: https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html
I want to run this code with other datasets. How can I get .run and .pair files similar to those in /data?