
Georgetown-IR-Lab / cedr

Code for CEDR: Contextualized Embeddings for Document Ranking, accepted at SIGIR 2019.

MIT License


data #43

Open wangxinzhe123 opened 2 years ago

wangxinzhe123 commented 2 years ago

I want to run this code on other datasets. How can I get .run and .pair files similar to those in /data?

seanmacavaney commented 2 years ago

Hi, the format description of these files is given here: https://github.com/Georgetown-IR-Lab/cedr#getting-started

In short, training pairs are sampled from lines like [query-id] [doc-id] and run files are the standard TREC run format: [query-id] 0 [doc-id] [rank] [score] [runtag]. The latter can be the output of various retrieval systems, and the former can just be sampled from run files (depending on what you want to train with).
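As a rough illustration of "sampled from run files", here is a minimal sketch that parses TREC run lines and keeps the top-k documents per query as training pairs. The function names, the sample data, the choice of top-k as the sampling strategy, and the tab separator in the output are all assumptions made for illustration; compare against the files in /data before relying on it.

```python
import io
from collections import defaultdict


def read_trec_run(lines):
    """Parse TREC run lines into {query_id: [(doc_id, rank, score), ...]}."""
    run = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 6:
            continue  # skip blank or malformed lines
        qid, _, docid, rank, score, _ = parts
        run[qid].append((docid, int(rank), float(score)))
    return run


def top_k_pairs(run, k):
    """Yield (query_id, doc_id) for the k best-ranked documents of each query."""
    for qid, docs in run.items():
        for docid, _, _ in sorted(docs, key=lambda d: d[1])[:k]:
            yield qid, docid


# Hypothetical run data in the format [query-id] 0 [doc-id] [rank] [score] [runtag].
sample_run = """\
q1 0 docA 1 12.5 bm25
q1 0 docB 2 11.0 bm25
q1 0 docC 3 9.7 bm25
q2 0 docD 1 8.2 bm25
"""

run = read_trec_run(io.StringIO(sample_run))
pairs = list(top_k_pairs(run, k=2))

# One "[query-id] [doc-id]" pair per line; the tab separator is an assumption.
pair_lines = [f"{qid}\t{docid}" for qid, docid in pairs]
print("\n".join(pair_lines))
```

To use a real run file, pass an open file handle to `read_trec_run` instead of the `io.StringIO` wrapper, and write `pair_lines` to a file of your choosing.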

wangxinzhe123 commented 2 years ago

Do the .run and .pair files need to be built manually, or are they generated automatically by running some program?

cmacdonald commented 2 years ago

There is also an integration plugin for CEDR using PyTerrier; see https://github.com/terrierteam/pyterrier_bert#cedr-usage (though it's a little more dated compared to other PyTerrier plugins now).

seanmacavaney commented 2 years ago

@wangxinzhe123 -- ultimately how you construct these files depends on your experimental setup. The main questions are: 1) What results do you want CEDR to re-rank? 2) What data do you want CEDR to sample as training data?
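For question 1, if the first-stage results come from a system that does not already write TREC run files, a small converter may be enough. The sketch below assumes the results are available as (query-id, doc-id, score) tuples; the function name, the sample scores, the run tag, and the choice to start ranks at 1 are illustrative assumptions, not part of CEDR.

```python
from collections import defaultdict


def to_trec_run(scored, runtag):
    """Format (query_id, doc_id, score) tuples as TREC run lines.

    Documents are ranked per query by descending score; ranks start at 1
    (an assumption, some systems start at 0).
    """
    by_query = defaultdict(list)
    for qid, docid, score in scored:
        by_query[qid].append((docid, score))

    lines = []
    for qid, docs in by_query.items():
        ranked = sorted(docs, key=lambda d: d[1], reverse=True)
        for rank, (docid, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} 0 {docid} {rank} {score} {runtag}")
    return lines


# Hypothetical first-stage scores, deliberately out of order.
scored = [("q1", "docB", 11.0), ("q1", "docA", 12.5), ("q2", "docD", 8.2)]
run_lines = to_trec_run(scored, runtag="my_bm25")
print("\n".join(run_lines))
```

The resulting lines follow the [query-id] 0 [doc-id] [rank] [score] [runtag] layout described earlier in the thread and can be written to a .run file as-is.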

wangxinzhe123 commented 2 years ago

Excuse me, could you provide the index file containing the IndriBuildIndex parameters?

seanmacavaney commented 2 years ago

That again depends on what experiment you're running -- especially since you mention that you're running it with different datasets.

Since you brought up Indri, here's documentation on it: https://sourceforge.net/p/lemur/wiki/IndriBuildIndex%20Parameters/

I'm not very familiar with Indri, however. I'm happy to help out using PyTerrier though -- especially if you provide some details on what you're trying to do. Here's the documentation on indexing: https://pyterrier.readthedocs.io/en/latest/terrier-indexing.html