capreolus-ir / capreolus

A toolkit for end-to-end neural ad hoc retrieval
https://capreolus.ai
Apache License 2.0

Custom training data for re-rankers #128

Closed — vrdn-23 closed this issue 3 years ago

vrdn-23 commented 3 years ago

Hey guys, great work with the library! I just had a quick question regarding the rerankers module. Would it be possible to train the re-rankers on our own custom data? Currently I'm not sure where the training data for the re-rankers comes from or where it is stored. Is this something that can be overridden with a command-line argument?

TIA!

andrewyates commented 3 years ago

Thanks! Training data comes from the combination of a Collection module that specifies a document collection, and a Benchmark module that specifies topics (queries), folds, and qrels (relevance labels) to use with a collection.

You can use a reranker with custom data by creating classes for these modules and then setting benchmark.name=yourNewBenchmark when you run the rerank task. (You won't need to specify the collection separately, since the yourNewBenchmark class will have a dependency on yourNewCollection.) The robust04 modules are the simplest reference:
https://github.com/capreolus-ir/capreolus/blob/master/capreolus/collection/robust04.py
https://github.com/capreolus-ir/capreolus/blob/master/capreolus/benchmark/robust04.py
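To give a rough idea, here's a minimal sketch of what such classes might look like, modeled loosely on the robust04 modules linked above. The CustomCollection/CustomBenchmark names, module_name values, and file paths are all illustrative, and the exact attributes (the register decorator, Dependency signature, qrel_file/topic_file/fold_file) should be checked against the linked files rather than taken as-is:

```python
# A hedged sketch of custom Collection/Benchmark modules, modeled on the
# robust04 examples linked above. All names and paths here are illustrative;
# check the linked files for the exact attributes the current code expects.
from pathlib import Path

from capreolus import Dependency
from capreolus.benchmark import Benchmark
from capreolus.collection import Collection


@Collection.register
class CustomCollection(Collection):
    module_name = "customcollection"        # used as collection.name
    collection_type = "TrecCollection"      # must be a type Anserini understands
    generator_type = "DefaultLuceneDocumentGenerator"
    path = "/data/my_trec_formatted_docs"   # illustrative path to your documents


@Benchmark.register
class CustomBenchmark(Benchmark):
    module_name = "custombenchmark"         # used as benchmark.name
    dependencies = [
        Dependency(key="collection", module="collection", name="customcollection")
    ]
    qrel_file = Path("/data/custom.qrels.txt")    # TREC-format relevance labels
    topic_file = Path("/data/custom.topics.txt")  # TREC-format topics (queries)
    fold_file = Path("/data/custom.folds.json")   # train/dev/test qid splits
    query_type = "title"
```

Once the classes are registered (i.e., imported so the decorators run), passing benchmark.name=custombenchmark to the rerank task is the only config change needed; the collection is pulled in through the dependency.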

One potential issue is that the collection needs to be in a format that Anserini understands; this is specified in the collection_type attribute. Same goes for the benchmark's topics_fn, but here only one format is supported. If your file isn't already in the TREC topics format, you can use topic_to_trectxt, which has been tested with Anserini's parsing code.
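For the topics conversion, something along these lines should work. The (qid, title) signature shown for topic_to_trectxt is an assumption based on how it's used; double-check it in capreolus/utils/trec.py:

```python
# A hedged sketch of converting plain queries into the TREC topics format
# using topic_to_trectxt. The (qid, title) signature is assumed; check
# capreolus/utils/trec.py for the actual function definition.
from capreolus.utils.trec import topic_to_trectxt

# Illustrative input: a qid -> query text mapping from your own data.
queries = {"1": "neural ad hoc retrieval", "2": "learning to rank"}

with open("custom.topics.txt", "w") as outf:
    for qid, text in queries.items():
        outf.write(topic_to_trectxt(qid, text))
```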

If you're feeling adventurous, the handling of these issues is improved on the feature/ird2 branch. The branch works, but it isn't documented or finished yet. The IRDCollection and IRDBenchmark classes would be the place to start there.

vrdn-23 commented 3 years ago

Aah, I think I got it. Quick follow-up: if I wanted to keep a built-in collection but use a different set of training samples, I would have to register a new Sampler module, right? Right now I can see that the samples are generated randomly, but if I wanted to implement my own way of sampling from the documents, that would be the way to go, correct?

andrewyates commented 3 years ago

Yep, that's right. The benchmark determines the qids that will be used (based on the fold), and the sampler samples query-doc triples to use as training data. If you're using a TF model, there's an additional shuffle that happens inside the TF trainer right now, but I think this could safely be commented out.
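To make that concrete, a custom sampler would follow the same registration pattern as the other modules. This is only a sketch: the method to override and the triple format to yield are assumptions, so compare against the existing samplers in capreolus/sampler/__init__.py before relying on it:

```python
# A hedged sketch of a custom sampler. The registration pattern mirrors the
# other modules; the method overridden and the exact triple format yielded
# are assumptions -- compare with the existing triplet sampler in
# capreolus/sampler/__init__.py before relying on this.
import random

from capreolus.sampler import Sampler


@Sampler.register
class CustomSampler(Sampler):
    """Illustrative sampler that replaces random triple selection."""

    module_name = "customsampler"  # select with sampler.name=customsampler

    def __iter__(self):
        # qid_to_posdocs / pick_* stand in for whatever per-qid state the
        # real prepare() step builds; the names here are hypothetical.
        qids = list(self.qid_to_posdocs)
        while True:
            qid = random.choice(qids)         # swap in your own qid policy here
            posdoc = self.pick_positive(qid)  # hypothetical helper
            negdoc = self.pick_negative(qid)  # hypothetical helper
            yield qid, posdoc, negdoc
```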

vrdn-23 commented 3 years ago

Thanks! That makes sense! Appreciate the quick turnaround!