Closed vrdn-23 closed 3 years ago
Thanks! Training data comes from the combination of a Collection
module that specifies a document collection, and a Benchmark
module that specifies topics (queries), folds, and qrels (relevance labels) to use with a collection.
You can use a reranker with custom data by creating classes for these modules, and then setting benchmark.name=yourNewBenchmark
when you run the rerank task. (You won't need to specify the collection separately since the yourNewBenchmark class will have a dependency on yourNewCollection.) The robust04 modules are the simplest reference:
https://github.com/capreolus-ir/capreolus/blob/master/capreolus/collection/robust04.py
https://github.com/capreolus-ir/capreolus/blob/master/capreolus/benchmark/robust04.py
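To make the division of labor concrete, here is a conceptual sketch of how the pieces fit together. This is not capreolus's actual API (see the robust04 files above for that); the registry, decorator, class names, and paths below are simplified stand-ins used only to illustrate the idea that modules register under a name and benchmark.name=... selects one.

```python
# Conceptual sketch only -- capreolus's real module system differs in detail.
# BENCHMARK_REGISTRY, register_benchmark, and all paths are hypothetical.
BENCHMARK_REGISTRY = {}

def register_benchmark(cls):
    """Hypothetical decorator mapping a name to a benchmark class."""
    BENCHMARK_REGISTRY[cls.name] = cls
    return cls

class MyCollection:
    """A collection points Anserini at documents in a format it understands."""
    path = "/data/my_docs"
    collection_type = "TrecCollection"  # must be a format Anserini supports

@register_benchmark
class MyBenchmark:
    """A benchmark supplies topics (queries), folds, and qrels, and depends
    on a collection rather than requiring it to be configured separately."""
    name = "yourNewBenchmark"
    collection = MyCollection          # dependency on the collection
    topics_fn = "/data/topics.txt"     # must be in TREC topics format
    qrels_fn = "/data/qrels.txt"
    folds_fn = "/data/folds.json"

# Passing benchmark.name=yourNewBenchmark resolves to the registered class:
benchmark_cls = BENCHMARK_REGISTRY["yourNewBenchmark"]
```

The point is the dependency arrow: selecting the benchmark by name pulls in its collection automatically, which is why only benchmark.name needs to be set on the command line.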
One potential issue is that the collection needs to be in a format that Anserini understands; this is specified in the collection_type attribute. The same goes for the benchmark's topics_fn, but here only one format is supported. If your file isn't already in the TREC topics format, you can use topic_to_trectxt, which has been tested with Anserini's parsing code.
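For reference, the classic TREC topics format wraps each query in `<top>` tags with numbered fields. The helper below is an illustration of that format written from scratch, not capreolus's actual topic_to_trectxt (whose signature may differ); use the real helper in practice since it has been tested against Anserini's parser.

```python
def to_trec_topic(qid, title, desc="", narr=""):
    """Render one query in classic TREC topics format.
    Illustration only; capreolus's topic_to_trectxt may differ in details."""
    return (
        f"<top>\n"
        f"<num> Number: {qid}\n"
        f"<title> {title}\n"
        f"<desc> Description:\n{desc}\n"
        f"<narr> Narrative:\n{narr}\n"
        f"</top>\n"
    )

# Example: converting a plain qid -> query mapping into a topics file body
topics = {"301": "international organized crime"}
trec_txt = "".join(to_trec_topic(qid, query) for qid, query in topics.items())
```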
If you're feeling adventurous, the handling of these issues is improved in the feature/ird2 branch. This branch is working, but it isn't documented or finished yet. The IRDCollection and IRDBenchmark classes would be the place to start here.
Aah, I think I got it. A quick follow-up: if I wanted to keep a built-in collection but use a different set of training samples, I would have to register a new Sampler module. Is that right? Right now I can see that the samples are generated randomly, but suppose I wanted to implement my own way of sampling from the documents; that would be the way to go, correct?
Yep, that's right. The benchmark determines the qids that will be used (based on the fold), and the sampler samples query-doc triples to use as training data. If you're using a TF model, there's an additional shuffle that happens inside the TF trainer right now, but I think this could safely be commented out.
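As a rough sketch of what a sampler produces, the generator below yields (qid, positive doc, negative doc) training triples from qrels. It is a conceptual stand-in, not capreolus's Sampler API; the function name, the qrels layout, and the fixed seed are assumptions. A custom sampler would replace the random negative choice with its own strategy.

```python
import random

def training_triples(qids, qrels, doc_ids, rng=random.Random(42)):
    """Yield (qid, positive_doc, negative_doc) training triples.

    Conceptual stand-in for a sampler module: the benchmark's fold supplies
    `qids`, and `qrels` maps qid -> {docid: relevance_label}. A custom
    sampling strategy would change how the negative is chosen below.
    """
    for qid in qids:
        labels = qrels.get(qid, {})
        positives = [d for d, label in labels.items() if label > 0]
        negatives = [d for d in doc_ids if labels.get(d, 0) <= 0]
        for pos in positives:
            if negatives:
                yield qid, pos, rng.choice(negatives)

# Example: one relevant doc (d1) paired with a randomly chosen non-relevant doc
qrels = {"q1": {"d1": 1, "d2": 0}}
triples = list(training_triples(["q1"], qrels, ["d1", "d2", "d3"]))
```

Note that if a TF trainer reshuffles the data afterward (as mentioned above), any ordering your sampler imposes would be lost unless that shuffle is disabled.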
Thanks! That makes sense! Appreciate the quick turnaround!
Hey guys, great work with the library! I just had a quick question regarding the rerankers module. Would it be possible to train the reranker on our own custom data? Currently I'm not sure where the training data for the rerankers comes from or where it is stored. Is this something that can be overridden using a command-line argument or similar?
TIA!