capreolus-ir / capreolus

A toolkit for end-to-end neural ad hoc retrieval
https://capreolus.ai
Apache License 2.0

[Question] How to use preprocess method in Extractors? #137

ali-abz closed this issue 3 years ago

ali-abz commented 3 years ago

Hi there, I hope I'm not bothering you guys/gals with my novice questions. I am trying to create a BERT-based re-ranking model and I cannot understand how the preprocess method in the Extractor module works. The documentation says that id2vec needs to be provided for an extractor. I investigated the textbert and bertpassage extractors, and their id2vec depends on dictionaries such as self.docid2toks that are created by methods like preprocess and _build_vocab.

This part is a bit magical to me since I cannot work out which class/module calls preprocess and exactly which arguments the caller provides. I did a bit of testing with my extractor and preprocess was not invoked.

Also, self.index.get_doc and topics are used to create those dictionaries. I understand that self.index.get_doc can be provided via dependencies, but I don't understand who provides topics for us! I tested self.benchmark.topics['title'].get() instead and it works just fine, but simply receiving topics as an argument is really neat. I would appreciate any comments, thanks.

andrewyates commented 3 years ago

No worries, we're happy to help with questions like this. The preprocess method is called from the rerank task here: https://github.com/capreolus-ir/capreolus/blob/master/capreolus/task/rerank.py#L58 (this is also where topics is passed in).
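To make the call flow concrete, here is a minimal sketch of the pattern described above: a task-level object owns the topics and hands them to the extractor's preprocess before any id2vec call. The classes ToyIndex, ToyExtractor and ToyRerankTask are hypothetical stand-ins, not Capreolus's actual code, and the preprocess(qids, docids, topics) signature, the whitespace tokenization and the feature keys are assumptions made for illustration.

```python
# Toy stand-ins that only mirror the call pattern; NOT Capreolus's real classes.


class ToyIndex:
    """Hypothetical index dependency exposing get_doc(docid)."""

    def __init__(self, docs):
        self._docs = docs

    def get_doc(self, docid):
        return self._docs[docid]


class ToyExtractor:
    """Hypothetical extractor: preprocess builds the dicts that id2vec reads."""

    def __init__(self, index):
        self.index = index
        self.qid2toks = {}
        self.docid2toks = {}

    def preprocess(self, qids, docids, topics):
        # Assumed signature: the caller supplies the ids and the topics mapping.
        self.qid2toks = {qid: topics[qid].lower().split() for qid in qids}
        self.docid2toks = {
            docid: self.index.get_doc(docid).lower().split() for docid in docids
        }

    def id2vec(self, qid, docid):
        # Relies on the dictionaries populated by preprocess.
        return {"query": self.qid2toks[qid], "doc": self.docid2toks[docid]}


class ToyRerankTask:
    """Hypothetical task: the one place where topics are passed to preprocess."""

    def __init__(self, extractor, topics):
        self.extractor = extractor
        self.topics = topics

    def run(self, candidates):
        # candidates maps each qid to the list of docids to rerank.
        docids = {docid for doc_list in candidates.values() for docid in doc_list}
        self.extractor.preprocess(
            qids=list(candidates), docids=docids, topics=self.topics
        )
        return {
            qid: [self.extractor.id2vec(qid, docid) for docid in doc_list]
            for qid, doc_list in candidates.items()
        }
```

In this sketch the extractor never looks up topics itself; it only sees what the task passes in, which is why nothing calls preprocess when the extractor is used without a task.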

I'm not sure why preprocess wasn't called for your custom extractor though. Could it be that a different extractor was running? My first thought is to double check that reranker.extractor.name=YourCustomOne was set somewhere, because it may be defaulting to a different extractor.

ali-abz commented 3 years ago

I see, thanks a lot. That explains why preprocess was not called: I don't have a re-ranker yet, so I was testing the extractor by instantiating it directly instead of running it through the pipeline. I have to say, Capreolus is very well designed and written. Thanks for such a great tool.
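For anyone testing an extractor outside the pipeline in the same way, a rough sketch of the situation follows. StandaloneExtractor is a hypothetical class, not a Capreolus one, and the explicit guard in id2vec is an illustrative choice. The point is only that the lookup dictionaries stay unset until something calls preprocess, so a standalone test has to make that call itself with its own topics.

```python
class StandaloneExtractor:
    """Hypothetical extractor used to illustrate testing without a pipeline."""

    def __init__(self, get_doc):
        self.get_doc = get_doc   # callable docid -> text (stand-in for index.get_doc)
        self.qid2toks = None     # populated only by preprocess
        self.docid2toks = None   # populated only by preprocess

    def preprocess(self, qids, docids, topics):
        # Assumed signature; tokenization is plain whitespace splitting.
        self.qid2toks = {qid: topics[qid].split() for qid in qids}
        self.docid2toks = {docid: self.get_doc(docid).split() for docid in docids}

    def id2vec(self, qid, docid):
        if self.docid2toks is None:
            raise RuntimeError("preprocess() has not been called yet")
        return {"query": self.qid2toks[qid], "doc": self.docid2toks[docid]}


# Made-up data for the standalone test.
docs = {"d1": "bert passage ranking"}
topics = {"q1": "passage ranking"}

extractor = StandaloneExtractor(docs.__getitem__)

# Without a pipeline, nothing has called preprocess, so id2vec fails.
try:
    extractor.id2vec("q1", "d1")
    first_error = None
except RuntimeError as err:
    first_error = str(err)

# Calling preprocess by hand plays the role the task would normally play.
extractor.preprocess(qids=["q1"], docids=["d1"], topics=topics)
features = extractor.id2vec("q1", "d1")
```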