castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Custom Index for Hybrid Search #477

Closed: vrdn-23 closed this issue 3 years ago

vrdn-23 commented 3 years ago

Hey everyone,

I was wondering if pyserini currently offers any functionality to operate on a custom FAISS index for dense search and hybrid search?

I am currently building a FAISS index with the ANCE encoder for the TREC CAsT data (which I'd also like to contribute here once it is up and running and I have confirmed it works), and was wondering whether there is a way to use it in tandem with the simple searcher offered by Pyserini.

Thanks for the great work and looking forward to a reply!

MXueguang commented 3 years ago

Does this help? https://github.com/castorini/pyserini/blob/master/docs/usage-interactive-search.md#how-do-i-perform-dense-and-hybrid-retrieval (replace the TctColBertQueryEncoder with AnceQueryEncoder)
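
For context, hybrid retrieval combines the scores of a sparse run (e.g. BM25) and a dense run for the same query. The following is a minimal sketch of one common fusion scheme, a weighted sum of the two scores. The function name `hybrid_fuse`, the `alpha` value, and the sample scores are illustrative assumptions; this is not Pyserini's own API or implementation.

```python
def hybrid_fuse(sparse, dense, alpha=0.1, k=10):
    """Fuse two {docid: score} dicts as alpha * sparse_score + dense_score.

    Hypothetical helper for illustration only.
    """
    # A document missing from one list is given that list's minimum score,
    # so it is penalised rather than dropped (an assumption of this sketch).
    min_sparse = min(sparse.values()) if sparse else 0.0
    min_dense = min(dense.values()) if dense else 0.0

    fused = {}
    for docid in set(sparse) | set(dense):
        sparse_score = sparse.get(docid, min_sparse)
        dense_score = dense.get(docid, min_dense)
        fused[docid] = alpha * sparse_score + dense_score

    # Highest fused score first, truncated to the top k.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]


# Made-up scores for a single query.
sparse_hits = {"d1": 12.0, "d2": 9.0, "d3": 4.0}
dense_hits = {"d2": 0.80, "d4": 0.75, "d1": 0.60}

ranking = hybrid_fuse(sparse_hits, dense_hits, alpha=0.1, k=3)
for docid, score in ranking:
    print(docid, round(score, 3))
```

Because sparse and dense scores live on different scales, `alpha` usually needs tuning per collection and per encoder.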

vrdn-23 commented 3 years ago

Thanks for the quick response!

Actually, let me be clearer. Don't we already need a FAISS index for the custom data (TREC CAsT) I am working with in order for the DenseSearcher to do retrieval? The index I have is not already part of the DINDEX info, so I would need to load it locally.

So I guess my question is: is there a method or approach you would recommend, other than 'from_prebuilt_index', for loading a custom index that I have locally? That is, something to use in place of:

searcher = SimpleDenseSearcher.from_prebuilt_index('custom_data_trec_cast', encoder)

MXueguang commented 3 years ago

from_prebuilt_index supports loading from a local path too, i.e.:

searcher = SimpleDenseSearcher.from_prebuilt_index(<path to local index>, encoder)

We don't have an API for creating a Faiss index within the Pyserini package itself, but there are scripts that create the index, e.g. https://github.com/castorini/pyserini/blob/master/scripts/ance/encode_corpus_msmarco_passage.py
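
To make concrete what a dense index has to support, here is a small pure-Python sketch of exact inner-product search over already-encoded passages. It is a conceptual stand-in for what a flat (exhaustive) index computes, not the linked script and not Faiss itself. The three-dimensional vectors, the document ids, and the `search` helper are hypothetical.

```python
import heapq


def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def search(query_vec, doc_vecs, k=2):
    """Return the k (docid, score) pairs with the highest inner product."""
    scored = (
        (inner_product(query_vec, vec), docid)
        for docid, vec in doc_vecs.items()
    )
    return [(docid, score) for score, docid in heapq.nlargest(k, scored)]


# Hypothetical passage embeddings; a real encoder would produce
# vectors with hundreds of dimensions.
doc_vecs = {
    "p1": [0.1, 0.9, 0.0],
    "p2": [0.7, 0.2, 0.1],
    "p3": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]

hits = search(query, doc_vecs, k=2)
print(hits)
```

A custom index therefore needs two things to line up with the query encoder: the same embedding space and the same similarity function.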

vrdn-23 commented 3 years ago

Thanks for that script, by the way! I now understand how to load my own index. I'm using a script of my own to create my Faiss index, so this helps clear up a couple of my doubts.

I do have one question though: on line 28 of the script, shouldn't we also be passing the attention_mask into the model, since the input batch would contain pad tokens too? I'm not sure whether Hugging Face takes care of that internally, but I thought I should ask.

vrdn-23 commented 3 years ago

Aah. Nvm. I see that AnceEncoder is taking care of that in the forward! :)
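
For readers wondering why the mask matters at all: in a padded batch, the attention softmax would otherwise spread some weight onto pad positions. The sketch below is a generic pure-Python illustration of a masked softmax over one row of attention scores. It is not the AnceEncoder code, and the scores and mask are made up.

```python
import math


def masked_softmax(scores, attention_mask):
    """Softmax over scores, ignoring positions where attention_mask is 0.

    Assumes at least one position has mask value 1.
    """
    # Masked positions are set to -inf so that exp() maps them to 0.
    masked = [
        s if m == 1 else float("-inf")
        for s, m in zip(scores, attention_mask)
    ]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Two real tokens followed by two pad tokens.
scores = [2.0, 1.0, 0.5, 0.5]
attention_mask = [1, 1, 0, 0]

with_mask = masked_softmax(scores, attention_mask)
without_mask = masked_softmax(scores, [1, 1, 1, 1])

print("with mask:   ", [round(w, 3) for w in with_mask])
print("without mask:", [round(w, 3) for w in without_mask])
```

With the mask, the pad positions receive zero weight and the real tokens share all of it. Without the mask, the pads absorb part of the weight, which changes the resulting embedding depending on how much padding the batch happens to contain.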

vrdn-23 commented 3 years ago

Thanks again! I'll close the issue seeing as I got what I was looking for!