Add custom and prebuilt indices to PyseriniRetriever

jasper-xian commented 8 months ago

ref #38

commands:

from rank_llm.retrieve.pyserini_retriever import PyseriniRetriever, RetrievalMethod

# with old method
ret = PyseriniRetriever(dataset='fiqa', retrieval_method=RetrievalMethod.BM25)

# with custom index
ret = PyseriniRetriever(index_path='/path/to/fiqa/index', topics_path='beir-v1.0.0-fiqa-test', index_type='lucene')

# with prebuilt index
ret = PyseriniRetriever(index_path='beir-v1.0.0-fiqa.flat', topics_path='beir-v1.0.0-fiqa-test')

ret.retrieve_and_store(k=20)

sahel-sh commented 8 months ago

@jasper-xian thank you for working on this. a few high level questions/comments: 1- # with prebuilt index ret = PyseriniRetriever(index_path='beir-v1.0.0-fiqa.flat', topics_path='beir-v1.0.0-fiqa-test') What is the advantage of this method over the existing one? With the existing method, the beauty/simplicity is that the user does not need to worry about providing the index and topics path, that's why we have the indices and topics dictionaries. 2- I would like to leave pyserini for datasets with prebuilt indices as is and only add support for the custom indices 3- pyserini_retriever init is too comlex now. Please break it down to calling to private init functions: 1- init with dataset 2-init with index dir or something similar and call them inside the pyserini_retriver's init 4- Adding the custom index support for pyserini is needed but not enough. Ultimately the Retrieve class should have a static function Retriever.from_custom_index or something like this, and then a demo added for it. If you want to work on this item in a separate PR, it is fine by me as long as you have plans to work on it.

jasper-xian commented 8 months ago

@sahel-sh some answers to your questions:

I think the advantage is that you can now use almost any pyserini prebuilt index/topics to perform search, not limited by what the indices_dict provides. However, I kept the original simple RetrievalMethod and dataset method of retrieving as well (I just changed around the default values). This keeps the existing code supported and maintains its simplicity while adding the option to use any pyserini prebuilt index if needed.
See point 1, still maintains the current initialization of PyseriniRetriever. I can definitely change it back and just add user-indexes if needed though.
This is a good point, I'll update the PR
I can add this to the PR as well

sahel-sh commented 8 months ago

@sahel-sh some answers to your questions:

I think the advantage is that you can now use almost any pyserini prebuilt index/topics to perform search, not limited by what the indices_dict provides. However, I kept the original simple RetrievalMethod and dataset method of retrieving as well (I just changed around the default values). This keeps the existing code supported and maintains its simplicity while adding the option to use any pyserini prebuilt index if needed.

See point 1, still maintains the current initialization of PyseriniRetriever. I can definitely change it back and just add user-indexes if needed though.

Ok, I guess if rank_llm dictionaries go stale, users can still use your method for prebuild indices that are added to the pyserini recently and there is value in that. Let's keep it, but let's not expose it through a static function in retrieve class. I don't want the API to look confusing with too many options, if the customer needs to use this, they can create an instance PyseriniRetriever directly.

This is a good point, I'll update the PR

Thank you!

I can add this to the PR as well

SG!

sahel-sh commented 8 months ago

Thank you, Jasper for working on this!

castorini / rank_llm

Add custom and prebuilt indices to PyseriniRetriever #64