castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0
1.57k stars 349 forks source link

Add raw() to DenseSearchResult and PRFDenseSearchResult #1876

Closed DanielKohn1208 closed 2 months ago

DanielKohn1208 commented 2 months ago

closes #1856 by adding a raw method to both DenseSearchResult and PRFDenseSearchResult

Had to change how DenseSearchResult and PRFDenseSearchResult is initialized as well.

lintool commented 2 months ago

hey @manveertamber can you take a look at this?

manveertamber commented 2 months ago

@DanielKohn1208 can you add some documentation to explain how to fetch document content for documents retrieved using a dense index: https://github.com/castorini/pyserini/blob/master/docs/usage-fetch.md. It looks like right now we only explain how to do this for a lucene index

manveertamber commented 2 months ago

We can call .doc() on a FaissSearcher too, which routes to the lucene index .doc() but the documentation doesn't explain this either.

manveertamber commented 2 months ago

@DanielKohn1208 looks good! But maybe change the headings in the documentation to Using a Lucene Index and Using a Faiss Index instead of Using a Sparse Representation and Using a Dense Representation. And you should add some example code in the documentation that showcases your changes. Like calling .raw() on some retrieved hits.

DanielKohn1208 commented 2 months ago

@manveertamber After going through the documentation, I think that I approached this issue in completely the wrong way. As it is now, my fix actually doesn't actually solve any problems. Someone could already get a hit's raw content from a FaissIndex as follows

from pyserini.search.faiss import FaissSearcher, AutoQueryEncoder

DENSE_INDEX = 'beir-v1.0.0-nfcorpus.bge-base-en-v1.5'
encoder = AutoQueryEncoder('BAAI/bge-base-en-v1.5', device='cpu', pooling='mean', l2_norm=True)
searcher = FaissSearcher.from_prebuilt_index(DENSE_INDEX, encoder)
hits = searcher.search('How to Help Prevent Abdominal Aortic Aneurysms')
doc = searcher.doc(hits[0].docid)
raw_content = doc.raw()

Also note that my original assumption about the LuceneSearcher.search() method returning a list of objects with a raw() method was incorrect. Obtaining the raw content of one of these hits would be similar to above code.

Perhaps I am completely misunderstanding the original issue, but I think all that is actually needed is a documentation update?

manveertamber commented 2 months ago

Yeah I agree. https://github.com/castorini/pyserini/issues/1548 wanted to get passage text from a DenseSearchResult, but the right way to do this is to just to call searcher.doc(docid), which is possible with a FaissSearcher, we just needed the documentation to reflect this. @lintool is this enough to close https://github.com/castorini/pyserini/issues/1856?

DanielKohn1208 commented 2 months ago

Should I close this PR and create a new one at some point to update the docs?

manveertamber commented 2 months ago

Should I close this PR and create a new one at some point to update the docs?

Yeah that works.

DanielKohn1208 commented 2 months ago

Closing PR