castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

Add convenience method to get raw text from dense retrieval for prebuilt indexes #1856

Open lintool opened 7 months ago

lintool commented 7 months ago

This issue has come up more than once, most recently in #1548.

Our dense indexes don't store the raw text, but for a prebuilt index we know the corresponding sparse index that does have the text. It should be possible to implement a raw() method that loads the corresponding sparse index to fetch the document.
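
A rough sketch of the idea, not the actual Pyserini implementation: keep a mapping from each dense prebuilt index name to the sparse index that holds its text, and load that sparse index lazily the first time raw text is requested. The names `DENSE_TO_SPARSE`, `RawTextLookup`, and `load_lucene_searcher`, as well as the index names in the mapping, are hypothetical placeholders. The default loader assumes `LuceneSearcher.from_prebuilt_index` and `doc(...).raw()` behave as they do for sparse indexes today.

```python
from typing import Callable, Dict, Optional

# Hypothetical mapping: dense prebuilt index name -> sparse prebuilt index
# that stores the raw text. These names are placeholders, not real indexes.
DENSE_TO_SPARSE: Dict[str, str] = {
    "example-dense-index": "example-sparse-index",
}


def load_lucene_searcher(name: str):
    """Default loader. Pyserini is imported lazily, only when text is needed."""
    from pyserini.search.lucene import LuceneSearcher
    return LuceneSearcher.from_prebuilt_index(name)


class RawTextLookup:
    """Fetches raw document text for a dense prebuilt index via its sparse twin."""

    def __init__(
        self,
        dense_index_name: str,
        loader: Callable[[str], object] = load_lucene_searcher,
        mapping: Optional[Dict[str, str]] = None,
    ):
        self.dense_index_name = dense_index_name
        self._loader = loader
        self._mapping = DENSE_TO_SPARSE if mapping is None else mapping
        self._sparse = None  # loaded on first call to raw()

    def raw(self, docid: str) -> Optional[str]:
        if self._sparse is None:
            sparse_name = self._mapping.get(self.dense_index_name)
            if sparse_name is None:
                raise ValueError(
                    f"No sparse index known for dense index {self.dense_index_name!r}"
                )
            self._sparse = self._loader(sparse_name)
        doc = self._sparse.doc(docid)
        # Return None when the docid is not in the sparse index.
        return None if doc is None else doc.raw()
```

Loading lazily means users who never ask for raw text never pay for downloading or opening the sparse index. Passing the loader as a parameter also lets the lookup be tested without any index on disk.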

Yuv-sue1005 commented 3 months ago

This issue is solved for Faiss indexes using the following code:

from pyserini.search.faiss import FaissSearcher

searcher = FaissSearcher.from_prebuilt_index('insert_faiss_index', 'insert_encoder')
doc = searcher.doc('insert_doc_id').raw()

Further testing should be done for other types of indexes.