castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0
1.64k stars, 365 forks

On indexing long documents #1540

superhans closed 1 year ago

superhans commented 1 year ago

Do you have a guide, or recommended best practices, for indexing long documents and searching within individual documents?

Say I have a collection of long documents, each 10,000+ tokens, and I want to do dense retrieval on them.

One way would be to split each long document into chunks of 512 tokens (or some other size) and index each chunk. This is identical to the DPR case.
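To make the chunking step concrete, here is a minimal sketch in plain Python. It is not Pyserini code: `chunk_document` is a hypothetical helper, the `docid#n` chunk ID scheme is an assumption chosen for illustration, and the placeholder tokens stand in for whatever tokenizer the dense encoder actually uses.

```python
from typing import Dict, List


def chunk_document(doc_id: str, tokens: List[str], chunk_size: int = 512) -> Dict[str, List[str]]:
    """Split one tokenized document into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = {}
    for n, start in enumerate(range(0, len(tokens), chunk_size)):
        # Assumed ID scheme: parent document ID, '#', chunk number.
        chunks[f"{doc_id}#{n}"] = tokens[start:start + chunk_size]
    return chunks


# Placeholder tokens stand in for real tokenizer output.
tokens = [f"tok{i}" for i in range(1200)]
chunks = chunk_document("doc42", tokens)
print({cid: len(toks) for cid, toks in chunks.items()})
# {'doc42#0': 512, 'doc42#1': 512, 'doc42#2': 176}
```

Keeping the parent document ID inside each chunk ID means the chunk-to-document mapping can be recovered later without a separate lookup table.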

But done this way, at search time I am searching across all 512-token chunks in the entire corpus. What I would like to do instead is search only within a particular long document at search time (in other words, filter by long document first and then do the search).
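The behaviour I am after can be sketched as a brute-force search in plain Python. This is only an illustration of "filter first, then search", not a Pyserini API: `search_within_document` is a hypothetical function, the chunk IDs are assumed to have the form `docid#n`, and the toy 2-d vectors stand in for real encoder output.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search_within_document(
    query_vec: Vector,
    chunk_vecs: Dict[str, Vector],
    doc_id: str,
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Score only the chunks of one document, highest inner product first."""
    prefix = doc_id + "#"  # assumed chunk ID scheme: '<doc_id>#<n>'
    # Filter step: drop every chunk that belongs to another document.
    candidates = {cid: vec for cid, vec in chunk_vecs.items() if cid.startswith(prefix)}
    # Search step: exhaustive scoring over the remaining chunks.
    scored = [(cid, dot(query_vec, vec)) for cid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d embeddings; real ones would come from a dense encoder.
chunk_vecs = {
    "docA#0": [1.0, 0.0],
    "docA#1": [0.6, 0.8],
    "docB#0": [0.0, 1.0],
    "docB#1": [0.8, 0.6],
}
hits = search_within_document([0.0, 1.0], chunk_vecs, "docA", k=2)
print(hits)
# [('docA#1', 0.8), ('docA#0', 0.0)]
```

Note that `docB#0` would be the best match over the whole corpus, but it never appears in the results because its document was filtered out before scoring.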

superhans commented 1 year ago

I took a look at the discussion here: https://github.com/castorini/pyserini/discussions/1372 and also at the relevant sections of your publication (https://cs.uwaterloo.ca/~jimmylin/publications/Ma_etal_SIGIR2022.pdf), which had the segmented ms_marco_v2 corpus.

I guess a rephrasing of my original question is: if I ask a question about a particular long document, how do I ensure that all the other long documents in my index are excluded from the search?