On indexing long documents

castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.

Apache License 2.0

1.64k stars, 365 forks

Do you have a guide or recommended best practices for indexing long documents and searching within individual documents?

Say I have a collection of long documents, each 10,000+ tokens, and I want to do dense retrieval on them.

One way would be to split each long document into chunks of 512 tokens (or some other size) and index each chunk separately. This is identical to the DPR case.
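To make the chunking step concrete, here is a minimal sketch. It assumes whitespace-separated words stand in for model tokens (a real setup would use the encoder's tokenizer), and the function name, the `doc_id#n` chunk id scheme, and the field names are hypothetical, not part of Pyserini.

```python
def chunk_document(doc_id, text, chunk_size=512):
    """Split one long document into consecutive chunks of at most chunk_size tokens."""
    # Assumption: whitespace tokens approximate the encoder's tokens.
    tokens = text.split()
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        chunks.append({
            # Hypothetical id scheme: parent document id plus chunk position.
            "id": f"{doc_id}#{start // chunk_size}",
            # Keep the parent id so chunks can later be grouped or filtered by document.
            "doc_id": doc_id,
            "contents": " ".join(tokens[start:start + chunk_size]),
        })
    return chunks
```

Storing the parent `doc_id` with every chunk is the design choice that matters here, since it is what makes any later per-document restriction possible.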

But with this approach, every search runs over all 512-token chunks across the entire corpus. What I would like instead is to search only within a particular long document at search time (in other words, filter by long document first, then search among its chunks).
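The behaviour I have in mind could be sketched roughly as follows. This is a brute-force illustration in plain Python, assuming the chunk embeddings are already computed and small enough to hold in memory; it is not a Pyserini API, and both function names are hypothetical.

```python
from collections import defaultdict


def build_per_document_index(chunk_vectors):
    """Group (chunk_id, doc_id, vector) triples by their parent document."""
    index = defaultdict(list)
    for chunk_id, doc_id, vector in chunk_vectors:
        index[doc_id].append((chunk_id, vector))
    return index


def search_within_document(index, doc_id, query_vector, k=3):
    """Score only the chunks of one document by inner product, best first."""
    scored = [
        (sum(q * v for q, v in zip(query_vector, vector)), chunk_id)
        # .get avoids creating an empty entry for an unknown document id.
        for chunk_id, vector in index.get(doc_id, [])
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```

Because the filter is applied before scoring, chunks from other documents are never compared against the query, however well they would match.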


On indexing long documents #1540