Closed by superhans 1 year ago
I took a look at the discussion here: https://github.com/castorini/pyserini/discussions/1372 and at the relevant sections of your publication (https://cs.uwaterloo.ca/~jimmylin/publications/Ma_etal_SIGIR2022.pdf), which used the segmented ms_marco_v2 corpus.
I guess a rephrase of my original question is: if I ask a question about one particular long document, how do I ensure that all the other long documents in my index are excluded from the search?
Do you have a guide, or recommended best practices, for indexing long documents and searching within individual documents?
So, let us say I have a collection of long documents of 10,000+ tokens each, and I want to do dense retrieval on them.
Now, one way would be to split each long document into chunks of 512 tokens (or whatever size) and index each chunk. This is identical to the DPR case.
But doing it this way, at search time I'm searching across all the 512-token chunks in the entire corpus. What I would like to do at search time is search only within one particular long document (in other words, filter by long document first and then do the search).
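To make the "filter by document first, then search" idea concrete, here is a minimal sketch in plain Python. It is not Pyserini's API. `PerDocumentIndex`, `chunk_tokens`, and `toy_embed` are hypothetical names invented for illustration, whitespace splitting stands in for a real tokenizer, and `toy_embed` is a hashed bag-of-words placeholder for a real dense encoder. The only point is the layout: chunks are grouped by their parent document ID, so a query scores only the chunks of the chosen document.

```python
import math
import zlib
from collections import defaultdict


def chunk_tokens(tokens, size=512):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def toy_embed(tokens, dim=64):
    """Hypothetical stand-in for a dense encoder (assumption, not a real model):
    a hashed bag-of-words vector, L2-normalized."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


class PerDocumentIndex:
    """Stores chunk vectors grouped by parent document ID."""

    def __init__(self, chunk_size=512):
        self.chunk_size = chunk_size
        # doc_id -> list of (chunk_id, chunk_text, chunk_vector)
        self.chunks = defaultdict(list)

    def add_document(self, doc_id, text):
        tokens = text.split()  # whitespace tokens, a simplification
        for n, chunk in enumerate(chunk_tokens(tokens, self.chunk_size)):
            entry = (f"{doc_id}#{n}", " ".join(chunk), toy_embed(chunk))
            self.chunks[doc_id].append(entry)

    def search(self, doc_id, query, k=3):
        """Score only the chunks of `doc_id`; other documents are never touched."""
        q = toy_embed(query.split())
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk_id, text)
            for chunk_id, text, vec in self.chunks.get(doc_id, [])
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:k]


if __name__ == "__main__":
    index = PerDocumentIndex(chunk_size=4)
    index.add_document("doc_a", "alpha beta gamma delta epsilon zeta eta theta")
    index.add_document("doc_b", "epsilon zeta appears here too in another document")
    for score, chunk_id, text in index.search("doc_a", "epsilon zeta", k=2):
        print(f"{score:.3f}  {chunk_id}  {text}")
```

Since a single document of 10,000+ tokens yields only a few dozen 512-token chunks, brute-force scoring within one document is cheap. With a real encoder and index the same effect could come from keeping a chunk-to-document mapping and restricting candidates to the chosen document's chunk IDs, or from building one small index per document. Which of these fits a given toolkit is left open here.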