apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.58k stars 1.01k forks source link

Explore within-block skipping for postings #12486

Open mikemccand opened 1 year ago

mikemccand commented 1 year ago

Description

One of the differences between Tantivy and Lucene is that Tantivy supports within-block skipping (branchless binary search to locate a target docid inside a block of 128 postings), but Lucene skips only to block boundaries, and then does a linear scan within. Lucene is also doing the "accumulate docid deltas into the absolute docid" too in this loop, but I guess Tantivy does this separately somehow?

Anyway, we could explore within-block skipping as well -- it might make conjunctive queries where both terms have roughly the same cardinality, quite a bit faster?

jpountz commented 10 months ago

Lucene is also doing the "accumulate docid deltas into the absolute docid" too in this loop, but I guess Tantivy does this separately somehow?

I believe Tantivy does the same, except that it can take advantage of SIMD to accumulate docid deltas into the absolute docid (if it did not accumulate deltas up-front, it could not run a branchless binary search later on). I tried to look into whether we can do the same with Panama, but last time I checked it doesn't give ways to use _mm_slli_si128, which prevents us from making a faster prefix sum through vectorization: https://github.com/jpountz/vectorized-prefix-sum.

For reference, there is also this old @mkhludnev idea about encoding dense postings lists as bitsets, which would naturally help with skipping: #6116 (or can we do it on a per-block basis?). And more generally, there are some formats that are better at skipping within blocks like Elias-Fano.