mikemccand opened this issue 1 year ago
I looked into the code a little bit. It seems like `MAX_SKIP_LEVELS` is hardcoded. Any good idea on how to easily set it to 1? I think if we limit the number of levels to 1, then the skip multiplier won't take effect.
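For context, here is a rough sketch of how the number of skip levels falls out of the doc count, the level-0 interval, and the multiplier. This is only an illustration of the idea, not Lucene's actual `MultiLevelSkipListWriter` code, and the class/method names are made up:

```java
public class SkipLevelsSketch {
  /**
   * Rough illustration (not Lucene's code): level 0 gets one skip entry every
   * skipInterval docs, and each higher level covers skipMultiplier times more
   * docs than the level below it, up to maxSkipLevels.
   */
  static int numSkipLevels(int docFreq, int skipInterval, int skipMultiplier, int maxSkipLevels) {
    int levels = 1;
    long docsCoveredByNextLevel = (long) skipInterval * skipMultiplier;
    while (levels < maxSkipLevels && docFreq >= docsCoveredByNextLevel) {
      levels++;
      docsCoveredByNextLevel *= skipMultiplier;
    }
    return levels;
  }
}
```

With the level count capped at 1, the loop never runs and only level 0 is ever written, so the multiplier indeed becomes irrelevant.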
In my experience `AndHighLow` is the task that is the most likely to bottleneck on decoding postings. This is due to the fact that you often need to decode an entire block of postings of the High clause to check a single posting of the Low clause. I wonder if this has more to do with Tantivy's better decoding logic than with skipping.
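To make that cost concrete, here is a minimal sketch of the leapfrog intersection that a two-clause conjunction boils down to (Lucene's real logic lives in its conjunction iterator; the class name below is made up). Every `advance()` into the High clause may force decoding a full postings block just to confirm or reject one candidate from the Low clause:

```java
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

public class LeapfrogSketch {
  /** Walks two postings iterators in lockstep, always advancing whichever is behind. */
  static void intersect(DocIdSetIterator low, DocIdSetIterator high) throws IOException {
    int doc = low.nextDoc();
    while (doc != DocIdSetIterator.NO_MORE_DOCS) {
      // advance() on the high-frequency clause may decode an entire 128-doc
      // postings block just to answer this single membership question
      int other = high.advance(doc);
      if (other == DocIdSetIterator.NO_MORE_DOCS) {
        break;
      }
      if (other == doc) {
        // match: collect(doc) would happen here
        doc = low.nextDoc();
      } else {
        // the high clause jumped past us: leapfrog the low clause forward
        doc = low.advance(other);
      }
    }
  }
}
```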
The comparison with a single-level skip list would still be interesting. Note that this will also affect block-max AND, since Lucene also uses higher levels of skip lists to store score upper bounds over longer ranges of doc IDs. You will probably want to use the counting task to make the comparison more fair.
It might also be interesting to evaluate different encodings that don't have such a high up-front overhead per block, such as roaring bitmaps or partitioned Elias-Fano.
> In my experience `AndHighLow` is the task that is the most likely to bottleneck on decoding postings

You're absolutely right. Lucene is the clear winner if you select `TOP_10_COUNT` and `AndHighLow` in https://tony-x.github.io/search-benchmark-game/.
> Note that this will also affect block-max AND, since Lucene also uses higher levels of skip lists to store score upper bounds over longer ranges of doc IDs. You will probably want to use the counting task to make the comparison more fair.

Tantivy also implements BMW, so it holds per-block score upper bounds. This means that if we reduce Lucene to a single skip level, the two would be comparable.
Set `MAX_SKIP_LEVELS=1` in Lucene and ran the benchmark and the control on an m6g.4xlarge machine.
To set up the benchmark, I ran `gradlew jar` in Lucene, copied the core jar to my benchmark workspace, and added this line to `build.gradle`: `implementation files("./libs/lucene-core-9.7.1-SNAPSHOT.jar")`
`TOP_10_COUNT` has the largest average performance gain from using a single-level skip list, and `COUNT` seems to slightly regress.
HighTerm, MedTerm and LowTerm are the queries where Lucene is much slower with this setup, at +52.45%, +46.50%, and +30.73% average latency respectively.
Attaching results.json: results-skiplist-merg.json.zip
> HighTerm, MedTerm and LowTerm are the queries where Lucene is much slower with this setup

Is this because BMW was less efficient? How is the performance of `AndHighLow` affected?
This is because term queries check higher levels of impacts to skip more docs at once. For instance, when collecting top-100 hits, after collecting 1000 blocks of postings it's not unlikely that the best score at level 1 (which covers 8 level-0 blocks) is less than the minimum competitive score, so `ImpactsDISI` would then skip 8 blocks at once.
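For reference, here is a stripped-down sketch of that check using Lucene's `ImpactsEnum`/`Impacts` API as I understand it (the real logic, including caching of per-level max scores and handling of the `NO_MORE_DOCS` edge case, lives in `ImpactsDISI` and is more involved; the class and method names below are made up):

```java
import java.io.IOException;
import org.apache.lucene.index.Impact;
import org.apache.lucene.index.Impacts;
import org.apache.lucene.index.ImpactsEnum;
import org.apache.lucene.search.similarities.Similarity.SimScorer;

public class ImpactSkipSketch {
  /**
   * Returns the first doc at or after target that could still be competitive,
   * widening over impact levels while the per-level maximum score stays below
   * the minimum competitive score.
   */
  static int firstPossiblyCompetitiveDoc(
      ImpactsEnum impactsEnum, SimScorer scorer, float minCompetitiveScore, int target)
      throws IOException {
    impactsEnum.advanceShallow(target);
    Impacts impacts = impactsEnum.getImpacts();
    int upTo = target;
    for (int level = 0; level < impacts.numLevels(); level++) {
      float maxScore = 0f;
      for (Impact impact : impacts.getImpacts(level)) {
        maxScore = Math.max(maxScore, scorer.score(impact.freq, impact.norm));
      }
      if (maxScore >= minCompetitiveScore) {
        break; // this range might contain a competitive hit, stop widening
      }
      // the whole range covered by this level is non-competitive: skip past it
      // (real code guards against overflow when the level covers NO_MORE_DOCS)
      upTo = impacts.getDocIdUpTo(level) + 1;
    }
    return upTo;
  }
}
```

With only a single skip level, only the smallest ranges carry score upper bounds, so these large skips are no longer possible, which is consistent with term queries being the ones that regress most.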
@Tony-X and I were chatting about the difference in skipping between Tantivy (single level, fast binary search) and Lucene (multi-level, sort of complex implementation) and wondering whether that is causing the `AndHighLow` performance difference. We could test this by disabling Lucene's multi-level skip list at indexing time, or, rather, enabling it but configuring the level skip multiplier so high (`Integer.MAX_VALUE`?) that it only ever uses a single skip level.
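A quick back-of-the-envelope check of the `Integer.MAX_VALUE` idea (a sketch, not Lucene's code; the 128-doc level-0 interval is assumed from recent Lucene's postings block size): a second skip level is only worth writing once a posting list spans roughly `skipInterval * multiplier` docs, so a huge multiplier means higher levels are effectively never reached.

```java
public class SkipMultiplierCheck {
  public static void main(String[] args) {
    long skipInterval = 128;              // assumed level-0 granularity: one skip entry per postings block
    long multiplier = Integer.MAX_VALUE;  // the proposed "effectively single level" value
    long docsNeededForLevel1 = skipInterval * multiplier;
    // ~2.7e11 docs, far beyond any realistic per-term doc frequency, so no
    // term would ever get a second skip level written.
    System.out.println("level 1 would need >= " + docsNeededForLevel1 + " docs");
  }
}
```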