Open jpountz opened 1 year ago
I did a first indexing run on wikibigall with the following merge policy, which I tried to make as lightweight as possible:
BPIndexReorderer reorderer = new BPIndexReorderer();
reorderer.setMinDocFreq(16384);
reorderer.setMaxIters(3);
reorderer.setMinPartitionSize(8192);
mp = new BPReorderingMergePolicy(mp, reorderer, 131072);
Indexing ran in 3402170 msec vs. 2610068 msec without reordering, ie. 30% slower. (This is when running with default params, ie. maxBufferedDocs=12119, SerialMergeScheduler, LogDocMergePolicy (wrapped within BPReorderingMergePolicy), etc.) Search was noticeably faster:
TaskQPS baseline StdDevQPS my_modified_version StdDev Pct diff p-value
PKLookup 279.70 (1.6%) 247.15 (2.8%) -11.6% ( -15% - -7%) 0.000
HighTerm 441.90 (7.8%) 414.88 (4.4%) -6.1% ( -17% - 6%) 0.002
CountOrHighMed 94.53 (16.3%) 92.77 (15.9%) -1.9% ( -29% - 36%) 0.715
CountOrHighHigh 60.96 (16.4%) 60.03 (16.4%) -1.5% ( -29% - 37%) 0.768
HighTermMonthSort 4314.65 (2.3%) 4274.31 (2.8%) -0.9% ( -5% - 4%) 0.248
Respell 64.51 (1.1%) 64.38 (1.6%) -0.2% ( -2% - 2%) 0.654
CountPhrase 3.54 (11.2%) 3.59 (8.2%) 1.2% ( -16% - 23%) 0.700
Wildcard 71.79 (2.6%) 72.66 (2.7%) 1.2% ( -3% - 6%) 0.143
Fuzzy2 89.57 (0.9%) 90.80 (1.2%) 1.4% ( 0% - 3%) 0.000
Prefix3 123.64 (3.4%) 125.34 (2.8%) 1.4% ( -4% - 7%) 0.159
CountTerm 14193.65 (3.6%) 14589.84 (2.9%) 2.8% ( -3% - 9%) 0.007
IntNRQ 289.67 (6.0%) 299.22 (5.6%) 3.3% ( -7% - 15%) 0.074
HighPhrase 5.95 (7.6%) 6.16 (9.3%) 3.6% ( -12% - 22%) 0.180
Fuzzy1 104.33 (0.9%) 108.08 (1.2%) 3.6% ( 1% - 5%) 0.000
LowPhrase 17.70 (3.3%) 18.74 (4.9%) 5.9% ( -2% - 14%) 0.000
MedTerm 533.08 (7.8%) 568.40 (4.5%) 6.6% ( -5% - 20%) 0.001
OrHighHigh 56.43 (5.8%) 60.45 (7.0%) 7.1% ( -5% - 21%) 0.000
CountAndHighMed 124.71 (3.2%) 136.61 (4.5%) 9.5% ( 1% - 17%) 0.000
OrHighMed 212.88 (4.0%) 233.35 (5.0%) 9.6% ( 0% - 19%) 0.000
OrHighLow 604.12 (2.8%) 676.18 (4.5%) 11.9% ( 4% - 19%) 0.000
AndHighLow 933.07 (2.3%) 1046.85 (2.7%) 12.2% ( 7% - 17%) 0.000
LowTerm 947.45 (6.1%) 1091.11 (4.7%) 15.2% ( 4% - 27%) 0.000
AndHighMed 197.62 (3.1%) 232.11 (3.1%) 17.5% ( 10% - 24%) 0.000
MedPhrase 42.93 (2.7%) 50.47 (5.4%) 17.6% ( 9% - 26%) 0.000
AndHighHigh 52.74 (4.0%) 63.41 (4.6%) 20.2% ( 11% - 30%) 0.000
CountAndHighHigh 41.69 (3.5%) 50.60 (5.6%) 21.4% ( 11% - 31%) 0.000
HighTermDayOfYearSort 444.76 (1.7%) 652.11 (2.0%) 46.6% ( 42% - 51%) 0.000
I'll look into whether I can reduce the merge-time overhead.
This would be awesome to enable by default. It would somehow disable itself if the application sets its own static index sort?
It's odd/curious that PKLookup
got slower -- maybe the loss of locality/predictability of the primary key in luceneutil
makes BlockTree work harder for lookups. Given the sizable gains on the other queries, we can accept some loss of indexing performance in exchange for these search-time gains, by default?
It would somehow disable itself if the application sets its own static index sort?
This is correct. This bit already works on the PR, IndexWriter doesn't check the new method that OneMerge
got that allows reordering the resulting segment if an index sort is configured in the IndexWriterConfig
.
It's odd/curious that PKLookup got slower
The terms dictionary should be exactly the same, only doc IDs get renumbered. BlockTree optimizes for incremental IDs by storing the delta of a singleton doc ID with the previous doc ID (termState.singletonDocID += BitUtil.zigZagDecode(l >>> 1);
in Lucene90PostingsReader
). I wonder that it doesn't only help with storage, but also with lookups since it's stored as a single-byte vInt in the incremental case, but often on a variable number of bytes after reordering.
I was just checking out a profile, and with this lightweight BP configuration, we end up spending more time on building the forward index (essentially calling OfflineSorter
on all postings) and merging (because SortingCodecReader
introduces overhead as it needs to re-sort e.g. postings on the fly) than on reordering.
essentially calling OfflineSorter on all postings
FYI, I came up with some ideas to optimize this sort before, hoping to be helpful :)
maxDoc < (1 << 24)
.These sound like great ideas!
I initialized a PR on these ideas https://github.com/apache/lucene/pull/12712
As I expect out-of-the-box recursive graph bisection to be a bit controversial, I opened a PR to make it easier to use without making it the default: #13765.
Description
It would be nice to enable recursive graph bisection out of the box, so that users don't even have to know that it exists or what it is to enjoy its search-time performance benefits.
addDocument
is used.