apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

How to configure TieredMergePolicy for very low segment count? #14004

Open · jpountz opened this issue 1 week ago

jpountz commented 1 week ago

Description

I have been experimenting with configuring TieredMergePolicy to keep the segment count very low:
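Something along these lines (a sketch; the 512MB floor and 2 segments per tier are assumptions based on values mentioned later in this thread, not necessarily the exact setup):

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

// Sketch: push TieredMergePolicy towards a very low segment count.
// The concrete values are assumptions, not a verbatim copy of the setup.
TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setFloorSegmentMB(512);  // segments below 512MB all count as floor-sized
mergePolicy.setSegmentsPerTier(2);   // aim for ~2 segments per tier
IndexWriterConfig config = new IndexWriterConfig().setMergePolicy(mergePolicy);
```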

This typically helps if you run queries that have a high per-segment overhead (vector search, multi-term queries) and have a low indexing throughput (especially if indexing and search run on separate hardware so that merges don't disturb searches).

Interestingly, an index that is less than 1GB can still have 10 segments with the above merge policy, because of the constraint to not run merges where the resulting segment is less than 50% bigger than the biggest input segment. E.g. consider the following segment sizes: 100kB, 300kB, 800kB, 2MB, 5MB, 12MB, 30MB, 70MB, 150MB, 400MB. There is no pair of segments whose sum is more than 50% bigger than the larger input segment.
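To make the arithmetic concrete, here is a standalone check (plain Java, not Lucene code) that no pair in this list passes the 50% test:

```java
// For the segment sizes above, no candidate 2-segment merge produces a
// segment more than 50% bigger than its largest input.
long[] sizes = {100_000, 300_000, 800_000, 2_000_000, 5_000_000,
    12_000_000, 30_000_000, 70_000_000, 150_000_000, 400_000_000};
for (int i = 0; i < sizes.length; i++) {
  for (int j = i + 1; j < sizes.length; j++) {
    if (sizes[i] + sizes[j] > 1.5 * Math.max(sizes[i], sizes[j])) {
      System.out.println(sizes[i] + " + " + sizes[j] + " would be mergeable");
    }
  }
}
// Prints nothing: the closest candidate, 70MB + 150MB = 220MB, is only
// ~47% bigger than 150MB.
```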

I have a bias against removing this constraint, since containing write amplification is important to avoid quadratic merging, but I wonder if there are other ways we could further reduce the number of segments.
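For intuition on why the constraint exists: without it, a degenerate policy could keep merging every new flush into the single biggest segment, rewriting it each time. A toy sketch of that worst case:

```java
// Degenerate strategy: merge every newly flushed segment (size 1) into
// the single big segment. Total bytes written is 1 + 2 + ... + N ≈ N²/2,
// so write amplification grows linearly with the number of flushes.
int flushes = 1_000;
long bigSegment = 0, totalWritten = 0;
for (int i = 0; i < flushes; i++) {
  bigSegment += 1;             // the merge absorbs the new flush...
  totalWritten += bigSegment;  // ...and rewrites the whole big segment
}
System.out.println("write amplification: " + totalWritten / (double) flushes);
// ~500x here, vs. roughly logarithmic for a policy with a growth constraint
```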

For instance, TieredMergePolicy automatically takes the min of maxMergeAtOnce and numSegsPerTier as the merge factor, but it's not clear to me why this is important. If the merge policy allowed merges to have between 2 and 10 segments in the above example, it could find merges in the described segment structure, and this would likely help achieve lower write amplification for the same segment count?
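For example, merging the four smallest segments above (100kB + 300kB + 800kB + 2MB = 3.2MB) grows the biggest input by 60%, which passes the 50% check. A quick enumeration (again a standalone sketch) of the merges that variable widths would unlock:

```java
// Enumerate windows of 2..10 consecutive segments and report those whose
// merged size exceeds 1.5x the largest input, i.e. merges that the 50%
// constraint would accept once variable merge widths are allowed.
long[] sizes = {100_000, 300_000, 800_000, 2_000_000, 5_000_000,
    12_000_000, 30_000_000, 70_000_000, 150_000_000, 400_000_000};
for (int len = 2; len <= 10; len++) {
  for (int start = 0; start + len <= sizes.length; start++) {
    long total = 0, maxInput = 0;
    for (int k = start; k < start + len; k++) {
      total += sizes[k];
      maxInput = Math.max(maxInput, sizes[k]);
    }
    if (total > 1.5 * maxInput) {
      System.out.println(len + " segments starting at index " + start);
    }
  }
}
```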

Other ideas?

jpountz commented 1 week ago

Interestingly, it looks like this PR, https://github.com/apache/lucene/pull/266, would do what I'm looking for, even though it was aimed at solving a different problem (!)

jpountz commented 1 week ago

I confirmed that #266 seems to help for this case. It does find merges to run with the above example (100kB, 300kB, 800kB, 2MB, 5MB, 12MB, 30MB, 70MB, 150MB, 400MB). And when I run our simulation tests in BaseMergePolicyTestCase, it ends up with a similar number of segments but at a lower write amplification.

jpountz commented 1 week ago

I ran the IndexGeoNames benchmark with 1 indexing thread, SerialMergeScheduler, 10k buffered docs, a 100MB floor segment size, and 2 segments per tier. This made the total indexing time go from 275s to 250s with #266, which suggests that the reduced write amplification is indeed helping, even though segments are flushed one at a time.
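For reference, that setup corresponds to something along these lines (a hypothetical reconstruction, not the actual benchmark code):

```java
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.index.TieredMergePolicy;

// Hypothetical reconstruction of the benchmark's indexing settings.
TieredMergePolicy mergePolicy = new TieredMergePolicy();
mergePolicy.setFloorSegmentMB(100);  // 100MB floor segment size
mergePolicy.setSegmentsPerTier(2);   // 2 segments per tier
IndexWriterConfig config = new IndexWriterConfig()
    .setMaxBufferedDocs(10_000)      // flush every 10k docs...
    .setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH)  // ...not by RAM usage
    .setMergeScheduler(new SerialMergeScheduler())  // merges run on the indexing thread
    .setMergePolicy(mergePolicy);
```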

mikemccand commented 1 week ago

> Interestingly, an index that is less than 1GB can still have 10 segments with the above merge policy, because of the constraint to not run merges where the resulting segment is less than 50% bigger than the biggest input segment. E.g. consider the following segment sizes: 100kB, 300kB, 800kB, 2MB, 5MB, 12MB, 30MB, 70MB, 150MB, 400MB. There is no pair of segments whose sum is more than 50% bigger than the larger input segment.

Hmm, shouldn't `floorSegmentSize=512MB` mean all of these segments are considered the "same size" (level) and all mergeable? Or does that 50% check trump the flooring?

> For instance, TieredMergePolicy automatically takes the min of maxMergeAtOnce and numSegsPerTier as the merge factor, but it's not clear to me why this is important. If the merge policy allowed merges to have between 2 and 10 segments in the above example, it could find merges in the described segment structure, and this would likely help achieve lower write amplification for the same segment count?

I think it did this in order to aim for an index geometry over time that has logarithmically sized levels (ish), where segment sizes tend to cluster into these close-ish levels? It mimics the behavior of the more carefully defined LogDocMergePolicy/LogByteSizeMergePolicy. But this doesn't seem intrinsic/important -- +1 to allow it to pick a range of segments to merge at once?

"Optimal" merging is hard! I wish we had the perfect merge policy that simply took as input what amortized write amplification is acceptable during indexing, and given that budget would aim to maximize search performance (by some approximate measure)... for apps using NRT segment replication, where one JVM indexes and N JVMs search (physically/virtually different instances) the efficiently incrementally replicated index, they would tolerate possibly very high write amplification (Amazon Product Search falls into this case). But for other apps that are indexing and searching on the same node, there's likely much less tolerance in burning so much CPU/IO to eek out a slightly more efficient index for searching ...

jpountz commented 4 days ago

> Or does that 50% check trump the flooring?

It does indeed trump the flooring. The reasoning is that even if a user is happy to spend lots of hardware resources on merging, they would likely still be unhappy with merging running in quadratic time. That said, the 50% check may be too conservative. We could look into relaxing it, or into making it a function of numSegmentsPerTier (only increasing the segment size by 50% when you merge 10 segments at once sounds bad, less so when you merge 2 segments at once?).
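One arbitrary way to sketch such a function (purely illustrative, not a concrete proposal):

```java
// Arbitrary illustration (not Lucene code) of making the growth check a
// function of how many segments are merged at once: narrow merges need
// little growth, wide merges need a lot.
static boolean acceptMerge(long mergedBytes, long maxInputBytes, int numInputSegments) {
  // Hypothetical rule: ~25% required growth per extra input segment,
  // so 2 inputs -> 1.25x, 10 inputs -> 3.25x. The 25% slope is made up.
  double requiredGrowth = 1.0 + 0.25 * (numInputSegments - 1);
  return mergedBytes >= requiredGrowth * maxInputBytes;
}
```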

jpountz commented 4 days ago

> But this doesn't seem intrinsic/important -- +1 to allow it to pick a range of segments to merge at once?

I ran some quick simulations with maxMergeAtOnce > segsPerTier and this doesn't seem to hurt. It actually seems as good, if not better, at maintaining a low segment count while also keeping merge amplification lower than it is today.