
Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of very large segments [LUCENE-7976] #9025

asfimport closed this issue 6 years ago

asfimport commented 6 years ago

We're seeing situations "in the wild" where very large indexes (on disk) are handled quite easily by a single Lucene index. This is particularly true as features like docValues move data into MMapDirectory space. The current TMP algorithm allows on the order of 50% deleted documents, as per a dev list conversation with Mike McCandless (and his blog here: https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).

Especially in the current era of very large indexes in aggregate (think many TB), solutions like "you need to distribute your collection over more shards" become very costly. Additionally, the tempting "optimize" button exacerbates the issue: once you form, say, a 100G segment (by optimizing/forceMerging), it is not eligible for merging until 97.5G of the docs in it are deleted (with the current default 5G max segment size).

The proposal here would be to add a new parameter to TMP, something like <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name; suggestions welcome), which would default to 100 (i.e. the same behavior we have now).

So if I set this parameter to, say, 20%, and the max segment size stays at 5G, the following would happen when segments were selected for merging:

Any segment with > 20% deleted documents would be merged or rewritten NO MATTER HOW LARGE. There are two cases:

- The segment has < 5G "live" docs. In that case it would be merged with smaller segments to bring the resulting segment up to 5G. If no smaller segments exist, it would just be rewritten.
- The segment has > 5G "live" docs (the result of a forceMerge or optimize). It would be rewritten into a single segment removing all deleted docs, no matter how big it is to start. The 100G example above would be rewritten to an 80G segment, for instance.

Of course this would lead to potentially much more I/O which is why the default would be the same behavior we see now. As it stands now, though, there's no way to recover from an optimize/forceMerge except to re-index from scratch. We routinely see 200G-300G Lucene indexes at this point "in the wild" with 10s of shards replicated 3 or more times. And that doesn't even include having these over HDFS.

Alternatives welcome! Something like the above seems minimally invasive. A new merge policy is certainly an alternative.
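
A minimal sketch of the selection rule proposed above (purely hypothetical code, not part of TieredMergePolicy; the parameter name and types are illustrative only):

```java
// Hypothetical illustration of the proposed rule; not actual TieredMergePolicy code.
final class ProposedBigSegmentRule {

  enum Action { LEAVE_ALONE, MERGE_WITH_SMALLER, SINGLETON_REWRITE }

  /** Decide what to do with one segment under the proposed maxAllowedPctDeleted knob. */
  static Action decide(long liveBytes, double pctDeleted,
                       long maxSegmentBytes, double maxAllowedPctDeleted) {
    if (pctDeleted <= maxAllowedPctDeleted) {
      return Action.LEAVE_ALONE;        // under the threshold: behave exactly as TMP does today
    }
    if (liveBytes < maxSegmentBytes) {
      return Action.MERGE_WITH_SMALLER; // pack it together with smaller segments up toward the cap
    }
    return Action.SINGLETON_REWRITE;    // e.g. the 100G forceMerged segment rewritten to 80G
  }
}
```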


Migrated from LUCENE-7976 by Erick Erickson (@ErickErickson), 7 votes, resolved Jun 17 2018. Attachments: LUCENE-7976.patch (13 versions), SOLR-7976.patch. Linked issues:

asfimport commented 6 years ago

Nik Everett (@nik9000) (migrated from JIRA)

I had this issue on a previous project. Our indices were smaller than what you are talking about, but we did have one or two of the max-size segments that refused to merge away their deleted documents until they got to 50%. We had a fairly high update rate and a very high query rate. The deleted documents bloated the working set size somewhat, causing more IO, which was our bottleneck at the time. I would have been happy to pay for the increased merge IO to have lower query-time IO.

We ultimately solved the problem by throwing money at it. More RAM and better SSDs make life much easier. I would have liked to have solved the problem in software, but as a very infrequent contributor I didn't feel like I'd ever get a change to TieredMergePolicy merged.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I would have liked to have solved the problem in software, but as a very infrequent contributor I didn't feel like I'd ever get a change to TieredMergePolicy merged.

Please don't think like that :) Good ideas are good ideas regardless of who they come from!

It's too bad people call forceMerge and get themselves into this situation to begin with ;) Maybe we should remove that method! Or maybe the index should be put into a read-only state after you call it?

Anyway, +1 to add another option to TMP. Maybe it should apply to the whole index? I.e., the parameter states that the index at all times should have less than X% deletions overall? This way TMP is free to merge whichever segments will get it to that, but that would typically mean merging the big segments since they have most of the deletes.

The danger here is that if you set that parameter to 0 the result is catastrophic merging. Maybe we place a lower bound (20%) on how low you can set that parameter?

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

-1 to making the index read-only. It's just too easy to get into that trap and be stuck forever. Maybe a really strongly worded warning in the docs and the Solr admin UI?

-1 for taking the option away. I see far too many situations where users index rarely, say once a day, and want to optimize to squeeze every last bit of performance they can. And maybe take the option out of Solr's admin UI, but that's a separate issue.

I think one of the critical bits is to rewrite segments that are > X% deleted, no matter how big. At least that gives people a way to recover, albeit painfully. Whatever the solution is, it needs to do that, I think.

As for whether X% is per segment or index-wide, I don't have any strong preferences. Enforcing it on a per-segment basis would automatically make it true for the entire index, but doing it index-wide would allow for fewer rewrites. Say you have one small segment with 50% deleted docs that makes up 1% of the whole index; there's not much need to rewrite it.....

asfimport commented 6 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think the main issue here is that only Solr still calls the option "optimize" in the update request handler, which is misleading. Maybe change the name so it's not such an "oh, that's a good thing, it makes everything better" option.

I know the issue, so the first thing I tell Solr customers is: "never ever call optimize unless your index is static."

asfimport commented 6 years ago

David Smiley (@dsmiley) (migrated from JIRA)

"I think the main issue" ... I disagree; this issue is about freeing up many deleted docs. Uwe, feel free of course to create a Solr issue to rename "optimize" to "forceMerge" and to suggest where the Solr Ref Guide's wording is either bad or needs improvement. I think these are clearly separate from this issue.

asfimport commented 6 years ago

Timothy M. Rodriguez (migrated from JIRA)

Agreed, it's not strictly a result of optimizations. It can happen for large collections or with many updates to existing documents.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It can happen for large collections or with many updates to existing documents.

Hmm can you explain how? TMP should produce max-sized segments of ~5 GB, and allow at most 50% deleted documents in them, at which point they are eligible for merging.

Doing a forceMerge and then continuing to add documents to your index can result in a large (> 5 GB) segment with more than 50% deletions not being merged away.

But I don't see how this can happen if you didn't do a forceMerge in the past?
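
A small configuration sketch for context, assuming the Lucene API of this era (the class name and analyzer choice are illustrative, and the values shown are the defaults under discussion):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class TmpDefaultsExample {
  public static IndexWriterConfig defaultsUnderDiscussion() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // ~5 GB cap on segments produced by natural merging (the default discussed above).
    tmp.setMaxMergedSegmentMB(5 * 1024);
    // The "50% deletes" behavior is not a separate knob: a max-sized segment only becomes
    // eligible for natural merging again once its live data drops below half of this cap.
    return new IndexWriterConfig(new StandardAnalyzer()).setMergePolicy(tmp);
  }
}
```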

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It's too bad people call forceMerge and get themselves into this situation to begin with ;) Maybe we should remove that method! Or maybe the index should be put into a read-only state after you call it?

I know the issue, so the first thing I tell Solr customers is: "never ever call optimize unless your index is static."

The read-only idea is really cool; maybe consider deprecating forceMerge() and adding freeze()? I think this removes the trap completely and still allows for use-cases where people just want fewer segments for the read-only case.

asfimport commented 6 years ago

Michael Sokolov (@msokolov) (migrated from JIRA)

How about having forceMerge() obey the max segment size? If you really want to merge down to one segment, you have to change the policy to increase the max size.

asfimport commented 6 years ago

Timothy M. Rodriguez (migrated from JIRA)

If a collection has many 5GB segments, it's possible for many of them to be at less than 50% but still accumulate a fair amount of deletes. Increasing the max segment size helps, but increases the amount of churn on disk through large merges.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

How about having forceMerge() obey the max segment size? If you really want to merge down to one segment, you have to change the policy to increase the max size.

+1, that makes a lot of sense. Basically TMP is buggy today because it allows forceMerge to create too-big segments.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

If a collection has many 5GB segments, it's possible for many of them to be at less than 50% but still accumulate a fair amount of deletes. Increasing the max segment size helps, but increases the amount of churn on disk through large merges.

Right, but that's a normal/acceptable index state, where up to 50% of your docs are deleted.

What this bug is about is cases where it's way over 50% of your docs that are deleted, and as far as I know, the only way to get yourself into that state is by doing a forceMerge and then continuing to update/delete documents.

asfimport commented 6 years ago

Michael Braun (@michaelbraun) (migrated from JIRA)

@mikemccand I thought this issue was about the case where you have segments that are effectively unmergeable and that stick around at < 50% deletes? We have seen this in our production systems, where these segments, which are at the segment size limit, stick around and not only waste disk resources but also throw off term frequencies, because the policy does not merge at the lower delete level. Would love a way to specify that segments which would normally be unmergeable should still be considered for operations in the event the number of deletes passes a (lower) threshold.

asfimport commented 6 years ago

Mike Sokolov (@msokolov) (migrated from JIRA)

Is it reasonable to modify the delete percentage in the policy while leaving the max in place?


asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

I linked in SOLR-7733 for dealing with the admin UI optimize button (I favor removing it entirely; make people put in some effort to back themselves into a corner).

re: read-only rather than optimize.....

It may be that the cases I've seen where users think optimize gives a speed improvement are really the result of squeezing out the deleted documents. Question for the Lucene folks: what would you guess the performance difference would be between

- a single 200G segment?
- 40 5G segments?

with no deleted documents in either case? I see indexes on disk at that size in the wild.

If the perf in the two cases above is "close enough" then freezing rather than optimize is an easier sell. The rest of this JIRA is about keeping the % deleted documents small, which, if we do, would handle the perf issues people get currently from forceMerge, assuming the above.

@msokolov The delete percentage isn't really the issue currently; if TMP respects the max segment size it can't merge two segments that each have > 50% live docs. If TMP were tweaked to merge unlike-sized segments when some % of deleted docs is exceeded in the large one (i.e. merge a segment with 4.75G live docs with a segment with 0.25G live docs) we could get there.

@mikemccand:

bq: Right, but that's a normal/acceptable index state, where up to 50% of your docs are deleted

Gotta disagree with "acceptable"; "normal" I'll grant. We're way past indexes being terabytes and on our way to petabytes. I have cases where clients are running out of physical room to add more disks. Saying that half your disk space can be occupied by deleted documents is a hard sell.

asfimport commented 6 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

"I think the main issue" ... I disagree; this issue is about freeing up many deleted docs. Uwe, feel free of course to create a Solr issue to rename "optimize" to "forceMerge" and to suggest where the Solr Ref Guide's wording is either bad or needs improvement. I think these are clearly separate from this issue.

Sorry, it is always caused by calling "optimize" or "forceMerge" at some point in the past. Doing this always brings the index into a state where the deletes sum up, because it's no longer in an ideal state for deleting and adding new documents. If you never call forceMerge/optifucke (sorry, "optimize" haha), the deletes won't automatically sum up, as TieredMergePolicy will merge them away. The deleted documents ratio is in most cases between 30 and 40% on the whole index in that case. But if you force merge it gets bad, and you sometimes sum up 80% deletes. The reason was described before.

And for that reason it is way important to remove "optimize" from Solr, THIS issue won't happen without "optifucke"! PERIOD.

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

There are two issues here that are a bit conflated, namely the consequences of forceMerge and having up to 50% of your index space used up by deleted docs:

1> If they do optimize/forcemerge/expungeDeletes, they're stuck. Totally agree that having a big red button makes that way too tempting. Even if it's removed, users can still use the optimize call from the SolrJ client and/or via the update handler. So one issue is whether there are ways to prevent the unfortunate consequences (the freeze idea, only optimizing into segments no bigger than the max segment size, etc.) or to recover somehow (some of the proposals above). Keeping the number of deleted docs lower would make pressing that button less tempting, but the button should still be removed. There are ways to forceMerge even if it's removed, though.

2> Even if they don't forcemerge/expungeDeletes, having 50% of the index consumed by deleted docs can be quite costly. Telling users that they have only two choices, 1> start and keep optimizing or 2> buy enough hardware that they can meet their SLAs with half their index space wasted is a hard sell. We have people who need 100s of machines in their clusters to hit their SLAs. Accepting up to 50% deleted docs as the norm means potentially millions of dollars in unnecessary hardware.

asfimport commented 6 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

There are plenty of use-cases for a forceMerge or optimize to be done in either special cases, or on a fixed schedule. It's a deficiency that the default merge policy can't deal more intelligently with that. Merge policies are pluggable though, so we may be able to deal with this at either the Lucene or Solr level. No need for 100% of all devs to agree ;-)

any segment with > X% deleted documents would be merged or rewritten NO MATTER HOW LARGE.

+1 for the idea... I haven't thought about all the ways it might interact with other things, but I like it in general. Segments with more than X% deleted docs will be candidates for merging. Max segment sizes will still be targeted of course, so if a segment's estimated size after merging with smaller segments is less than the max seg size, we're good. If not, merge it by itself (i.e. expungeDeletes).

asfimport commented 6 years ago

Varun Thacker (@vthacker) (migrated from JIRA)

There are two scenarios being discussed here: max-sized segments that accumulate lots of deletes during normal indexing, and oversized segments created by an explicit optimize/forceMerge.

Both are similar, but the latter has to do with users running the optimize command. Renaming the command on the Solr side, and other changes, are important there.

But the first scenario is what I've now seen at two clusters recently so I'd like to tackle this.

We have a default for what the max segment size should be, which is really nice. However I'm not convinced that adding a new setting which merges two segments when it reaches a delete threshold is a good idea. It works for this scenario but now we'll have a segment that's 8GB in size and then two 8GB segments will merge into a 14GB segment etc. The merge times will increase and over time that could potentially be harmful?

Instead, what if the delete threshold worked like this: if we can't find any eligible merges, pick a segment which is 5G in size and more than the threshold deletes and rewrite just that segment. So now the 5G segment will become 4G, effectively purging the deleted documents. Also keep a lower bound check so users can't set a delete threshold below 20%.

asfimport commented 6 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

However I'm not convinced that adding a new setting which merges two segments when it reaches a delete threshold is a good idea. It works for this scenario but now we'll have a segment that's 8GB in size and then two 8GB segments will merge into a 14GB segment etc.

That would be a bad idea, but I'm not sure anyone proposed that. Looks to me like what both Erick & I said was that the max segment size would still be respected.

Instead, what if the delete threshold worked like this: if we can't find any eligible merges, pick a segment which is 5G in size and more than the threshold deletes and rewrite just that segment. So now the 5G segment will become 4G, effectively purging the deleted documents. Also keep a lower bound check so users can't set a delete threshold below 20%.

It seems simpler to do what I proposed above: make the segment a candidate for merging. If no other segments can be merged with it while keeping the result under 5G, then it will be merged by itself. But it could also be merged with other segments if the resulting size is estimated to be under the cap. Looking back at Erick's rules as first proposed, it looks like the same thing actually (same result, just a different way of looking at it).

asfimport commented 6 years ago

Varun Thacker (@vthacker) (migrated from JIRA)

+1. I'll try to work on it in the next few days.

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

What Yonik said.

+1 to working up a patch. I actually think this is pretty important.

bq: Also keep a lower bound check so users can't set a delete threshold below 20%.

Don't know. This is another arbitrary decision that may or may not apply. Perhaps a strongly worded suggestion that this be the lower bound and a WARN message on startup if they specify <20%? 20% of a 10TB (aggregate across shards) index is still a lot. I don't have strong feelings here though.

Hmmm. If you have a setter like setMaxDeletePctBeforeSingletonMerge(double pct) then through reflection you can just specify <double name="maxDeletePctBeforeSingletonMerge">5</double> in the merge policy config and it'll automagically get picked up. Then we don't advertise it, making it truly expert....

bq: ...pick a segment which is 5G in size and more than the threshold deletes...

Minor refinement. Pick a segment with > 2.5G of "live" documents and > X% deleted docs and merge it. That way we merge a 4G segment with 20% deleted into a 3.2G segment. Rinse and repeat until it has < 2.5G live docs, at which point it's eligible for regular merging.

The sweet thing about this is that it would allow users to recover from an optimize. Currently if they do hit that big red button and optimize they can't recover deleted documents until that single huge segment has < 2.5G live docs. Something like this will keep rewriting that segment into smaller and smaller (though still large) segments and it'll eventually disappear. Mind you it'll be painful, but at least it'll eventually get there.

I'm not sure whether to make this behavior the default for TieredMergePolicy or not. Other than rewriting very large segments, the current policy is essentially this with X being 50%. Despite my comments above about keeping it reflection-only, WDYT about just making this explicit? That is, default a parameter like "largeSegmentMaxDeletePct" to 50?

And for a final thought, WDYT about Mike's idea of making optimize/forcemerge/expungeDeletes respect maxSegmentSize? I think we still need to rewrite segments as this JIRA proposes since the current policy can hover around 50% deleted. I'm lukewarm about making optimize respect max segment size since it would change that behavior, but I don't have strong feelings on it.

asfimport commented 6 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

And for a final thought, WDYT about Mike's idea of making optimize/forcemerge/expungeDeletes respect maxSegmentSize?

I think that would be great to be able to specify it per-operation. That way one could do minor or major forceMerges/optimizes on different schedules or for different reasons. The current maxSegSize could just be a default.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I don't think we can allow different max segment sizes for forced merges and natural merges; that's effectively the state we are in today and it causes the bug (case 1) we have here, because natural merging can't touch the too-big segments. I think we need to fix forceMerge, and findForcedDeletesMerges, to respect the maximum segment size, and if you really want a single segment and your index is bigger than 5 GB (default max segment size), you need to increase that maximum. This would solve case 1 (the "I ran forceMerge and yet continued updating my index" situation).

For case 2, if we also must solve the "even 50% deletions is too much for me" case (and I'm not yet sure we should... Lucene is quite good at skipping deleted docs during search), maybe we could simply relax TMP so that even max sized segments that have < 50% deletions are eligible for merging. Then, they would be considered for natural merging right off, and users could always (carefully!) tune up the reclaimDeletesWeight to more aggressively target segments with deletions.
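
A minimal sketch of that tuning, assuming the TieredMergePolicy API of this era (setMaxMergedSegmentMB and setReclaimDeletesWeight are real setters; the values are illustrative, not recommendations):

```java
import org.apache.lucene.index.TieredMergePolicy;

public class ReclaimDeletesTuning {
  public static TieredMergePolicy aggressiveDeleteReclaim() {
    TieredMergePolicy tmp = new TieredMergePolicy();
    tmp.setMaxMergedSegmentMB(5 * 1024);
    // Default is 2.0; higher values bias merge selection toward segments carrying many
    // deletes. Per the comment in the code quoted elsewhere in this thread, ~3.0 is
    // already close to the practical ceiling, so tune carefully.
    tmp.setReclaimDeletesWeight(3.0);
    return tmp;
  }
}
```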

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Mike:

bq: Lucene is quite good at skipping deleted docs during search....

That's not the nub of the issue for me. I'm seeing very large indexes; 200-300G is quite common lately on a single core. We have customers approaching 100T indexes in aggregate in single Solr collections. And that problem is only going to get worse as hardware improves, especially if Java's GC algorithms evolve to work smoothly with larger heaps. BTW, this is not theoretical: I have a client using Azul's Zing with Java heaps approaching 80G. It's an edge case to be sure, but similar setups will become more common.

So 50% deleted documents consumes a lot of resources, both disk and RAM when considered in aggregate at that scale. I realize that any of the options here will increase I/O, but that's preferable to having to provision a new data center because you're physically out of space and can't add more machines or even attach more storage to current machines.

bq: maybe we could simply relax TMP so that even max sized segments that have <50% deletions are eligible for merging

Just to be sure I understand this... Are you saying that we make it possible to merge, say, one segment with 3.5G and 5 other segments each 0.3G? That seems like it'd work.

That leaves finding a way out of what happens when someone actually does have a huge segment as a result of force merging. I know, I know, "don't do that" and "get rid of the big red optimize button in the Solr admin screen and stop talking about it!". I suppose your suggestion can tackle that too if we define an edge case in your "relax TMP so that...." idea to include a "singleton merge" if the result of the merge would be > max segment size.

Thanks for your input! Let's just say I have a lot more faith in your knowledge of this code than mine......

asfimport commented 6 years ago

Timothy M. Rodriguez (migrated from JIRA)

An additional place where deletions come up is in replica differences due to the way merging happened on a shard. This can cause jitter in results where the ordering will depend on which shard answered a query because the frequencies are off significantly enough. I know this problem will never go away completely as we can't flush away deletes immediately, but allowing some reclamation of deletes in large segments will help minimize the issue.

On max segment size, I also think the merge policy ought to dutifully respect maxSegmentSize. If we don't, other smaller bugs can come up for users, such as ulimits on file size, that they thought they were safely under.

asfimport commented 6 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

bq: If we don't, other smaller bugs can come up for users, such as ulimits on file size, that they thought they were safely under.

Max segment sizes are a target, not a hard guarantee... Lucene doesn't know exactly how big the segment will be before it actually completes the merge, and it can end up going over the limit.

asfimport commented 6 years ago

Timothy M. Rodriguez (migrated from JIRA)

I didn't know that! Thanks for pointing that out.

asfimport commented 6 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

The max segment size is great for a number of reasons:

The downside to a max segment size is that one can start getting many more segments than anticipated or desired (which can impact performance in unpredictable ways, depending on the exact usage). If a user specifically asks to forceMerge (i.e. they realized they have 200 segments and they want to bring that down to 20), then that should be respected.

asfimport commented 6 years ago

Shawn Heisey (@elyograg) (migrated from JIRA)

Very interesting discussion and problem.

If we ignore for a moment what TMP actually does, and back up to the design intent when the policy was made ... what would the designer have wanted to happen in the case of a segment that's considerably larger than the configured max size? It took me a while to find the right issue, which is #1929, work by @mikemccand.

I suspect that the current behavior, where a segment that's 20 times larger than the configured max segment size is ineligible for automatic merging until it reaches 97.5 percent deleted docs, was not actually what was desired. Indexes with a segment like that might not have even been considered when TMP was new. I don't see anything in #1929 that mentions it. I haven't checked all the later issues where changes to TMP were made.

So, how do we deal with this problem? I see three options. We can design an entirely new policy, and if its behavior becomes preferred, consider changing the default at a later date. We can change TMP so it behaves better with very large segments with no change in user code or config. We can add Erick's suggested option. For any of these options, improved documentation is a must.

The second option (and the latter half of the first option) carries one risk factor I can think of: users complaining about new behavior, in a similar manner to what I've heard about when the default directory was changed to MMAP.

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

There are more options, Shawn. It's a bug that we created this 20x too big segment to begin with. The merge policy is not configured to create a segment that big. @msokolov's suggestion about fixing that seems like the correct fix.

asfimport commented 6 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

It's not a bug, it's a feature. It's an explicit request that may or may not be a mistake on the part of the user, and it can certainly be a judgement call. Given that it's explicit and we don't know whether it is advisable or not, we should do what is requested.

The root cause of the problem here seems to be that we have only one variable (maxSegmentSize) and multiple use-cases we're forcing on it:

1) the max segment size that can be created automatically just by adding documents (this is maxSegmentSize currently)
2) the max segment size that can ever be created, even through explicit forceMerge (this is more for Tim's use-case... certain filesystems or transports may break if you go over certain limits)

There is no variable/setting for #2 currently, but we should not re-use the current maxSegmentSize for this as it conflates the two use-cases. Perhaps something like hardMaxSegmentSize?

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I don't agree it's a feature. The documentation for IndexWriter.forceMerge states:

Forces merge policy to merge segments until there are <= maxNumSegments. **The actual merges to be executed are determined by the MergePolicy.**

I bolded sentence two just for emphasis.

asfimport commented 6 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

The actual merges to be executed are determined by the MergePolicy.

And so then we go and look at the merge policy in question (TieredMergePolicy), which says:

 *  <p><b>NOTE</b>: This policy always merges by byte size
 *  of the segments, always pro-rates by percent deletes,
 *  and does not apply any maximum segment size during
 *  forceMerge (unlike {@link LogByteSizeMergePolicy}).

asfimport commented 6 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

The root cause of the problem here seems to be that we have only one variable (maxSegmentSize) and multiple use-cases we're forcing on it: 1) the max segment size that can be created automatically just by adding documents (this is maxSegmentSize currently) 2) the max segment size that can ever be created, even through explicit forceMerge (this is more for Tim's use-case... certain filesystems or transports may break if you go over certain limits)

Actually, looking at the other merge policy, LogByteSizeMergePolicy, it already has different settings for these two concepts/use-cases: maxMergeMB limits which segments may be merged during natural merging, and maxMergeMBForForcedMerge is the corresponding limit during forceMerge.
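
A small sketch of those two LogByteSizeMergePolicy setters (real API; the values are illustrative, not recommendations):

```java
import org.apache.lucene.index.LogByteSizeMergePolicy;

public class LogByteSizeTwoCaps {
  public static LogByteSizeMergePolicy twoCaps() {
    LogByteSizeMergePolicy mp = new LogByteSizeMergePolicy();
    // Largest segment considered for natural (background) merging; defaults to ~2 GB.
    mp.setMaxMergeMB(2 * 1024);
    // Separate cap that applies only during forceMerge; effectively unbounded by default.
    mp.setMaxMergeMBForForcedMerge(20 * 1024);
    return mp;
  }
}
```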

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This can cause jitter in results where the ordering will depend on which shard answered a query because the frequencies are off significantly enough.

Segment-based replication (http://blog.mikemccandless.com/2017/09/lucenes-near-real-time-segment-index.html) would improve this situation, in that the jitter no longer varies by shard since all replicas search identical point-in-time views of the index. It's also quite a bit more efficient if you need many replicas.

I suspect that the current behavior, where a segment that's 20 times larger than the configured max segment size is ineligible for automatic merging until 97.5 percent deleted docs, was not actually what was desired.

Right! The designer didn't think about this case because he didn't call forceMerge so frequently :)

Max segment sizes are a target, not a hard guarantee... Lucene doesn't know exactly how big the segment will be before it actually completes the merge, and it can end up going over the limit.

Right, it's only an estimate, but in my experience it's conservative, i.e. the resulting merged segment is usually smaller than the max segment size, but you cannot count on that.

The downside to a max segment size is that one can start getting many more segments than anticipated or desired (and can impact performance in unpredictable ways, depending on the exact usage).

Right, but the proposed solution (TMP always respects the max segment size) would work well for such users: they just need to increase their max segment size if they need to get a 10 TB index down to 20 segments.

So 50% deleted documents consumes a lot of resources, both disk and RAM when considered in aggregate at that scale.

Well, disks are cheap and getting cheaper. And 50% is the worst case: TMP merges those segments away once they hit 50%, so the net across the index is less than 50% deletions. Users already must have a lot of free disk space to accommodate running merges, pending refreshes, pending commits, etc.

Erick, are these timestamp'd documents? It's better to index those into indices that roll over with time (see how Elasticsearch recommends it: https://www.elastic.co/blog/managing-time-based-indices-efficiently), where it's far more efficient to drop whole indices than to delete documents in one index.

Still, I think it's OK to relax TMP so it will allow max sized segments with less than 50% deletions to be eligible for merging, and users can tune the deletions weight to force TMP to aggressively merge such segments. This would be a tiny change in the loop that computes tooBigCount.

The root cause of the problem here seems to be that we have only one variable (maxSegmentSize) and multiple use-cases we're forcing on it:

But how can that work?

If you have two different max sizes, then how can natural merging work with the too-large segments in the index due to a past forceMerge? It cannot merge them and produce a small enough segment until enough (too many) deletes accumulate on them.

Or, if we had two settings, we could insist that the maxForcedMergeSegmentSize is <= the maxSegmentSize but then what's the point :)

The problem here is forceMerge today sets up an index structure that natural merging is unable to cope with; having forceMerge respect the max segment size would fix that nicely. Users can simply increase that size if they want massive segments.

asfimport commented 6 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

But how can that work?

It will work as defined. For some, this will be worse and they should not have called forceMerge. For others, they knew what they were doing and it's exactly what they wanted. If you don't want 1 big segment, don't call forceMerge(1).

Or, if we had two settings, we could insist that the maxForcedMergeSegmentSize is <= the maxSegmentSize but then what's the point

See LogByteSizeMergePolicy which already works correctly and defaults to maxSegmentSize=2GB, maxForcedMergeSegmentSize=Long.MAX_VALUE

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Mike:

bq: The designer didn't think about this case

That's funny! If you only knew how many times "the designer" of some of my code "didn't think about...." well, a lot of things....

bq: Erick, are these timestamp'd documents?

Some are, some aren't. Time-series data is certainly amenable to rolling over, but I have clients with significantly different data sets that are not timestamped and that don't really work when trying to add shards for new time periods.

bq: And 50% is the worst case...

True, but in situations where

- the index is in the 200G range, implying 40 segments or so at the default max size
- random docs are replaced

it gets close enough to 50% for me to consider it a norm.

bq: disks are cheap and getting cheaper.

But space isn't. I also have clients who simply cannot expand their capacity due to space constraints. I know it sounds kind of weird in this age of AWS but it's true. Some organizations require on-prem servers, either through corporate policy or because they deal with sensitive information.

bq: Users already must have a lot of free disk space to accommodate running merges

Right, but that makes it worse. To store 1TB of "live" docs, I need an extra TB just to hold the index if it has 50% deleted docs, plus enough free space for ongoing merges. And aggregate indexes are rapidly approaching petabytes (not per shard of course, but.....)

This just looks to me like the natural evolution as Lucene gets applied to ever-bigger data sets. When TMP was designed (hey, I was alive then) sharding to deal with data sets we routinely deal with now was A Big Deal. Solr/Lucene (OK, I'll admit ES too) have gotten much better at dealing with much larger data sets, so it's time to revisit some of the assumptions, and here we are.....

I'll also add that for lots of clients, "just add more disk space" is a fine solution, one I recommend often. The engineering time wasted trying to work around a problem that would be solved with $1,000 of new disks makes me tear my hair out. And I'll add that I don't usually deal with clients that have tiny little 1T aggregate indexes much, so my view is a bit skewed. That said, today's edge case is tomorrow's norm.

And saying "tiny little 1T aggregate indexes" is, indeed, intended to be ironic.....

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

so it's time to revisit some of the assumptions, and here we are.....

Except I don't see Solr actually doing that. We've identified an actual root cause here (optimize), and there is pushback against fixing it: this is Solr stuck in its old ways with top-level fieldcaches and all the other stuff.

In the other case of many deletes, I just imagine the trappy Solr features that can create such a situation; delete-by-query and "atomic updates" come to mind.

So when will the root causes get fixed? Solr needs to fix this stuff. Let's not hack around it after the fact by making TieredMP more complex.

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

I completely agree that removing the Solr optimize button should be done, take that as read. I've linked that JIRA here. I think these two issues are interrelated. We need to give users some tools to control the percentage of deleted docs their index accumulates, and make it much less tempting to back themselves into a corner.

I do not and will not agree that all uses of forceMerge are invalid. Currently, one thing that contributes to their being overused is the percentage of deleted documents in the index. If a user notices that near 50% of the docs are deleted, what else can they do? expungeDeletes doesn't help here, it still creates a massive segment.

The other valid use case is an index that changes, say, once a day. forceMerge makes perfect sense here since it can be run every time the index is built and does result in some improvements in throughput. People squeezing 1,000s of QPS out of their system are pretty sensitive to any throughput increase they can get.

Making optimize less attractive or harder to use does not address the problem that TMP can (and does! see Mike's blog) accumulate up to 50% of the index as deleted documents during the normal course of an index's lifetime.

As for removing "trappy behavior" like delete-by-query or atomic updates, there are completely valid use cases where the entire index gets replaced gradually over time that would get us back into this situation even if those features were removed. And I can't imagine getting consensus that they should be removed.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> But how can that work?

It will work as defined. For some, this will be worse and they should not have called forceMerge. For others, they knew what they were doing and it's exactly what they wanted. If you don't want 1 big segment, don't call forceMerge(1).

But then the bug is not fixed? I.e. if we don't require forced merges and natural merges to respect the same segment size, then users who force merge and then insist on continuing to change the index can easily get themselves to segments with 97% deletions.

With a single enforced max segment size, even then users can still get into trouble if they really want to, e.g. by making it MAX_LONG, running forceMerge, and then reducing it back to 5 GB default again.

Or maybe we really should deprecate forceMerge and add a new forceMergeAndFreeze method...

See LogByteSizeMergePolicy which already works correctly and defaults to maxSegmentSize=2GB, maxForcedMergeSegmentSize=Long.MAX_VALUE

Just because an older merge policy did it this way does not mean we should continue to repeat the mistake. Two wrongs don't make a right!

I completely agree that removing the Solr optimize button should be done, take that as read.

+1; it's insane how tempting that button makes this dangerous operation. Who wouldn't want to "optimize" their index? Hell if my toaster had a button that looked like Solr's optimize button, I would press it every time I made toast!

I do not and will not agree that all uses of forceMerge are invalid. Currently, one thing that contributes to their being overused is the percentage of deleted documents in the index. If a user notices that near 50% of the docs are deleted, what else can they do? expungeDeletes doesn't help here, it still creates a massive segment.

But if we make the small change to allow max sized segments to be merged regardless of their % deletes then that should fix that reason for force merge?

There are two separate bugs here:

  1. If you force merge then keep updating you can get to segments with 97% deletes; fixing all force merges to respect max segment size fixes this.
  2. 50% is too many deleted docs for some use cases; fixing TMP to let the large segments be eligible for merging, always, plus maybe tuning up the existing reclaimDeletesWeight, fixes that.
asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

I worked up a patch for SOLR-7733, and just had the revelation that while I'm at it I can change references to "optimize" in the ref guide to "forceMerge" to see if that makes it less tempting to use.

bq: they really want to, e.g. by making it MAX_LONG, running forceMerge, and then reducing it back to 5 GB default again.

Right, but I'm completely unsympathetic in that case ;).

One question: do we have any perf statistics on an index with, say, 40 segments vs. 1 segment (assuming zero deleted docs)? I can run some up I guess...

forceMergeAndFreeze feels wrong to me. At that point the only option if they make a mistake is to re-index everything into another core/collection, right? Unless we have a way to un-freeze the index. Hmmmmm, maybe I'm coming around to that notion now, if there's a way to recover from a mistake without re-indexing everything. It sure would discourage forceMerging, wouldn't it?

forceMergeAndFreeze feels like a separate JIRA though, so I created one and linked it in here.

Proposal: Let's try Mike's suggestions and measure, i.e.:

- change TMP to allow large segments to be merged (respects max segment size, right?)
- require force merge to respect max segment size (assuming this does "singleton rewrites" of segments if there's nothing good to merge them with)

asfimport commented 6 years ago

Shawn Heisey (@elyograg) (migrated from JIRA)

It's a bug that we created this 20x too big segment to begin with.

I doubt that normal merging would create a 100GB segment with an unmodified TMP config. If that did happen, I would agree that it's a bug. I haven't heard of anyone with that problem.

Users that end up with very large segments are getting there in one of two ways: either by using IndexUpgrader, or by explicitly using forceMerge. No matter how often I recommend building a new index when upgrading, users still want to use their existing indexes. If they upgrade across more than one major version, we send them looking for IndexUpgrader.

In Solr, forceMerge is still named "optimize" ... something we are hoping to change, for the same reasons Lucene did. And even if Solr loses the optimize button in the web UI, many users are still going to do it with an explicit call to the API. I do it on my own indexes, but relatively infrequently: one large shard is optimized each night by my indexing software, so it takes several days for it to happen across the entire index. A single-segment index does perform better than one with dozens of segments. I have no sense of how great the performance boost is. I know that recent project wisdom says that the boost is not significant, but even a minimal difference can pay off big in how much query load an index can handle.

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

bq: I doubt that normal merging would create a 100GB segment with an unmodified TMP config

That doesn't happen. forceMerge does do this however. TMP does have the problem that when the segments are max sized (5G by default), they aren't merged until over 50% of the docs in them have been deleted.

bq: Either by using IndexUpgrader....

Yeah, this is probably another JIRA; we shouldn't do a forceMerge with IndexUpgrader, rather just rewrite the individual segments. One coming up....

I also saw that expungeDeletes creates large segments as well. FWIW.

asfimport commented 6 years ago

Varun Thacker (@vthacker) (migrated from JIRA)

Hi Mike,

50% is too many deleted docs for some use cases; fixing TMP to let the large segments be eligible for merging, always, plus maybe tuning up the existing reclaimDeletesWeight, fixes that.

I'm interested in tackling this use-case. This is what you had stated in a previous reply as a potential solution:

Still, I think it's OK to relax TMP so it will allow max sized segments with less than 50% deletions to be eligible for merging, and users can tune the deletions weight to force TMP to aggressively merge such segments. This would be a tiny change in the loop that computes tooBigCount.

So you are proposing changing this statement, `if (segBytes < maxMergedSegmentBytes/2.0)`, and making the 2.0 (50%) configurable? Wouldn't this mean that the segment sizes keep growing over time well beyond the max limit? Would it have downsides on the index in the long run in terms of performance?

asfimport commented 6 years ago

Varun Thacker (@vthacker) (migrated from JIRA)

Wouldn't this mean that the segment sizes keep growing over time well beyond the max limit

Looking at the code, this is not possible. I'll cook up a patch to make the weight in this check configurable: `if (segBytes < maxMergedSegmentBytes/2.0)`

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we don't need to add another tunable to TMP; I think the existing reclaimDeletesWeight should suffice, as long as we:

Modify the logic around tooBigCount, so that even too big segments are added to the eligible set, but they are still not counted against the allowedSegCount.

This way TMP is able to choose to merge e.g. a too big segment with 20% deletions, with lots of smaller segments. The thing is, this merge will be unappealing, since the sizes of the input segments are so different, but then the reclaimDeletesWeight can counteract that.

I'll attach a rough patch with what I mean ...

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Very rough, untested patch, showing how we could allow the "too big" segments into the eligible set of segments ... but we should test how this behaves around deletions once an index has too-big segments ... it could be the deletion reclaim weight is now too high!
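
The attached patch itself isn't reproduced in this thread, so the following is only an illustrative, simplified sketch of the relaxation being described (the SegStats type is hypothetical and this is not the actual TieredMergePolicy source): too-big segments stay in the eligible set, and the count is used only so allowedSegCount isn't inflated.

```java
import java.util.List;

public class TooBigRelaxationSketch {

  /** Hypothetical stand-in for a segment's on-disk size; not a Lucene type. */
  public static class SegStats {
    final long sizeBytes;
    SegStats(long sizeBytes) { this.sizeBytes = sizeBytes; }
  }

  /**
   * Counts the leading "too big" segments (at or above half the max merged segment size)
   * in a list sorted largest-first -- the same shape as the
   * if (segBytes < maxMergedSegmentBytes/2.0) check quoted elsewhere in this thread.
   * Under the relaxation, this count still keeps allowedSegCount from being inflated,
   * but the too-big segments are NOT dropped from the eligible set, so a merge that
   * reclaims their deletes can still be scored (and reclaimDeletesWeight can favor it).
   */
  public static int tooBigCount(List<SegStats> sortedBySizeDesc, long maxMergedSegmentBytes) {
    int tooBigCount = 0;
    while (tooBigCount < sortedBySizeDesc.size()
        && sortedBySizeDesc.get(tooBigCount).sizeBytes >= maxMergedSegmentBytes / 2.0) {
      tooBigCount++;
    }
    return tooBigCount;
  }
}
```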

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

forceMergeAndFreeze feels wrong to me. At that point the only option if they make a mistake is to re-index everything into another core/collection, right?

Or IndexWriter's addIndexes(Directory[]) which is quite efficient.

But yeah I agree this is a separate issue...
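
For reference, a minimal sketch of that addIndexes(Directory...) route (real API; the paths and analyzer are illustrative):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class AddIndexesExample {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/path/to/new-index")), iwc)) {
      // Copies segments from the existing index wholesale, with no re-analysis of documents,
      // which is why it is much cheaper than re-indexing from the original source.
      writer.addIndexes(FSDirectory.open(Paths.get("/path/to/old-index")));
      writer.commit();
    }
  }
}
```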

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

OK, let's see if I can summarize where we are on this:

1> make TMP respect maxMergedSegmentMB, even during forceMerge, unless maxSegments is specified (see <3>).

2> Add some documentation about how reclaimDeletesWeight can be used to tune the % deleted documents that will be in the index along with some guidance. Exactly how this should be set is rather opaque. It defaults to 2.0. The comment in the code is: "but be careful not to go so high that way too much merging takes place; a value of 3.0 is probably nearly too high". We need to keep people from setting it to 1000. Should we establish an upper bound with perhaps a warning if it's exceeded?

3> If people want the old behavior they have two choices:
3a> set maxMergedSegmentMB very high. This has the consequence of kicking in when normal merging happens; I think this is sub-optimal for the pattern where I index docs once a day and then want to optimize at the end, though.
3b> specify maxSegments = 1 during forceMerge. This will override any maxMergedSegmentMB settings.

<3b> is my attempt to reconcile the issue of wanting one huge segment but only when doing forceMerge. Yes, they can back themselves into the same corner they get into now by doing this, but this is acceptable IMO. We're not trying to make it impossible to get into a bad state, just trying to make it so users don't do it by accident.
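
A sketch of how <1> and <3b> would look at the Lucene API level under this proposal (the calls themselves are real API, but the "an explicit maxSegments overrides maxMergedSegmentMB" behavior is the proposal above, not necessarily what shipped; the path is illustrative):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.FSDirectory;

public class ForceMergeDownToOne {
  public static void main(String[] args) throws Exception {
    TieredMergePolicy tmp = new TieredMergePolicy();
    // <1>: the cap that natural merging (and, per the proposal, forced merging) respects.
    tmp.setMaxMergedSegmentMB(5 * 1024);
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer()).setMergePolicy(tmp);
    try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("/path/to/index")), iwc)) {
      // <3b>: explicitly asking for a single segment; under the proposal this explicit
      // maxSegments is what overrides maxMergedSegmentMB, so the caller is knowingly opting in.
      writer.forceMerge(1);
      writer.commit();
    }
  }
}
```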

Is this at least good enough to go on with until we see how it behaves?

Meanwhile, I'll check in SOLR-7733