
Make TieredMergePolicy respect maxSegmentSizeMB and allow singleton merges of very large segments [LUCENE-7976] #9025

Closed by asfimport 6 years ago

asfimport commented 6 years ago

We're seeing situations "in the wild" where very large indexes (on disk) are handled quite easily in a single Lucene index. This is particularly true as features like docValues move data into MMapDirectory space. The current TMP algorithm allows on the order of 50% deleted documents, as per a dev list conversation with Mike McCandless (and his blog here: https://www.elastic.co/blog/lucenes-handling-of-deleted-documents).

Especially in the current era of very large indexes in aggregate (think many TB), solutions like "you need to distribute your collection over more shards" become very costly. Additionally, the tempting "optimize" button exacerbates the issue: once you form, say, a 100G segment (by optimizing/forceMerging), it is not eligible for merging again until its live data shrinks below half the max segment size (2.5G with the current 5G default), i.e. until 97.5G of the docs in it are deleted.

The proposal here would be to add a new parameter to TMP, something like <maxAllowedPctDeletedInBigSegments> (no, that's not a serious name; suggestions welcome) which would default to 100 (i.e. the same behavior we have now).

So if I set this parameter to, say, 20%, and the max segment size stays at 5G, the following would happen when segments were selected for merging:

> any segment with > 20% deleted documents would be merged or rewritten NO MATTER HOW LARGE. There are two cases:
> - The segment has < 5G "live" docs. In that case it would be merged with smaller segments to bring the resulting segment up to 5G. If no smaller segments exist, it would just be rewritten.
> - The segment has > 5G "live" docs (the result of a forceMerge or optimize). It would be rewritten into a single segment removing all deleted docs, no matter how big it is to start. The 100G example above would be rewritten to an 80G segment, for instance.

Of course this would lead to potentially much more I/O, which is why the default would be the same behavior we see now. As it stands now, though, there's no way to recover from an optimize/forceMerge except to re-index from scratch. We routinely see 200G-300G Lucene indexes at this point "in the wild" with 10s of shards replicated 3 or more times. And that doesn't even include having these over HDFS.

Alternatives welcome! Something like the above seems minimally invasive. A new merge policy is certainly an alternative.
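To make the idea concrete, here is a rough sketch of how such a knob might be wired up. The setter name setMaxAllowedPctDeletedInBigSegments is just the placeholder from this proposal, not an existing API; everything else uses current TieredMergePolicy methods.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.TieredMergePolicy;

    public class TmpConfigSketch {
      public static IndexWriterConfig configure() {
        TieredMergePolicy tmp = new TieredMergePolicy();
        tmp.setMaxMergedSegmentMB(5 * 1024); // existing knob: the 5G default max merged segment size
        // Proposed knob (hypothetical, name TBD): any segment above this deleted-doc
        // percentage becomes eligible for merging or a singleton rewrite, no matter its size.
        // tmp.setMaxAllowedPctDeletedInBigSegments(20);
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergePolicy(tmp);
        return iwc;
      }
    }

With the proposed line commented out this is exactly today's behavior; enabling it (once implemented) would opt in to the extra rewrite I/O described above.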


Migrated from LUCENE-7976 by Erick Erickson (@ErickErickson), 7 votes, resolved Jun 17 2018. Attachments: LUCENE-7976.patch (versions: 13), SOLR-7976.patch.

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Mike:

Thanks. I'm off on vacation until the 22nd, then probably catching up the last part of that week; I'll take a look then. No doubt after leaving it alone for a couple of weeks I'll wonder why the heck I did that....

Erick

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

@mikemccand OK, I'm back at it.

------------ easy parts

I'll incorporate all your style suggestions, NP there. Except for twoMayHaveBeenMerged: that's not a typo, it's meant to cover the case where there could be one tiny segment and findForcedMerges legitimately combines that small segment with a larger one, even if there are no deletions, while still respecting maxMergedSegmentSizeMB. I've added a bit more explanation in the comments though; if it confused you it would confuse others. Maybe I'll be able to think of a better name.

> Can you change doFindMerges to not modify the incoming list...

I've done that. Pending the resolution to maxSegments, holdInCase may not be necessary, in which case copying to a local variable is unnecessary. I'm not sure I care given how expensive merging is in general; one extra list copy is probably immeasurable (and cleaner).

------hard part 1

> These parts still seem smelly to me:

Yeah, me too.

> Or, maybe we can change the problem: when you forceMerge to N segments, should we really allow violating the maxMergedSegmentMB constraint in that case?

Frankly I'd like to have exactly one valid value for maxSegments: 1. That is, the only choices would be to merge down to 1 segment or respect maxMergedSegmentMB. WDYT?

Another option is to relax what specifying maxSegments means to a "best effort": we would just calculate the max segment size based on the number of segments specified and not worry about meeting it exactly. Then the loop goes away. In that case, the 25% (or whatever) fudge factor isn't dangerous since there'd be no loop, but fudging it up isn't totally necessary either.

I do wonder who, if anybody, uses maxSegments except to avoid the issue that started this, that is, optimize creating a large segment that can accumulate lots of deleted docs. Has it outlived its usefulness?

---- hard part 2

And I think that's the line you were asking me about with this?

> But I think that's dangerous – won't this mean that we will fail to return a merge that could concurrently run with already merging segments, in the cascading case?

So the scenario here is that we have a long-running merge running on one thread and, even though we could be running a dozen other small merges we wouldn't run any of them until the big one was done, right?

OK, I'll put this back.

--------------- TBD then

So I think the only thing remaining from your review is what to do about maxSegments and that really horrible loop. Options in the order I like ;)

1> Remove the loop entirely. Calculate the theoretical segment size, add 25% (or whatever), and call doFindMerges once. The purpose of the loop is to handle the case where that calculation results in N+M segments when N was specified, because M additional segments were created that didn't pack well (your bin-packing characterization). Update the docs to say it's on a "best effort" basis (a small sketch of this calculation follows option <2> below).

This option preserves most of the intent of maxSegments without introducing the complexity/danger. And I think you're right, with enough segments we'd never get out of that loop.

2> Allow maxSegments to have only one valid value: 1. Otherwise we just respect maxMergedSegmentSizeMB. Users can approximate doing forceMerge with maxSegments values other than 1 by configuring maxMergedSegmentSizeMB to suit their wishes and doing a forceMerge without specifying maxSegments.
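To illustrate option <1> above, the "best effort" computation could be as small as this sketch (class, method, and variable names are mine, not from the patch):

    final class BestEffortSketch {
      // Option <1> sketch: derive a per-segment byte budget once, instead of
      // looping until doFindMerges produces exactly maxSegments segments.
      static long bestEffortMaxMergeBytes(long totalLiveBytes, int maxSegments) {
        long ideal = totalLiveBytes / maxSegments;
        // Pad by 25% so segments that don't pack perfectly still fit; the final
        // count may come out slightly above maxSegments, hence "best effort".
        return ideal + ideal / 4;
      }
    }

doFindMerges would then be called once with that budget, and the loop disappears.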


So all told this is closer than I feared when I saw how long your response was ;). If we decide to go with option <1> for maxSegments, all that would remain after that is putting back the bits that find merges even when there are merges currently running.

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Iteration N+1. This one removes the horrible loop that concerned @mikemccand, and good riddance to it. Also puts in all the rest of the changes so far.

2 out of 2,004 iterations of TestTieredMergePolicy.testPartialMerge failed because a forceMerge specified with maxSegments != 1 didn't produce the exact number of segments specified. I changed the test a bit to accommodate the fact that if we respect maxSegmentSize + 25% as an upper limit, then there are certainly some situations where the resulting segment count will not be exactly what's specified. Is this acceptable? It's the bin-packing problem.

And of course I realized that when the segment count is 1 there should be no ambiguity, which is why two patches were uploaded so close to each other.

Meanwhile I'll run another couple of thousand iterations and the whole precommit/test cycle again.

Pending more comments I think we're close.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @ErickErickson; I'll look at the new iteration soon!  Sounds promising :)

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Hit another failure (one in 2,004), chasing it now. Looks like I've messed up the hitTooLarge handling in some weird case. Don't think it'll be anything major, FYI.

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

The failure was a bad test assumption; only a test change was needed.

asfimport commented 6 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

+          // A singleton merge with no deletes makes no sense. We can get here when forceMerge is looping around...
+          if (candidate.size() == 1) {
+            SegmentSizeAndDocs segSizeDocs = segInfosSizes.get(candidate.get(0));
+            if (segSizeDocs.segDelDocs == 0) {
+              continue;
+            }
+          }

Should we check here whether segDelDocs is less than the threshold, rather than checking if there is at least one delete?

+          if (haveOneLargeMerge == false || bestTooLarge == false || mergeType == MERGE_TYPE.FORCE_MERGE_DELETES) {

I have a question about it; I might just not understand this well enough:

So if we have not seen a too-large merge but the best one is too large, we still add it? Is this correct? Don't we want to prevent these massive merges? I might just be missing something, sorry for being slow.

+      this.segDelDocs = maxDoc;

I do wonder about the naming here: why is this maxDoc? Should it be named delCount or so?

+    private final SegmentCommitInfo segInfo;
+    private final long segBytes;
+    private final int segDelDocs;
+    private final int segMaxDoc;
+    private final String segName;

Can I suggest removing the seg prefix? It's obvious from the name. I also think it should be delCount instead.

 if (haveWork == false) return null;

Can you please use parentheses around this?

  SegmentInfos infos =
        SegmentInfos.readLatestCommit(searcher.getIndexReader().directory());

In SegmentsInfoRequestHandler, Solr reads the SegmentInfos from disk, which will not result in accurate counts. I know this is a preexisting issue; I just want to point it out. IW will use the object identity of the reader's SegmentCommitInfo to look up its live-stats for NRT deletes etc.

I do like the tests; I would have loved to see them work without an IndexWriter. They should be real unit tests, not relying on stats in IW. Do you think you can still fix that easily? Not a blocker, just a bummer :/

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Simon:

Thanks, I'll be able to work on this sometime this week I hope.

I'm trying to keep "scope creep" from happening too much here, so I may add another JIRA or two for things that are pre-existing. I'll detail them when I incorporate your comments.

I won't consider committing this before the 7.4 version is cut, if for no other reason than I want this to get as much mileage as possible before it's included in a release.

6,000 iterations of TestTieredMergePolicy later and no failures, so I'm starting to feel optimistic.

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

@s1monw

> Should we check here whether segDelDocs is less than the threshold, rather than checking if there is at least one delete?

Not unless we redefine what forceMerge does. It's perfectly possible to have a segment at this point that's 4.999G with one document deleted. It'll be horribly wasteful, but it's no worse than what has always happened with forceMerge.

Outside of forceMerge, segments won't be eligible unless they have 10% deleted docs.

In the case of findMerges, I'm counting on the scoring mechanism to keep this from being a problem.

> So if we have not seen a too-large merge but the best one is too large, we still add it? Is this correct? Don't we want to prevent these massive merges?

This is awkward at present in that it preserves the old behavior: findForcedDeletesMerges has always allowed multiple large merges. I'm leaving that for a later JIRA.

In the other cases, this will prevent multiple large merges, because the first time we get a large merge, haveOneLargeMerge == false and bestTooLarge == true, so we create a large merge.

Thereafter, if bestTooLarge == true we'll avoid adding it.
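Put differently, the gating behaves like this tiny standalone illustration (my paraphrase of the logic described above, not the patch code; the findForcedDeletesMerges exception is omitted):

    public class LargeMergeGateDemo {
      public static void main(String[] args) {
        boolean haveOneLargeMerge = false;
        // bestTooLarge for four successive candidate merges:
        boolean[] candidates = {false, true, true, false};
        for (boolean bestTooLarge : candidates) {
          if (haveOneLargeMerge == false || bestTooLarge == false) {
            System.out.println("admit merge, tooLarge=" + bestTooLarge);
            haveOneLargeMerge |= bestTooLarge; // remember a max-sized merge is now queued
          } else {
            System.out.println("skip: a max-sized merge is already queued");
          }
        }
      }
    }

Only the second too-large candidate is skipped, exactly as described.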

> I do wonder about the naming here: why is this maxDoc? Should it be named delCount or so?

Brain fart, changed. I started out doing one thing then changed it without noticing that.

> Can I suggest removing the seg prefix? It's obvious from the name. I also think it should be delCount instead.

Done

> Can you please use parentheses around this?

Done

> In SegmentsInfoRequestHandler, Solr reads the SegmentInfos from disk, which will not result in accurate counts.

Good to know, is there a better way to go? I don't think total accuracy is necessary here.

> ...I would have loved to see them work without an IndexWriter... Do you think you can still fix that easily?

I have no idea ;) I saw the discussion at 8330 but didn't see any test conversions I could copy. I'll put up another version of this patch momentarily; if you could show me the pattern to use, I'll see what I can do. That said, if it's involved at all I'd like to put it in a follow-on JIRA.

@mikemccand This set of changes is purely style, no code changes. So unless there are objections, I'll commit it sometime next week.

asfimport commented 6 years ago

Simon Willnauer (@s1monw) (migrated from JIRA)

> I have no idea ;) I saw the discussion at 8330 but didn't see any test conversions I could copy. I'll put up another version of this patch momentarily; if you could show me the pattern to use, I'll see what I can do. That said, if it's involved at all I'd like to put it in a follow-on JIRA.

Let's do a follow-up.

> @mikemccand This set of changes is purely style, no code changes. So unless there are objections, I'll commit it sometime next week.

+1 to the patch

> Good to know, is there a better way to go? I don't think total accuracy is necessary here.

I do agree that it's not absolutely necessary, but it's confusing as hell if you call it and it tells you it should merge but it doesn't. I think that defeats the purpose of this API entirely. I just thought I should mention it.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks @ErickErickson; patch looks a lot better!  I'm glad that scary loop is gone.

A few small things:

 Can we remove this code:

          // A singleton merge with no deletes makes no sense. We can get here when forceMerge is looping around...                                                                    
          if (candidate.size() == 1) {
            SegmentSizeAndDocs segSizeDocs = segInfosSizes.get(candidate.get(0));
            if (segSizeDocs.delCount == 0) {
              continue;
            }
          }

If we fix the above loop to not add the singleton merge unless it has deletes?

Can you rename maxMergeAtonce --> maxMergeAtOnce?

Hmm shouldn't this code only run if the merge candidate is max sized (bestTooLarge)? I.e. change true to bestTooLarge?

            if (bestTooLarge) {
              haveOneLargeMerge = true;
            }

I think this logic might be buggy?

      boolean maxMergeIsRunning = false;
      if (mergeType == MERGE_TYPE.NATURAL) {
        maxMergeIsRunning = mergingBytes >= maxMergedSegmentBytes;
      }

E.g. if we have picked two merges to run, neither of which is the max segment size, but when added together they are over the max, then we incorrectly conclude maxMergeIsRunning?

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

> If we fix the above loop to not add the singleton merge unless it has deletes?

I don't think so. Since this is common code, it's quite possible that during forceMerge we assemble some segments that have no deletes, due to maxMergeAtOnceExplicit, that are still a fraction of maxMergedSegmentMB. These segments are eligible next pass to be merged even though they have no deleted documents. So we can't just omit them from the candidate list up-front.

> ...rename maxMergeAtonce to maxMergeAtOnce

Done. Autocomplete strikes again: one misspelling and it propagates.

> I.e. change true to bestTooLarge?

I've no objection, but what's the functional difference? Just making sure there's not a typo there.

> I think this logic is buggy?

The more I look the more I think it's always been buggy. Or at least should be restructured.

Check me out on this. As far as I can tell, mergingBytes would be exactly the same in the old code every time it was calculated. Every time through the loop for gathering the best merge, the code looks in the same infosSorted (which doesn't change), starts from the same point every time (tooBigCount, which doesn't change), and adds to mergingBytes if (and only if) the segment is in mergeContext.getMergingSegments() (which doesn't change).

mergingBytes really just asks whether the sizes of all the segments currently being merged total more than maxMergedSegmentBytes. So I'll make the new code do the same thing: I can calculate that value outside the loop and just set it once.

Or I'm missing something that'll be obvious when someone else points it out.
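For the record, that restructuring could look roughly like the following (my paraphrase, assuming infosSorted and the MergeContext are in scope as in TMP; the committed patch may differ, e.g. TMP pro-rates segment sizes by live docs):

    import java.io.IOException;
    import java.util.List;
    import java.util.Set;
    import org.apache.lucene.index.MergePolicy;
    import org.apache.lucene.index.SegmentCommitInfo;

    final class MergingBytesSketch {
      // None of these inputs change inside the candidate-selection loop, so
      // mergingBytes can be computed once, up front, rather than per iteration.
      static long computeMergingBytes(List<SegmentCommitInfo> infosSorted,
                                      MergePolicy.MergeContext mergeContext) throws IOException {
        Set<SegmentCommitInfo> merging = mergeContext.getMergingSegments();
        long mergingBytes = 0;
        for (SegmentCommitInfo info : infosSorted) {
          if (merging.contains(info)) {
            mergingBytes += info.sizeInBytes();
          }
        }
        return mergingBytes; // caller: maxMergeIsRunning = mergingBytes >= maxMergedSegmentBytes
      }
    }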

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

All tests pass, beasting 2,004 times succeeds, Mike's latest comments incorporated. So if my latest response to Mike is acceptable I'll commit over the weekend.

I'll beast it a bit more while I'm at it, since that's mostly waiting.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I don't think so. Since this is common code, it's quite possible that during forceMerge we assemble some segments that have no deletes, due to maxMergeAtOnceExplicit, that are still a fraction of maxMergedSegmentMB. These segments are eligible next pass to be merged even though they have no deleted documents. So we can't just omit them from the candidate list up-front.

Ahhh, you are right! So let's leave it in.

> I.e. change true to bestTooLarge?

> I've no objection, but what's the functional difference? Just making sure there's not a typo there.

Oh duh, not sure what I was thinking – there would be no functional difference ;) OK, maybe change to this?:

haveOneLargeMerge |= bestTooLarge;

> I think this logic is buggy?

> The more I look the more I think it's always been buggy. Or at least should be restructured.

Hmm I said might be buggy, but somehow when you quoted me it became is!

Anyway, the maxMergeIsRunning logic prevents picking a "max sized" merge if the total bytes being merged across all running merges is >= the max merged size, which I think is good. But I don't see where TMP is doing this same thing? We do compute mergingBytes, and pass it to score but otherwise seem not to use it?

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

> maybe change to this?:

Sure. Next patch.

> Anyway, the maxMergeIsRunning logic prevents picking a "max sized" merge if the total bytes being merged across all running merges is >= the max merged size, which I think is good.

Right, good to know we're talking about the same thing ;)

> But I don't see where TMP is doing this same thing? We do compute mergingBytes, and pass it to score but otherwise seem not to use it?

This is where I get lost. I looked at mergingBytes clear back to the first revision of this file, and mergingBytes has never been used in score(...), just passed as a parameter. I think Simon took it out as part of #9377.

All it's used for is to set maxMergeIsRunning to prevent a computed candidate from being used if a max merge is already running, just as you say. So the latest patch passes that boolean along to doFindMerges and does the same thing with it. Since the old code only seemed to care about this when doing a "natural" merge, the new code passes false from the other two cases.

How about this: if that makes no sense and we only think we're talking about the same thing, maybe hop on a Google hangout or something at your convenience and see if we can reconcile it all?

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> maybe hop on a Google hangout or something at your convenience and see if we can reconcile it all?

+1, just reach out when you have a chance?  I'm just confused where TMP is currently implementing this.

asfimport commented 6 years ago

Lucene/Solr QA (migrated from JIRA)

-1 overall

| Vote | Subsystem | Runtime | Comment |
| ---- | --------- | ------- | ------- |
|  | Prechecks |  |  |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|  | master Compile Tests |  |  |
| +1 | compile | 7m 27s | master passed |
|  | Patch Compile Tests |  |  |
| +1 | compile | 6m 58s | the patch passed |
| -1 | javac | 1m 49s | lucene_core generated 3 new + 1 unchanged - 0 fixed = 4 total (was 1) |
| +1 | Release audit (RAT) | 2m 27s | the patch passed |
| +1 | Check forbidden APIs | 1m 49s | the patch passed |
| +1 | Validate source patterns | 1m 49s | the patch passed |
| +1 | Validate ref guide | 1m 49s | the patch passed |
|  | Other Tests |  |  |
| +1 | unit | 43m 49s | core in the patch passed. |
| -1 | unit | 65m 50s | core in the patch failed. |
| -1 | unit | 13m 31s | solrj in the patch failed. |
|  |  | 148m 29s |  |

| Reason | Tests |
| ------ | ----- |
| Failed junit tests | solr.rest.TestManagedResourceStorage |
|  | solr.cloud.autoscaling.SearchRateTriggerIntegrationTest |
|  | solr.cloud.autoscaling.ScheduledMaintenanceTriggerTest |
|  | solr.client.solrj.impl.CloudSolrClientTest |
|  | solr.common.util.TestJsonRecordReader |

| Subsystem | Report/Notes |
| --------- | ------------ |
| JIRA Issue | LUCENE-7976 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12927680/LUCENE-7976.patch |
| Optional Tests | compile javac unit ratsources checkforbiddenapis validatesourcepatterns validaterefguide |
| uname | Linux lucene2-us-west.apache.org 4.4.0-112-generic #135-Ubuntu SMP Fri Jan 19 11:48:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | ant |
| Personality | /home/jenkins/jenkins-slave/workspace/PreCommit-LUCENE-Build/sourcedir/dev-tools/test-patch/lucene-solr-yetus-personality.sh |
| git revision | master / 2db2fb3 |
| ant version | Apache Ant(TM) version 1.9.6 compiled on July 8 2015 |
| Default Java | 1.8.0_172 |
| javac | https://builds.apache.org/job/PreCommit-LUCENE-Build/33/artifact/out/diff-compile-javac-lucene_core.txt |
| unit (solr_core) | https://builds.apache.org/job/PreCommit-LUCENE-Build/33/artifact/out/patch-unit-solr_core.txt |
| unit (solr_solrj) | https://builds.apache.org/job/PreCommit-LUCENE-Build/33/artifact/out/patch-unit-solr_solrj.txt |
| Test Results | https://builds.apache.org/job/PreCommit-LUCENE-Build/33/testReport/ |
| modules | C: lucene/core solr/core solr/solrj solr/solr-ref-guide solr/webapp U: . |
| Console output | https://builds.apache.org/job/PreCommit-LUCENE-Build/33/console |
| Powered by | Apache Yetus 0.7.0 http://yetus.apache.org |

This message was automatically generated.

asfimport commented 6 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I just chatted w/ @ErickErickson and indeed I was simply confused – the current TMP already has logic to not run multiple "max sized" merges, and so this patch isn't changing that.

+1 to push!

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

I'm not quite sure what's happening, but my two recent pushes don't seem to auto-add the git link to the JIRA.

Revision for master: 2519025fdafe55494448854c87e094b14f434b41

Revision for 7x: 9c4e315c1cb3495f400c179159836f568cd2989d

asfimport commented 6 years ago

Erick Erickson (@ErickErickson) (migrated from JIRA)

Thanks for all who helped!

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

git bisect points the finger at commit 2519025 on this issue for the reproducing failure noted at SOLR-12513.

asfimport commented 6 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

git bisect blames commits 2519025 & 9c4e315 on this issue for the reproducing failures noted at #9417.

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit a8a1cf8a88915eb42786e5e7d8a321130f67b689 in lucene-solr's branch refs/heads/branch_7x from @jpountz https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a8a1cf8

LUCENE-7976: Fix indentation.

asfimport commented 6 years ago

ASF subversion and git services (migrated from JIRA)

Commit 799d2acd88e2886f90ff276ccb8d443ca0268963 in lucene-solr's branch refs/heads/master from @jpountz https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=799d2ac

LUCENE-7976: Fix indentation.