Open asfimport opened 10 years ago
Shawn Heisey (@elyograg) (migrated from JIRA)
Patch against 4x that sets the new default for maxMergeCount to 8.
Michael McCandless (@mikemccand) (migrated from JIRA)
The purpose of maxMergeCount is to put back pressure on ongoing indexing when merges are falling behind. It's very bad when merges fall behind because you get too many segments in the index, searching slows down, PK (id) lookups slow down, too many file handles opened on NRT readers, etc.
The current default maxMergeCount (2) means that if 2 merges are already needed (one is running) and a 3rd merge shows up, then the incoming thread is stalled until the merges can catch up. Maybe we can increase it to 3, but I don't think we should go higher than that by default.
Maybe Solr can increase this limit temporarily while importing from JDBC? Or maybe we need a less "harsh" way to apply back-pressure, e.g. in Elasticsearch we force indexing to be single-threaded (not outright stopped) when merges can't keep up.
Do you know why merges can't keep up in your use case? E.g. are you throttling the merge IO?
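(For reference, a minimal sketch of what raising the limit looks like at the Lucene level, assuming a recent-Lucene IndexWriterConfig; the values are illustrative only, and Solr exposes the same two settings through the mergeScheduler element of indexConfig:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriterConfig;

public class MergeBackPressureExample {
    static IndexWriterConfig bulkLoadConfig() {
        // Let more merges queue up before incoming indexing threads are stalled,
        // while still running only one merge thread (friendly to spinning disks).
        ConcurrentMergeScheduler cms = new ConcurrentMergeScheduler();
        cms.setMaxMergesAndThreads(6, 1); // maxMergeCount = 6, maxThreadCount = 1
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergeScheduler(cms);
        return iwc;
    }
}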
Shawn Heisey (@elyograg) (migrated from JIRA)
New patch with javadoc updates and CHANGES.txt. I put the change under Optimizations, but I don't know if that's the right place.
Is this a change that we should be making, or do we have a really good reason other than thread count to have such a low default?
Shawn Heisey (@elyograg) (migrated from JIRA)
Do you know why merges can't keep up in your use case? E.g. are you throttling the merge IO?
I have the TMP equivalent of mergeFactor 35, and I'm importing 16 million docs from MySQL into each shard. The final shard size is over 18GB. I've seen the same thing happen to others with the default mergeFactor. ramBufferSizeMB is 48. I have no throttling config. The index is on a RAID10 volume comprised of six 1TB SATA disks with a 1MB stripe size, so it's not slow. It just takes several minutes to merge at the gigabyte scale.
Recently I added autoCommit at 25000 docs with openSearcher=false, which I think does reduce the size of each initial segment a little bit, but I have not tried again with the default maxMergeCount. I've had mine at 6 for years now, and others have had their import problems fixed with that setting.
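(For context, a rough Lucene-level equivalent of the settings described above, assuming recent-Lucene APIs; the numbers mirror that description and are not a recommendation:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TieredMergePolicy;

public class ImportSettingsExample {
    static IndexWriterConfig importConfig() {
        // "mergeFactor 35" expressed through TieredMergePolicy's two knobs.
        TieredMergePolicy tmp = new TieredMergePolicy();
        tmp.setMaxMergeAtOnce(35);
        tmp.setSegmentsPerTier(35);
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setMergePolicy(tmp);
        iwc.setRAMBufferSizeMB(48); // matches the ramBufferSizeMB mentioned above
        return iwc;
    }
}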
Shawn Heisey (@elyograg) (migrated from JIRA)
The purpose of maxMergeCount is to put back pressure on ongoing indexing when merges are falling behind.
This is definitely something that I had not considered. Solr DIH is probably the one place where this back pressure can be considered a bad thing by default. If we decide that the best solution does involve changes to Solr DIH rather than the scheduler, I will move the issue to the Solr project.
At the very least, I think it's a good idea to add maxMergeCount to the example-DIH indexConfig section and include a comment.
Shawn Heisey (@elyograg) (migrated from JIRA)
An idea for a change to solr/example/example-DIH.
Michael McCandless (@mikemccand) (migrated from JIRA)
+1 to fix DIH's example configuration, but also to commit your javadoc improvements to CMS.setMaxMergesAndThreads.
Shai Erera (@shaie) (migrated from JIRA)
+1 to not change the default in CMS, and commit jdocs, but I have two minor comments: (1) maybe change "lower than the default of 8" to "lower than the default of {@value ...}"? (2) instead of the literal <= you can just use the &le; entity?
Shawn Heisey (@elyograg) (migrated from JIRA)
The javadoc changes that I made do need to change again if we don't also make the code changes.
I think the new javadoc needs to be the following:
/**
 * Sets the maximum number of merge threads and simultaneous merges allowed.
 *
 * @param maxMergeCount the max # simultaneous merges that are allowed.
 *        If a merge is necessary yet we already have this many
 *        threads running, the incoming thread (that is calling
 *        add/updateDocument) will block until a merge thread
 *        has completed. If index data is coming from a source that is
 *        sensitive to inactivity timeouts (like JDBC), it is advisable to
 *        set this value higher than default so that the incoming thread
 *        never stops. Note that we will only run the smallest
 *        <code>maxThreadCount</code> merges at a time.
 * @param maxThreadCount the max # simultaneous merge threads that should
 *        be running at once. This must be <= <code>maxMergeCount</code>.
 *        Most setups should use the default value of 1 here.
 *        If the index is on Solid State Disk and there are
 *        plenty of CPU cores available, it is usually safe to
 *        run more threads simultaneously.
 */
I did notice the following comment in the 4x branch, but this has not been my experience with Solr. Older versions seemed to prefer running the largest merge to completion before doing the smaller ones. The behavior described here would be preferable. If the comment is accurate, does anyone know when it changed? I originally ran into my problem back on Solr 1.4.1 (Lucene 2.9), and I am pretty sure that some of the people I've helped on the mailing list and IRC were running some 4.x version, so I am not sure that this comment is accurate even for 4.x:
// Max number of merge threads allowed to be running at
// once. When there are more merges then this, we
// forcefully pause the larger ones, letting the smaller
// ones run, up until maxMergeCount merges at which point
// we forcefully pause incoming threads (that presumably
// are the ones causing so much merging).
Michael McCandless (@mikemccand) (migrated from JIRA)
I did notice the following comment in the 4x branch
Wait, this comment should also be in trunk?
Older versions seemed to prefer running the largest merge to completion before doing the smaller ones.
Hmm that should not have been the case; if you turn on IW infoStream, CMS tells you when it's pausing a large merge (confusingly, via the abort method) to let smaller merges finish.
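(A minimal sketch of turning on the infoStream at the Lucene level, assuming a recent-Lucene IndexWriterConfig; Solr exposes this through an infoStream setting under indexConfig:)

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.util.PrintStreamInfoStream;

public class InfoStreamExample {
    static IndexWriterConfig withInfoStream() {
        // Route IndexWriter diagnostics, including CMS pause/unpause messages, to stdout.
        IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
        iwc.setInfoStream(new PrintStreamInfoStream(System.out));
        return iwc;
    }
}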
Shawn Heisey (@elyograg) (migrated from JIRA)
Wait, this comment should also be in trunk?
It very likely is in trunk. I was just trying to be precise about where I actually looked, in case trunk says something slightly different and others happen to be looking too.
Hmm that should not have been the case; if you turn on IW infoStream, CMS tells you when it's pausing a large merge
I've never actually done this. I will turn on infostream and start a new full rebuild of the entire 96 million document index. Those infostreams will be available after several hours, and they ought to be very large.
I just know that when importing millions of records from a database, if you don't increase maxMergeCount, the incoming thread will stall long enough for JDBC to kill the connection. If smaller merges were really running first, then it seems like we would never be over the threshold long enough for the connection to die – my smallest merges would probably complete in less than a second, and the next size up would only take a few seconds. When I first noticed the problem, I clocked one merge-caused indexing pause at over eight minutes.
Shawn Heisey (@elyograg) (migrated from JIRA)
I do see evidence in the infostream that I'm currently creating that merges are done out of order with preference to small merges.
IW 4 [Sun May 25 09:43:57 MDT 2014; Lucene Merge Thread #11]: merge time 47224 msec for 563274 docs
IW 4 [Sun May 25 09:52:39 MDT 2014; Lucene Merge Thread #13]: merge time 8761 msec for 68640 docs
IW 4 [Sun May 25 09:53:44 MDT 2014; Lucene Merge Thread #12]: merge time 266527 msec for 4227876 docs
When I was having the problem I described (which was admittedly a long time ago, Solr 1.4.0 most likely), I was using the old default, LogByteSizeMergePolicy. Would that have been using CMS, or a different scheduler? When no scheduler is configured in Solr 4.x, does it choose CMS? I would think that it does.
I have seen others have this problem very recently on the mailing list and IRC. I'm reasonably sure that at least one of them was on a 4.x release. Bumping up maxMergeCount has fixed it for those people, just like it did for me. The evidence that's right before my eyes would suggest that nobody should still be having any problems like this, assuming that what they are getting by default is the ConcurrentMergeScheduler.
Shawn Heisey (@elyograg) (migrated from JIRA)
... unless users are actually running into a situation with 4 or 5 merges scheduled at the same time ... in which case the first couple would proceed very quickly, but they would still have much larger merges queued up.
Shai Erera (@shaie) (migrated from JIRA)
Orthogonal to this issue, but it sounds like you're doing a large initial import from JDBC? Maybe you only do a single import even? In that case, maybe it's better if you disable merges entirely during import, then turn merges back on and call maybeMerge (I'm not sure if Solr has a command to do that though) with a MergeScheduler that runs more than the default number of concurrent merges? Just an idea...
Shawn Heisey (@elyograg) (migrated from JIRA)
but it sounds like you're doing a large initial import from JDBC? Maybe you only do a single import even?
I do imports on all of my shards at once. There are 96 million docs total right now, growing at a rate of a few million per year. One shard (which we call the incremental, but is better known as a "hot" shard) has only the newest few hundred thousand docs in it. The rest of the docs are split between the other six shards using a MySQL CRC32 calculation on the database primary key, modulo 6. In production, each of two servers holds three cold shards, and the second server also holds the hot shard. This means that for the long haul of a full rebuild, each server is doing three imports at the same time. This is not running in cloud mode.
DIH is only used for full rebuilds. I have a SolrJ program that does normal updates and starts/monitors/finishes the rebuilds.
A strong possibility right now is that with my current settings (4.7.2, explicitly configured with TMP, CMS, and an effective mergeFactor of 35), I would no longer run into this issue on my production hardware that has fairly fast disks. Grepping for the "merge time" lines so far only shows the one overlap which I pasted above.
With the more frequent merging inherent in the default mergeFactor of 10, other users might have a bigger chance of running into a problem, especially with single or mirrored 7200RPM disks.
My dev hardware has slower disks (7200RPM RAID1) and houses all seven shards on one server. Rebuilds take nearly twice as long there as they do on the production hardware - rebuilds on that hardware are definitely I/O bound. The infostream that is now building is being done on the production hardware.
Shai Erera (@shaie) (migrated from JIRA)
It depends on what you would like to achieve. If you import documents that amount to 100 segments and care only about the end result, i.e. a merged index (per the MP settings), then I am not sure it will matter much if you first import w/o merging and then call maybeMerge(). But if you care about how fast DIH finishes importing, and are willing to let merges run in the background while e.g. the index is searched, then disabling merges while you import data will improve latency in that respect.
When I experimented with building indexes on Hadoop, I always disabled merges while the index was built, and executed a special job afterwards. This prevented a lot of copying around HDFS. Not saying this is your case, but sometimes it's useful to turn off merges, when you're executing batch-oriented jobs.
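(A minimal Lucene-level sketch of that pattern: build with merges disabled, then reopen and merge once at the end. It assumes recent-Lucene APIs, a hypothetical index path, and that nothing searches the index until the second pass finishes:)

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.index.TieredMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class BatchBuildExample {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(Paths.get("/path/to/build-index")); // hypothetical path

        // Pass 1: bulk load with merging disabled; every flush simply becomes a new segment.
        IndexWriterConfig loadCfg = new IndexWriterConfig(new StandardAnalyzer());
        loadCfg.setMergePolicy(NoMergePolicy.INSTANCE);
        try (IndexWriter writer = new IndexWriter(dir, loadCfg)) {
            // addDocument(...) calls for the whole import would go here
            writer.commit();
        }

        // Pass 2: reopen with a real merge policy and let it consolidate the segments.
        IndexWriterConfig mergeCfg = new IndexWriterConfig(new StandardAnalyzer());
        mergeCfg.setMergePolicy(new TieredMergePolicy());
        try (IndexWriter writer = new IndexWriter(dir, mergeCfg)) {
            writer.maybeMerge(); // or writer.forceMerge(n) for a fixed segment count
            writer.commit();
        }
    }
}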
Shawn Heisey (@elyograg) (migrated from JIRA)
disabling merges while you import data will improve latency in that respect.
If I had a Lucene program, turning off merging would likely be a very simple thing to do. With Solr, is that possible to change without filesystem (solrconfig.xml) modification, and without restarting Solr or reloading cores? If it is, I could do an optimize as the last step of a full rebuild. The lack of merging during the rebuild, followed by an optimize at the end, would probably be faster than what happens now. If I have to change the config and restart/reload, then this is not something I can implement – anyone who has access can currently kick off a rebuild simply by changing an entry in a MySQL database table. The SolrJ program notices this and starts all the dataimport handlers in the build cores. Managing filesystem changes from a Java program across multiple machines is not something I want to try. If I switched to SolrCloud, config changes are relatively easy using the zkCli API, but switching to SolrCloud would actually lead to a loss of functionality in my index.
Once the index is built, my SolrJ program does a full optimize on one cold shard per day, so it takes six days for the whole index. The hot shard is optimized once an hour – only takes about 30 seconds.
Shawn Heisey (@elyograg) (migrated from JIRA)
Here's the summary of merge results for one of the indexes, after the rebuild was done:
IW 4 [Sun May 25 08:34:02 MDT 2014; Lucene Merge Thread #0]: merge time 38437 msec for 413954 docs
IW 4 [Sun May 25 08:39:58 MDT 2014; Lucene Merge Thread #1]: merge time 34488 msec for 411844 docs
IW 4 [Sun May 25 08:46:12 MDT 2014; Lucene Merge Thread #2]: merge time 6705 msec for 61045 docs
IW 4 [Sun May 25 08:53:23 MDT 2014; Lucene Merge Thread #3]: merge time 54341 msec for 623054 docs
IW 4 [Sun May 25 08:59:38 MDT 2014; Lucene Merge Thread #4]: merge time 9369 msec for 88050 docs
IW 4 [Sun May 25 09:07:22 MDT 2014; Lucene Merge Thread #5]: merge time 53734 msec for 625095 docs
IW 4 [Sun May 25 09:12:40 MDT 2014; Lucene Merge Thread #6]: merge time 10407 msec for 95045 docs
IW 4 [Sun May 25 09:20:03 MDT 2014; Lucene Merge Thread #7]: merge time 47114 msec for 560845 docs
IW 4 [Sun May 25 09:24:39 MDT 2014; Lucene Merge Thread #8]: merge time 5368 msec for 46523 docs
IW 4 [Sun May 25 09:31:26 MDT 2014; Lucene Merge Thread #9]: merge time 51475 msec for 619516 docs
IW 4 [Sun May 25 09:36:28 MDT 2014; Lucene Merge Thread #10]: merge time 9420 msec for 88276 docs
IW 4 [Sun May 25 09:43:57 MDT 2014; Lucene Merge Thread #11]: merge time 47224 msec for 563274 docs
IW 4 [Sun May 25 09:52:39 MDT 2014; Lucene Merge Thread #13]: merge time 8761 msec for 68640 docs
IW 4 [Sun May 25 09:53:44 MDT 2014; Lucene Merge Thread #12]: merge time 266527 msec for 4227876 docs
IW 4 [Sun May 25 09:56:07 MDT 2014; Lucene Merge Thread #14]: merge time 38959 msec for 495135 docs
IW 4 [Sun May 25 10:06:31 MDT 2014; Lucene Merge Thread #15]: merge time 32033 msec for 410559 docs
IW 4 [Sun May 25 10:14:07 MDT 2014; Lucene Merge Thread #16]: merge time 7521 msec for 54797 docs
IW 4 [Sun May 25 10:21:12 MDT 2014; Lucene Merge Thread #17]: merge time 48044 msec for 576053 docs
IW 4 [Sun May 25 10:27:41 MDT 2014; Lucene Merge Thread #18]: merge time 6843 msec for 62448 docs
IW 4 [Sun May 25 10:34:33 MDT 2014; Lucene Merge Thread #19]: merge time 44991 msec for 619962 docs
IW 4 [Sun May 25 10:40:08 MDT 2014; Lucene Merge Thread #20]: merge time 11078 msec for 118848 docs
IW 4 [Sun May 25 10:46:40 MDT 2014; Lucene Merge Thread #21]: merge time 54392 msec for 643896 docs
IW 4 [Sun May 25 10:52:17 MDT 2014; Lucene Merge Thread #22]: merge time 7091 msec for 74945 docs
IW 4 [Sun May 25 11:00:10 MDT 2014; Lucene Merge Thread #23]: merge time 44073 msec for 584655 docs
IW 4 [Sun May 25 11:09:57 MDT 2014; Lucene Merge Thread #24]: merge time 5769 msec for 49129 docs
IW 4 [Sun May 25 11:21:50 MDT 2014; Lucene Merge Thread #25]: merge time 307003 msec for 4128767 docs
IW 4 [Sun May 25 11:22:31 MDT 2014; Lucene Merge Thread #25]: merge time 41087 msec for 463915 docs
IW 4 [Sun May 25 11:27:01 MDT 2014; Lucene Merge Thread #26]: merge time 12255 msec for 107006 docs
IW 4 [Sun May 25 11:39:36 MDT 2014; Lucene Merge Thread #27]: merge time 44532 msec for 618865 docs
IW 4 [Sun May 25 11:48:01 MDT 2014; Lucene Merge Thread #28]: merge time 8192 msec for 82499 docs
IW 4 [Sun May 25 12:00:37 MDT 2014; Lucene Merge Thread #29]: merge time 54516 msec for 775824 docs
IW 4 [Sun May 25 12:12:46 MDT 2014; Lucene Merge Thread #30]: merge time 9692 msec for 101961 docs
IW 4 [Sun May 25 12:19:33 MDT 2014; Lucene Merge Thread #31]: merge time 51258 msec for 732080 docs
IW 4 [Sun May 25 12:25:20 MDT 2014; Lucene Merge Thread #32]: merge time 11955 msec for 124069 docs
IW 4 [Sun May 25 12:34:20 MDT 2014; Lucene Merge Thread #33]: merge time 57059 msec for 743397 docs
IW 4 [Sun May 25 12:40:12 MDT 2014; Lucene Merge Thread #34]: merge time 7408 msec for 71889 docs
IW 4 [Sun May 25 12:48:40 MDT 2014; Lucene Merge Thread #35]: merge time 47083 msec for 628885 docs
IW 4 [Sun May 25 13:02:48 MDT 2014; Lucene Merge Thread #36]: merge time 282123 msec for 4761885 docs
IW 4 [Sun May 25 13:02:58 MDT 2014; Lucene Merge Thread #36]: merge time 9565 msec for 103121 docs
IW 4 [Sun May 25 13:11:26 MDT 2014; Lucene Merge Thread #37]: merge time 30681 msec for 426626 docs
IW 4 [Sun May 25 13:20:44 MDT 2014; Lucene Merge Thread #38]: merge time 30638 msec for 408589 docs
IW 4 [Sun May 25 13:28:14 MDT 2014; Lucene Merge Thread #39]: merge time 4735 msec for 42766 docs
IW 4 [Sun May 25 13:36:49 MDT 2014; Lucene Merge Thread #40]: merge time 51305 msec for 622337 docs
IW 4 [Sun May 25 13:45:10 MDT 2014; Lucene Merge Thread #41]: merge time 8094 msec for 79872 docs
IW 4 [Sun May 25 13:52:23 MDT 2014; Lucene Merge Thread #42]: merge time 48678 msec for 640757 docs
IW 4 [Sun May 25 13:59:05 MDT 2014; Lucene Merge Thread #43]: merge time 11398 msec for 92616 docs
One problem with no merging is the number of open files. Assuming each of the merges listed above takes 35 segments down to one, the index drops by 34 segments per merge, for a net difference of nearly fifteen hundred segments, with 72 segments remaining when indexing finishes. That's a LOT of files. If I were to turn on useCompoundFile, it would greatly reduce the file count and make the idea manageable ... but I'm curious about whether the compound file results in lower real-world performance.
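(If it helps, a hedged sketch of the knob involved: in recent Lucene the compound-file decision lives on the merge policy, so "always use compound files" is roughly the following; the size cap can also be adjusted via setMaxCFSSegmentSizeMB:)

import org.apache.lucene.index.TieredMergePolicy;

public class CompoundFileExample {
    static TieredMergePolicy alwaysCompound() {
        // Keep merged segments in compound (.cfs) format regardless of size,
        // trading some indexing overhead for far fewer open files.
        TieredMergePolicy tmp = new TieredMergePolicy();
        tmp.setNoCFSRatio(1.0); // 1.0 = always write compound files
        return tmp;
    }
}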
I will attach the INFOSTREAM file that I grepped to get the output above.
Shawn Heisey (@elyograg) (migrated from JIRA)
Attaching a zipfile with the infostream from one of my shards during a full index rebuild with DIH.
Michael McCandless (@mikemccand) (migrated from JIRA)
CMS has been the default for a long time now (even back when LogMP was the default). TMP won't change things that much for the append-only case.
I think even on fast disks your merging can fall behind: it's a question of whether the indexing threads can produce segments faster than merging can consolidate them. Also, the amount of free RAM that the OS can use for readahead on the files opened for merging can have a big impact on merge performance.
If you search "pausing thread" and "unpausing thread" you should see it pausing the largest merge(s) when more than one are running. Search for "too many merges; stalling..." to see when the harsh back-pressure kicks in.
Doing all merging in the end is somewhat dangerous; you should only do it if you know you will do no searching on the index until the merging has completed. I suspect it will net/net make indexing take longer because you are not soaking up concurrency during indexing to get merges done.
Net/net it's really important that Lucene doesn't allow too many segments in the index; the "harsh" back-pressure Lucene applies today (hard stall of ALL indexing threads) is effective but ... harsh.
If we improved CMS to make this behavior "optional", so that by default it continued its effective-but-harsh-back-pressure, but then an app (Solr, ES) could instead do its own thing (ES throttles down to one indexing thread instead of 0 that Lucene does), then Solr could do something similar here. Maybe open a new issue for that? (Hmm: is Solr using multiple indexing threads in your case...?).
Shawn Heisey (@elyograg) (migrated from JIRA)
Doing all merging in the end is somewhat dangerous; you should only do it if you know you will do no searching on the index until the merging has completed.
I actually can guarantee this when I do a full rebuild. For every one of my shards, I have a build core and a live core. Full rebuilds happen on the build cores, and I can be sure that no searching will take place until after that import is done and the SolrJ program swaps it with the live core. If it's possible in Solr to enable and disable merging on the fly for each core, then that might be a viable path. I would need to change my post-import processes a little bit – I run a delete process against the new index, and the current delete process does a query to check for document presence first. I'd have to add a boolean option to the method so it would be able to do the deletes blindly.
I don't even open a new searcher until the full import is done. I don't let DIH do the commit, I handle that. I do have the realtime get handler enabled (which does open a new searcher on every autoCommit), but that almost never actually sees requests. On the build cores, I think I can safely say that it would actually never happen.
(Hmm: is Solr using multiple indexing threads in your case...?)
I doubt it. In 1.x and 3.x I did have the DIH threads option enabled, but in 4.x the option was removed. Even in my SolrJ program, there is only ever one thread making requests to any core. Most requests end up on the hot shard, which is why I have a hot shard.
The default value for maxMergeCount in ConcurrentMergeScheduler is 2. This causes problems for Solr's dataimport handler when very large imports are done from a JDBC source.
What happens is that when three merge tiers are scheduled at the same time, the add/update thread will stop for several minutes while the largest merge finishes. In the meantime, the dataimporter JDBC connection to the database will time out, and when the add/update thread resumes, the import will fail because the ResultSet throws an exception. Setting maxMergeCount to 6 eliminates this issue for virtually any size import – although it is theoretically possible to have that many simultaneous merge tiers, I've never seen it.
As long as maxThreads is properly set (the default value of 1 is appropriate for most installations), I cannot think of a really good reason that the default for maxMergeCount should be so low. If someone does need to strictly control the number of threads that get created, they can reduce the number. Perhaps someone with more experience knows of a really good reason to make this default low?
I'm not sure what the new default number should be, but I'd like to avoid bikeshedding. I don't think it should be Integer.MAX_VALUE.
Migrated from LUCENE-5705 by Shawn Heisey (@elyograg), updated May 26 2014 Attachments: dih-example.patch, infostream-s0build-shard.zip, LUCENE-5705.patch (versions: 2)