apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Factor merge policy out of IndexWriter [LUCENE-847] #1922

Closed asfimport closed 17 years ago

asfimport commented 17 years ago

If we factor the merge policy out of IndexWriter, we can make it pluggable. That lets apps choose a custom merge policy and makes it easier to experiment with merge policy variants.
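To illustrate what "pluggable" could look like, here is a minimal sketch. It is not the API from the attached patches: `SketchMergePolicy`, `MergeLastN` and `SketchWriter` are hypothetical names, and segments are reduced to plain document counts. The writer only asks the policy which contiguous range of segments to merge, so the policy can be swapped without touching the writer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical strategy interface (not Lucene's API): given the current
// segment sizes, pick a contiguous range of segments to merge.
interface SketchMergePolicy {
    /** Returns {start, endExclusive}, or null if no merge is needed. */
    int[] findMerge(List<Integer> segmentSizes);
}

// Example policy: merge the newest `factor` segments once that many exist.
class MergeLastN implements SketchMergePolicy {
    private final int factor;

    MergeLastN(int factor) {
        if (factor < 2) throw new IllegalArgumentException("factor must be >= 2");
        this.factor = factor;
    }

    public int[] findMerge(List<Integer> sizes) {
        if (sizes.size() < factor) return null;
        return new int[] { sizes.size() - factor, sizes.size() };
    }
}

// Hypothetical stand-in for the writer: it owns the segments but
// delegates every merge decision to the policy it was given.
class SketchWriter {
    private final List<Integer> segments = new ArrayList<>();
    private final SketchMergePolicy policy;

    SketchWriter(SketchMergePolicy policy) { this.policy = policy; }

    void flushSegment(int docCount) {
        segments.add(docCount);
        int[] range;
        while ((range = policy.findMerge(segments)) != null) {
            List<Integer> sub = segments.subList(range[0], range[1]);
            int merged = 0;
            for (int s : sub) merged += s;
            sub.clear();                      // drop the merged-away segments
            segments.add(range[0], merged);   // replace them with one segment
        }
    }

    List<Integer> segments() { return new ArrayList<>(segments); }
}
```

An experiment with a different policy then only needs another `SketchMergePolicy` implementation passed to the constructor.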


Migrated from LUCENE-847 by Steven Parkes, resolved Sep 18 2007

Attachments: concurrentMerge.patch, LUCENE-847.patch.txt (versions: 2), LUCENE-847.take3.patch, LUCENE-847.take4.patch, LUCENE-847.take5.patch, LUCENE-847.take6.patch, LUCENE-847.take7.patch, LUCENE-847.take8.patch, LUCENE-847.txt

Linked issues:

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch (take 7).

I folded in Ning's comments (above) and Yonik's comments from #1920, added javadocs & fixed Javadoc warnings and fixed two other small issues. All tests pass on Linux, OS X, win32, with either SerialMergeScheduler or ConcurrentMergeScheduler as the default.

I plan to commit in a few days' time...

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

Access of mergeThreads in ConcurrentMergeScheduler.merge() should be synchronized.
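As an illustration of the concern, a minimal sketch of guarding a shared thread list follows. It is not the code from the patch: `SketchConcurrentScheduler`, `activeMergeCount` and `sync` are hypothetical names. The point is only that every read and write of `mergeThreads` happens while holding the same monitor, because both the caller of `merge()` and the finishing merge threads touch the list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scheduler sketch: all access to mergeThreads is
// performed while holding the scheduler's own monitor.
class SketchConcurrentScheduler {
    private final List<Thread> mergeThreads = new ArrayList<>();

    void merge(Runnable mergeWork) {
        Thread t = new Thread(() -> {
            try {
                mergeWork.run();
            } finally {
                // Inside a lambda, `this` is still the scheduler instance.
                synchronized (this) {
                    mergeThreads.remove(Thread.currentThread());
                    notifyAll();   // wake anyone blocked in sync()
                }
            }
        });
        synchronized (this) {
            mergeThreads.add(t);   // register before the thread can finish
        }
        t.start();
    }

    synchronized int activeMergeCount() {
        return mergeThreads.size();
    }

    // Blocks until every registered merge thread has finished.
    synchronized void sync() {
        boolean interrupted = false;
        while (!mergeThreads.isEmpty()) {
            try {
                wait();
            } catch (InterruptedException e) {
                interrupted = true;   // remember and keep waiting
            }
        }
        if (interrupted) Thread.currentThread().interrupt();
    }
}
```

Without the `synchronized` blocks, the unsynchronized `ArrayList` could be modified by a finishing merge thread while another thread is adding to or reading it.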

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Ahh, good catch. Will fix!

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

Hmm, it's actually possible to have concurrent merges with SerialMergeScheduler.

Making SerialMergeScheduler.merge synchronize on SerialMergeScheduler will serialize all merges. A merge can still be concurrent with a ram flush.

Making SerialMergeScheduler.merge synchronize on IndexWriter will serialize all merges and ram flushes.
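The difference between the two options can be sketched as follows. This is a simplified illustration, not the patch: `SketchIndexWriter`, `SketchSerialScheduler` and the `LockOn` switch are hypothetical, and it assumes a flush takes the writer's monitor.

```java
// Hypothetical stand-in for the writer; flush() takes the writer's monitor.
class SketchIndexWriter {
    synchronized void flush() {
        // write buffered documents to a new segment
    }
}

// Hypothetical scheduler that can serialize merges on either lock.
class SketchSerialScheduler {
    enum LockOn { SCHEDULER, WRITER }

    private final LockOn lockOn;

    SketchSerialScheduler(LockOn lockOn) { this.lockOn = lockOn; }

    void merge(SketchIndexWriter writer, Runnable mergeWork) {
        // SCHEDULER: merges exclude each other, but writer.flush() can
        //            still run in another thread during a merge.
        // WRITER:    merges exclude each other and also exclude flush().
        Object lock = (lockOn == LockOn.SCHEDULER) ? this : writer;
        synchronized (lock) {
            mergeWork.run();
        }
    }
}
```

`Thread.holdsLock(writer)` called from inside the merge work is false in the `SCHEDULER` variant and true in the `WRITER` variant, which is exactly why only the second one blocks a concurrent flush.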

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Hmm, it's actually possible to have concurrent merges with SerialMergeScheduler.

This was actually intentional: I thought it fine if the application is sending multiple threads into IndexWriter to allow merges to run concurrently. Because, the application can always back down to a single thread to get everything serialized if that's really required?

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

> This was actually intentional: I thought it fine if the application is sending multiple threads into IndexWriter to allow merges to run concurrently. Because, the application can always back down to a single thread to get everything serialized if that's really required?

Today, applications use multiple threads on IndexWriter to get some concurrency on document parsing. With this patch, applications that want concurrent merges would simply use ConcurrentMergeScheduler, no?

Or a rename since it doesn't really serialize merges?

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I have to triple check, but at first glance, my app's performance halved using the ConcurrentMergeScheduler on a recent Core Duo with 2 GB RAM (as compared to the SerialMergeScheduler). Seems unexpected?

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Today, applications use multiple threads on IndexWriter to get some concurrency on document parsing. With this patch, applications that want concurrent merges would simply use ConcurrentMergeScheduler, no?

True. OK I will make SerialMergeScheduler.merge serialized. This way only one merge can happen at a time even when the application is using multiple threads.

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I have to triple check, but at first glance, my app's performance halved using the ConcurrentMergeScheduler on a recent Core Duo with 2 GB RAM (as compared to the SerialMergeScheduler). Seems unexpected?

Whoa, that's certainly unexpected! I'll go re-run my perf test.

asfimport commented 17 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Looks like some anomalous tests. Last night I checked twice, but today results are: 58 to 48 in favor of Concurrent. I am going to assume my first results were invalid. Sorry for the noise and thanks for the great patch. Has passed quite a few stress tests I run on my app without any problems so far. Do both merge policies allow for a closer to constant add time or is it just the Concurrent policy?

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Looks like some anomalous tests. Last night I checked twice, but today results are: 58 to 48 in favor of Concurrent. I am going to assume my first results were invalid. Sorry for the noise and thanks for the great patch.

OK, phew!

> Has passed quite a few stress tests I run on my app without any problems so far.

I'm glad to hear that :) Thanks for being such an early adopter!

> Do both merge policies allow for a closer to constant add time or is it just the Concurrent policy?

Not sure I understand the question – you mean addDocument? Yes it's only ConcurrentMergeScheduler that should keep addDocument calls constant time, because SerialMergeScheduler will hijack the addDocument thread to do its merges.
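The "hijack" can be made concrete with a small sketch. None of this is the patch's code: `SketchScheduler`, `InlineScheduler`, `BackgroundScheduler` and `SketchThreads` are hypothetical names. The inline variant runs the merge on whichever thread called it (for example a thread inside addDocument), while the background variant hands the work to a new thread and returns immediately.

```java
// Hypothetical scheduler interface, for illustration only.
interface SketchScheduler {
    /** Returns the background thread that runs the merge, or null if it ran inline. */
    Thread merge(Runnable mergeWork);
}

// Serial flavour: the calling thread does the merge itself, so the
// caller is blocked for as long as the merge takes.
class InlineScheduler implements SketchScheduler {
    public synchronized Thread merge(Runnable mergeWork) {
        mergeWork.run();
        return null;
    }
}

// Concurrent flavour: the merge runs on its own thread and the caller
// returns right away.
class BackgroundScheduler implements SketchScheduler {
    public Thread merge(Runnable mergeWork) {
        Thread t = new Thread(mergeWork, "sketch-merge");
        t.start();
        return t;
    }
}

// Small helper so callers can wait for a background merge without
// dealing with InterruptedException.
final class SketchThreads {
    private SketchThreads() {}

    static void joinQuietly(Thread t) {
        boolean interrupted = false;
        while (t.isAlive()) {
            try {
                t.join();
            } catch (InterruptedException e) {
                interrupted = true;   // remember and keep waiting
            }
        }
        if (interrupted) Thread.currentThread().interrupt();
    }
}
```

With the inline variant, the cost of an occasional large merge lands on one unlucky caller; with the background variant, the caller's latency no longer includes the merge itself.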

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached take8, incorporating Ning's feedback plus some small refactoring and fixing one case where optimize() would do an unnecessary merge.