apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

[PATCH] IndexWriter.maybeMergeSegments() takes lots of CPU resources [LUCENE-388] #1466

Closed asfimport closed 17 years ago

asfimport commented 19 years ago

Note: I believe this to be the same situation with 1.4.3 as with SVN HEAD.

Analysis with the hprof utility during index creation with many documents shows that the CPU spends a large portion of its time in IndexWriter.maybeMergeSegments(), which seems to be a 'waste' compared with other valuable CPU-intensive operations such as tokenization etc.

Using the following test snippet to retrieve some rows from the db and create an index:

    Analyzer a = new StandardAnalyzer();
    writer = new IndexWriter(indexDir, a, true);
    writer.setMergeFactor(1000);
    writer.setMaxBufferedDocs(10000);
    writer.setUseCompoundFile(false);
    connection = DriverManager.getConnection(
            "jdbc:inetdae7:tower.aconex.com?database=<somedb>", "secret",
            "squirrel");
    String sql = "select userid, userfirstname, userlastname, email from userx";
    LOG.info("sql=" + sql);
    Statement statement = connection.createStatement();
    statement.setFetchSize(5000);
    LOG.info("Executing sql");
    ResultSet rs = statement.executeQuery(sql);
    LOG.info("ResultSet retrieved");
    int row = 0;

    LOG.info("Indexing users");
    long begin = System.currentTimeMillis();
    while (rs.next()) {
        int userid = rs.getInt(1);
        String firstname = rs.getString(2);
        String lastname = rs.getString(3);
        String email = rs.getString(4);
        String fullName = firstname + " " + lastname;
        Document doc = new Document();
        doc.add(Field.Keyword("userid", userid+""));
        doc.add(Field.Keyword("firstname", firstname.toLowerCase()));
        doc.add(Field.Keyword("lastname", lastname.toLowerCase()));
        doc.add(Field.Text("name", fullName.toLowerCase()));
        doc.add(Field.Keyword("email", email.toLowerCase()));
        writer.addDocument(doc);
        row++;
        if((row % 100)==0){
            LOG.info(row + " indexed");
        }
    }
    double end = System.currentTimeMillis();
    double diff = (end-begin)/1000;
    double rate = row/diff;
    LOG.info("rate:" +rate);

On my 1.5 GHz PowerBook with 1.5 GB RAM and a 5400 RPM drive, the CPU is maxed out, and I get an indexing rate of 490-515 documents/second over 10 successive runs.

By applying a simple patch to IndexWriter (to be attached shortly), which defers maybeMergeSegments() so that it is only called every 2000 additions (an arbitrary figure), I get a new rate of 945-970 documents/second. Using Luke to look inside the indexes created by the two versions there does not appear to be any difference: same number of Documents, same number of Terms.

I'm not suggesting one should apply this patch, I'm just highlighting the difference in performance that this sort of change gives you.

We are about to use Lucene to index 4 million construction document records, and so speeding up the indexing process is in our best interest! :) If one considers the amount of CPU time spent in maybeMergeSegments over the initial index creation of 4 million documents, I think one could see how it would be ideal to try to speed this area up (at least move the bottleneck to IO).

I would appreciate anyone taking a moment to comment on this.


Migrated from LUCENE-388 by Paul Smith, 2 votes, resolved Aug 17 2006 Environment:

Operating System: Mac OS X 10.3
Platform: Macintosh

Attachments: ASF.LICENSE.NOT.GRANTED--IndexWriter.patch, ASF.LICENSE.NOT.GRANTED--log.optimized.deep.txt, ASF.LICENSE.NOT.GRANTED--log.optimized.txt, ASF.LICENSE.NOT.GRANTED--log-compound.txt, ASF.LICENSE.NOT.GRANTED--lucene.34930.patch, ASF.LICENSE.NOT.GRANTED--Lucene Performance Test - with & without hack.xls, doron_2_IndexWriter.patch, doron_2b_IndexWriter.patch, doron_IndexWriter.patch, yonik_indexwriter.diff (versions: 2)

asfimport commented 19 years ago

Paul Smith (migrated from JIRA)

Created an attachment (id=15042) HPROF log of test run pre-patch

Added HPROF log showing high CPU usage inside IndexWriter.maybeMergeSegments() method

asfimport commented 19 years ago

Paul Smith (migrated from JIRA)

Created an attachment (id=15043) Demonstration Patch to defer the call to IndexWriter.maybeMergeSegments() for every 2000 times.

asfimport commented 19 years ago

Daniel Naber (migrated from JIRA)

Isn't the effect the same as setting mergeFactor to 2000, i.e. indexing gets
faster but more RAM is needed?

asfimport commented 19 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

(In reply to comment #3)
> Isn't the effect the same as setting mergeFactor to 2000, i.e. indexing gets faster but more RAM is needed?

On every add, it looks like the entire segment list is walked to see if enough docs (minMergeDocs) can be collected together to do a merge. With a large minMergeDocs, this can get expensive. Perhaps a count should be kept of the number of docs in memory, and when it exceeds minMergeDocs, the merge logic should be called.

Of course keeping track of the count would require modifications to many IndexWriter methods. It looks like the performance gains may well be worth it though.

BTW, I don't think a mergeFactor of 1000 is typical (way too many file descriptors in use, and a big hit to searchers). A high minMergeDocs (now maxBufferedDocs) is typical and useful though.

asfimport commented 19 years ago

cutting@apache.org (@cutting) (migrated from JIRA)

Your benchmark might run faster if you set maxBufferedDocs smaller. Also, it doesn't look like you're including the cost of closing the IndexWriter in your benchmark statistics. You should, as, with such a large buffer, you've delayed much of the work to that point.

The bottleneck you're hitting is that maybeMergeDocs sums the size of the buffered indexes each time to decide whether to merge. When you have thousands buffered, this dominates.

To optimize this case (small docs, large maxBufferedDocs) we could keep count of the number of documents buffered by adding a bufferedDocCount field. addDocument could increment this, mergeSegments could decrement it, and maybeMergeSegments could check it with something like:

    if (targetMergeDocs == minMergeDocs) {
      mergeDocs = bufferedDocCount;
    } else {
      while (--minSegment >= 0) {
        ...
      }
    }

Does that make sense?
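The idea can be sketched as a small standalone model (this is not Lucene's IndexWriter; the class and method names below are invented for illustration): each buffered document is a one-doc segment, and a bufferedDocCount field answers "is a merge due?" in O(1) instead of re-summing the trailing single-doc segments on every add.

    import java.util.ArrayList;
    import java.util.List;

    // Toy model of the proposal, not Lucene code.
    class MergeCheckModel {
        private final List<Integer> segmentDocCounts = new ArrayList<Integer>(); // newest last
        private final int minMergeDocs;   // plays the role of minMergeDocs/maxBufferedDocs
        private int bufferedDocCount = 0; // the proposed counter field

        MergeCheckModel(int minMergeDocs) {
            this.minMergeDocs = minMergeDocs;
        }

        // Original style: walk the tail of the segment stack, summing the
        // single-doc segments on every add - O(buffered docs) per document.
        boolean shouldMergeByScanning() {
            int mergeDocs = 0;
            for (int i = segmentDocCounts.size() - 1; i >= 0; i--) {
                if (segmentDocCounts.get(i) >= minMergeDocs) break; // reached an older, big segment
                mergeDocs += segmentDocCounts.get(i);
            }
            return mergeDocs >= minMergeDocs;
        }

        // Proposed style: the counter answers the same question in O(1).
        boolean shouldMergeByCounter() {
            return bufferedDocCount >= minMergeDocs;
        }

        void addDocument() {
            segmentDocCounts.add(1);      // each added document becomes a one-doc segment
            bufferedDocCount++;
            if (shouldMergeByCounter()) { // same decision the scan would have reached
                mergeBufferedSegments();
            }
        }

        private void mergeBufferedSegments() {
            // collapse the trailing single-doc segments into one segment
            int merged = 0;
            while (!segmentDocCounts.isEmpty()
                    && segmentDocCounts.get(segmentDocCounts.size() - 1) < minMergeDocs) {
                merged += segmentDocCounts.remove(segmentDocCounts.size() - 1);
            }
            segmentDocCounts.add(merged);
            bufferedDocCount = 0;         // in the real writer, mergeSegments would decrement this
        }
    }

As the later comments in this issue show, the bookkeeping is the hard part: the counter has to stay correct across optimize(), deleted documents, and separate writer sessions.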

asfimport commented 19 years ago

Paul Smith (migrated from JIRA)

>> Your benchmark might run faster if you set maxBufferedDocs smaller. Also, it doesn't look like you're including the cost of closing the IndexWriter in your benchmark statistics. You should, as, with such a large buffer, you've delayed much of the work to that point.

Yes, by not factoring the optimize()/close() call into the rate calculation, there is still 'work to be done' at the end, but that would only be the tail end of the remaining docs stored in memory, right? When indexing millions of records, this is probably not going to be a large percentage of the overall time, as it would be, at most, the last maxBufferedDocs that need to be tidied up. Or have I confused myself? I'm still quite new to Lucene and its inner workings.

>> The bottleneck you're hitting is that maybeMergeDocs sums the size of the buffered indexes each time to decide whether to merge. When you have thousands buffered, this dominates.

Yes, when maxBufferedDocs is relatively high (which is useful, I thought, if you have the memory to throw at the application and want to stay off IO as much as possible), the work per buffer fill ends up something like this, where n is maxBufferedDocs:

n + (n-1) + (n-2) + ... + 1 = n(n+1)/2

i.e. on the order of n²/2 segment checks before each merge. (I'm no math wiz, sorry.) You can notice this when indexing and outputting logging information: the rate slows down slightly as docs are added to the in-memory buffer, then once the automatic merge is performed the rate speeds up, then progressively slows down again.

>> To optimize this case (small docs, large maxBufferedDocs) we could keep count of the number of documents buffered by adding a bufferedDocCount field. addDocument could increment this, mergeSegments could decrement it, and maybeMergeSegments could check it with something like:
>>
>>     if (targetMergeDocs == minMergeDocs) {
>>       mergeDocs = bufferedDocCount;
>>     } else {
>>       while (--minSegment >= 0) {
>>         ...
>>       }
>>     }
>>
>> Does that make sense?

Err, to be honest I'm not quite sure what you mean by "small docs" in the above first statement. I'm also a little confused on the:

>> if (targetMergeDocs == minMergeDocs) {

and how it relates to the bufferedDocCount you mention.

In my hack/patch I'm effectively keeping track of the number of documents added as you suggest, so I believe we're pretty close to the same thing, but I blame only having one coffee on trying to understand it. :) I think I like where you're going though, it smells right.

Something I thought about last night is that the current code works fine for all cases, however the 'clean index being rebuilt' seems to get the raw end of the stick. When an index is being incrementally or batch updated, IndexWriter probably does need to scan the Segments to see if they need merging to obey the configuration settings. However a fresh index is a special case, and seems like it could be optimized (this may be what you meant by 'small docs' ?).

In really large scale indexes/indices(?) the number of documents being incrementally/batch updated is going to be totally minor compared to the cost of building the full index for the first time. For a true High Availability solution, one will always need to factor in the cost to rebuild the index from scratch should it come to that. Making the initial index as fast as possible gives much smaller down time.

For crawling applications that use Lucene, this optimization will probably not even get noticed, because of the latency in retrieving the source material to make up the document. Only when the source material can be accessed fast will this optimization be important.

Removing this CPU bottleneck has even more benefits for solutions that use Lucene to index documents in parallel. The optimization multiplies its benefit as more CPUs are utilized, particularly if the architecture is a producer/consumer pipeline with a buffer in between (obviously IO starts to get in the way more). With the optimization, the CPU can be better utilized performing tokenization etc.

cheers,

Paul

asfimport commented 19 years ago

Paul Smith (migrated from JIRA)

Doug, after more testing, you are correct that tweaking the maxBuffered from 10,000 to 5000 gives slightly better performance.

However I believe this is because with maxBufferedDocs==5000, this loop counter is reset more frequently, and so suffers less of the CPU drain.

I ran more tests using the hacked version using both 5000 and 10000, and the hacked version still runs better. I'll attach an Excel sheet with the results, but I only had time to run it over 5 samples each run, which is not exactly statistically significant, but hopefully still correct.

As you can see, deferring this loop enables larger maxBuffered to gain ahead of smaller maxBufferedDocs.

As always with these tests there are transient factors that affect things (getting DB results for one thing).

asfimport commented 19 years ago

Paul Smith (migrated from JIRA)

Created an attachment (id=15053) Excel file containing results over small samples between hacked and non-hacked code using different maxBufferedDocs settings

Attached Excel file with some results

asfimport commented 18 years ago

Paul Smith (migrated from JIRA)

Created an attachment (id=15137) Patch that hopefully is what Doug means

Doug,

I could not resolve in my mind exactly what you meant in your example code snippet, but here's a patch that I believe accomplishes the same thing. It's merely a renaming of the variable, and it uses the 'mergeSegments decrements' concept that you mention, which I think is a good idea.

This patch still doesn't speed up the actual detection of which segments on the stack to merge when it does go into that loop. I need to re-baseline the performance measurements with this patch to see where the CPU is now spending most of its time, as it may now be somewhere else entirely.

NOTE: all current unit tests still pass with this patch.

asfimport commented 18 years ago

Paul Smith (migrated from JIRA)

Created an attachment (id=15139) HPROF output of the same test with the patch applied

This HPROF shows the CPU savings of the patch.

In summary, the original was this:

    1  24.14% 24.14%  859 300342 org.apache.lucene.index.IndexWriter.maybeMergeSegments
    2   4.83% 28.97%  172 300306 java.lang.Throwable.fillInStackTrace
    3   4.07% 33.04%  145 300336 org.apache.lucene.index.SegmentInfos.info
    4   2.50% 35.54%   89 300505 org.apache.lucene.store.RAMInputStream.readInternal
    5   2.33% 37.88%   83 300299 java.util.Vector.addElement
    6   2.16% 40.04%   77 300334 org.apache.lucene.document.Field.readerValue
    7   2.14% 42.17%   76 300272 java.net.SocketInputStream.socketRead0
    8   1.40% 43.58%   50 300598 org.apache.lucene.util.PriorityQueue.downHeap
    9   1.15% 44.73%   41 300617 org.apache.lucene.util.PriorityQueue.downHeap
    10  1.01% 45.74%   36 300581 java.io.RandomAccessFile.writeBytes
    .....

And after the patch is applied it becomes:

    1   5.45%  5.45%  130 300287 java.lang.Throwable.fillInStackTrace
    2   5.20% 10.65%  124 300277 java.net.SocketInputStream.socketRead0
    3   3.86% 14.51%   92 300515 org.apache.lucene.store.RAMInputStream.readInternal
    4   3.27% 17.79%   78 300332 java.util.Vector.addElement
    5   2.35% 20.13%   56 300548 java.io.RandomAccessFile.writeBytes
    6   2.22% 22.36%   53 300305 org.apache.lucene.document.Field.readerValue
    7   1.85% 24.20%   44 300595 org.apache.lucene.util.PriorityQueue.downHeap
    8   1.72% 25.92%   41 300580 org.apache.lucene.util.PriorityQueue.downHeap
    9   1.51% 27.43%   36 300645 java.net.SocketInputStream.socketRead0
    10  1.43% 28.86%   34 300284 org.apache.lucene.store.BufferedIndexOutput.<init>
    11  1.43% 30.29%   34 300562 java.lang.Object.clone
    12  1.17% 31.46%   28 300346 java.io.StringReader.read
    13  1.01% 32.47%   24 300363 org.apache.lucene.index.DocumentWriter.writeNorms

(Note: The Socket reading element is the JDBC driver retrieving the results).

cheers,

Paul

asfimport commented 18 years ago

Mario Ivankovits (migrated from JIRA)

Where do the java.lang.Throwable.fillInStackTrace come from?

asfimport commented 18 years ago

hoschek (migrated from JIRA)

Most likely the IOExceptions stem from FastCharStream of StandardAnalyzer, at least that's my experience. See the header of your hprof file to check if that's the case.

JavaCC has a gotcha in that it "likes" IOExceptions as part of the normal control flow on stream termination. I have performance patches for this that make FastCharStream 2-10 times faster for small inputs (it doesn't matter on large inputs), but I never bothered submitting them since they were not crucial for me, and most submissions to lucene-dev seem to go into a black hole anyway without any response, negative or positive, simply ignored... Hint :-)

asfimport commented 18 years ago

Paul Smith (migrated from JIRA)

I expanded the depth of the HPROF test (will attach shortly), but in summary, this is where it comes from:

    TRACE 300351:
        java.lang.Throwable.fillInStackTrace(Throwable.java:Unknown line)
        java.lang.Throwable.<init>(Throwable.java:196)
        java.lang.Exception.<init>(Exception.java:41)
        java.lang.RuntimeException.<init>(RuntimeException.java:43)
        java.lang.ClassCastException.<init>(ClassCastException.java:39)
        org.apache.lucene.document.Field.readerValue(Field.java:262)
        org.apache.lucene.index.DocumentWriter.invertDocument(DocumentWriter.java:152)
        org.apache.lucene.index.DocumentWriter.addDocument(DocumentWriter.java:93)

Very strange, I must say. I will have to look into that in more detail. Anyone care to comment?

asfimport commented 18 years ago

Paul Smith (migrated from JIRA)

Created an attachment (id=15141) Same HPROF trace, patch applied, but with a deeper stack trace

asfimport commented 18 years ago

Paul Smith (migrated from JIRA)

See Bug 35037 for related info

asfimport commented 18 years ago

Paul Smith (migrated from JIRA)

(In reply to comment #12)
> Most likely the IOExceptions stem from FastCharStream of StandardAnalyzer, at least that's my experience. See the header of your hprof file to check if that's the case.
>
> Javacc has a gotcha in that it "likes" IOExceptions as part of the normal control flow on stream termination. I have performance patches for this that make FastCharStream 2-10 times faster for small inputs (it doesn't matter on large inputs), but I never bothered submitting them since they were not crucial for me, and seeing that most submissions to lucene-dev go into a black hole anyway without any response, neither negative or positive, simply ignored... Hint :-)

Hey, could you send me the patch for this? I'd like to test it out. I've verified that the FastCharStream class does throw an IOException at EOF, which of course occurs very frequently when indexing small Fields. I can't see a quick and easy way to indicate EOF without the exception, so I'm keen to see your work and test it out with my scenarios.

cheers,

Paul Smith

asfimport commented 18 years ago

hoschek (migrated from JIRA)

Yep, throwing and catching exceptions in the critical path is always a performance gotcha, common case or not. See any VM implementation performance papers such as those in the IBM Systems Journal some years ago, and others.

No idea why the javacc folks didn't come up with an API that does not involve exceptions for normal control flow. Well, javacc has probably been unmaintained dead code for some time now. [Even Xerces has such gotchas deep inside its low-level native API - I chatted about this some time ago with a Sun engineer].

Anyway, you can preallocate the IOException in FastCharStream in a private static final var, and then throw the same exception object again and again on EOS. That gives roughly the 2x gain the cheap way, because the stack trace does not have to be generated and filled repeatedly (same for the QueryParser copy of FastCharStream).
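A minimal sketch of that cheap variant, assuming a FastCharStream-like reader (this is not the actual Lucene/JavaCC code and the class name is made up):

    import java.io.IOException;
    import java.io.Reader;

    final class PreallocatedEofReader {
        // Allocated once: fillInStackTrace() runs a single time at class initialization,
        // so throwing this same object on every end-of-stream skips the per-throw
        // stack-trace cost. The trade-off is a meaningless stack trace, acceptable only
        // because end-of-stream is an expected, routine event here.
        private static final IOException EOF = new IOException("read past eof");

        private final Reader in;

        PreallocatedEofReader(Reader in) {
            this.in = in;
        }

        char readChar() throws IOException {
            int c = in.read();
            if (c == -1) {
                throw EOF; // same preallocated exception object every time
            }
            return (char) c;
        }
    }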

The other additional 5x comes from getting rid of the exception completely - catching exceptions is expensive. This is done via dirty patching the javacc-generated code to not require EOS exceptions at all. Instead you can return 0xFFFF as an EOS marker, or some other unused Unicode value. Look at the javacc-generated code and see where it catches the EOS exception. That's where you'd need to fiddle around, making sure true abnormal exceptions are still handled properly. It's really an awkward maintenance nightmare because it interferes with generated code, so I don't really recommend this.

Still, StandardAnalyzer eats CPU (and buffer memory) like there's no tomorrow. Instead, I'd recommend giving PatternAnalyzer (from the "memory" SVN contrib area) a try. The functionality is almost the same as StandardAnalyzer, but it can be many times faster, especially when using it with a String rather than a Reader, and you don't have to wait indefinitely for lucene to get fixed.

asfimport commented 17 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

This last patch looks good to me, and I remember looking at this issue when Paul brought it up. If I recall correctly, I saw the same wasted time in maybeMergeSegments().

All tests pass, so I'm committing this, only 15 months after it was reported :). Thanks Paul!

asfimport commented 17 years ago

Daniel Naber (migrated from JIRA)

Something is wrong with this patch (as it has been applied), as it increases memory usage. Indexing files with the IndexFiles demo worked before with writer.setMaxBufferedDocs(50) and a tight JVM memory setting (-Xmx2M); now it fails with an OutOfMemoryError.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

One problem I see: If you continuously add batches of documents of size less than maxBufferedDocs, maybeMergeSegments will never trigger a merge. On a close(), it looks like there is some merging logic in flushRamSegments, but it doesn't do full multi-level merging like maybeMergeSegments does.

I think I know how to fix it, but perhaps this should be reverted in the meantime?
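To make the failure mode concrete, here is a hedged sketch in the same era of API used at the top of this issue (indexDir is whatever index location you use; the counts are illustrative, not measured). Each session buffers far fewer documents than maxBufferedDocs, so the counter never reaches the merge threshold, and close() only flushes the session's buffer to one more on-disk segment with no multi-level merging:

    Analyzer analyzer = new StandardAnalyzer();
    for (int session = 0; session < 100; session++) {
        // create the index on the first session, append afterwards
        IndexWriter writer = new IndexWriter(indexDir, analyzer, session == 0);
        writer.setMaxBufferedDocs(1000);
        for (int i = 0; i < 10; i++) {   // far fewer than maxBufferedDocs per session
            Document doc = new Document();
            doc.add(Field.Keyword("id", session + "-" + i));
            writer.addDocument(doc);     // maybeMergeSegments never triggers a merge
        }
        writer.close();                  // flushRamSegments: no full multi-level merge
    }
    // If nothing ever merges the small segments, the on-disk segment count keeps growing.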

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Another problem I see even for single-sessions is that bufferedDocCount is not maintained correctly... all merges are subtracted, not just those of the buffered docs. This will lead to fewer and fewer merges than expected over time.

I definitely vote to revert this. There isn't an easy fix to this patch - a different approach is required.

asfimport commented 17 years ago

Paul Smith (migrated from JIRA)

Geez, yep, definitely don't put this in; my patch was only a 'suggestion' to highlight how it fixes the root cause of the problem. It is interesting that all the test cases still pass, yet the problems Yonik highlights are real. Might warrant some extra test cases to cover exactly those situations, even if this problem is not addressed.

Be great if this could be fixed completely though, but I haven't got any headspace left to continue research on this one.. sorry :(

asfimport commented 17 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Ah, too bad. :( Reverted. I do recall seeing maybeMergeSegments being called a lot, although I don't recall how much actual time was spent in there.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

How does this patch look?

It's designed to exactly match the previous merging behavior, and it's conservative (an exact count is not maintained at all times).

asfimport commented 17 years ago

Daniel Naber (migrated from JIRA)

Hi Yonik, I just tested the patch: sorry, but the problem is the same as before: I get an OutOfMemoryError using settings that worked without the patch. That doesn't mean that the patch is wrong of course, but as we're after performance improvements it wouldn't make sense to compare it to the old version if the old version uses less memory.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> the problem is the same as before: I get an OutOfMemoryError using settings that worked without the patch

That's worrisome though... the increase in code and data is minimal, and my patch should not trigger any different merging behavior. Unless the previous version was passing the low-mem test by a few bytes, something is still wrong.

asfimport commented 17 years ago

Paul Smith (migrated from JIRA)

This is where some trace logging would be useful. Maybe a YourKit memory snapshot to see what's going on? I can't see how Yonik's patch should influence the memory profile. It's just delaying the check for merging until an appropriate time, and should not be removing opportunities to merge segments. I can't see why checking less often would use more memory.

Obviously something strange is happening.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Well, the patch changes what the hotspot is, so the hotspot compiler could be trying to compile different code...

Daniel, what is the exact test you are running? what JVM and other settings? Perhaps trying both versions with -Xbatch or -Xint would remove the differences?

Perhaps we need a method to determine if index "invariants" have been met...

asfimport commented 17 years ago

Daniel Naber (migrated from JIRA)

I just see that this OOM is not exactly reproducible. I would have expected that once it has happened, it would always happen when indexing the same data with the same settings (though not necessarily at the same point). But that doesn't seem to be the case. I use Java 1.4.2, -Xmx2M, writer.setMaxBufferedDocs(50), and I am indexing 2200 files with the IndexFiles class from the Lucene demo package. Also, I let the code run inside Eclipse; maybe this has negative side effects.

So your code is probably okay, although I suggest you set up a similar test case just to be sure.

asfimport commented 17 years ago

DM Smith (migrated from JIRA)

In Eclipse you can set the run to use its own JVM. Otherwise it runs in Eclipse's JVM and it has negative memory side effects.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Something is wrong with my patch... the behavior still differs from the original. I wrote a testInvariants() method, and merges are not always done when they need to be.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Got it!

Attaching a new version with a minor change... I forgot to take into account deleted documents. After a merge, instead of setting the count to minMergeDocs, we set it to 0, forcing a recount on the next add. The reason is that deleted docs could have been squeezed out during the merge, leaving a smaller segment that might be used in another merge shortly.

Example of original behavior (minMergeDocs=10):

    // segment sizes = 55,8,1
    add doc
    // segment sizes = 55,8,1,1
    maybe merge
    // segment sizes = 55,9   (not 10 because of a previously deleted doc)
    add doc
    // segment sizes = 55,9,1
    maybe merge
    // segment sizes = 55,10

Now, the original behavior isn't necessarily desirable, but we are taking baby steps... first "do no harm". A separate patch can address merging behavior in the presence of deleted documents.

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

It seems that the excessive CPU usage is mainly for (re)scanning those single-doc segments at the top of the "stack".

The following simple patch (a small modification to the previous one) only keeps track of the number of single-doc segments, and when iterating for merge candidate segments, takes all those single-doc segments in one step.

All tests pass with this patch, running a bit faster (by about 3 seconds, which could be just noise). A "tight mem setting" run (2048 cacm files with -Xmx3m and setMaxBufferedDocs(50)) that fails with the previous patch passes with this one. I checked the trace for which segments were merged - same merge decisions as before this change for these 2048 files.
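A standalone sketch of that idea (not the attached patch; the class and parameter names are invented): the trailing singleDocSegmentsCount entries each hold exactly one document, so the candidate scan can account for all of them in a single step and only walk the older segments one by one.

    import java.util.List;

    final class SingleDocAwareScan {
        // docCounts: segment sizes, newest last; the last singleDocSegmentsCount entries are all 1.
        static int countDocsToMerge(List<Integer> docCounts, int singleDocSegmentsCount,
                                    int targetMergeDocs) {
            int mergeDocs = singleDocSegmentsCount;                  // all single-doc segments, in one step
            int i = docCounts.size() - singleDocSegmentsCount;
            while (--i >= 0 && docCounts.get(i) < targetMergeDocs) { // ordinary backward scan for the rest
                mergeDocs += docCounts.get(i);
            }
            return mergeDocs; // caller merges the scanned range when this reaches targetMergeDocs
        }
    }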

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I was literally a minute away from committing my version when Doron submitted his ;-) Actually, I think I like Doron's "singleDocSegmentsCount" better.... it's easier to understand at a glance.

I was testing the performance of mine... not as much of a speedup as I would have liked... 5 to 6% better with maxBufferedDocs=1000 and a trivial single-field document. You need to go to maxBufferedDocs=10000 to see a good speedup, and that's probably not advisable for most real indices (and maxBufferedDocs=1000 used much less memory and was slightly faster anyway).

Here is the code I added to IndexWriter to test my version (call testInvariants() after the add() call and after flushRamSegments() in close(), then do "ant test"):

    private synchronized void testInvariants() {
      // index segments should decrease in size
      int maxSegLevel = 0;
      for (int i = segmentInfos.size()-1; i >= 0; i--) {
        SegmentInfo si = segmentInfos.info(i);
        int segLevel = (si.docCount)/minMergeDocs;
        if (segLevel < maxSegLevel) {
          throw new RuntimeException("Segment #" + i + " is too small. " + segInfo());
        }
        maxSegLevel = Math.max(maxSegLevel, segLevel);
      }

      // check if merges needed
      long targetMergeDocs = minMergeDocs;
      int minSegment = segmentInfos.size();

      while (targetMergeDocs <= maxMergeDocs && minSegment >= 0) {
        int mergeDocs = 0;
        while (--minSegment >= 0) {
          SegmentInfo si = segmentInfos.info(minSegment);
          if (si.docCount >= targetMergeDocs) break;
          mergeDocs += si.docCount;
        }

        if (mergeDocs >= targetMergeDocs) {
          throw new RuntimeException("Merge needed at level " + targetMergeDocs + " :" + segInfo());
        }

        targetMergeDocs *= mergeFactor;        // increase target size
      }
    }

    private String segInfo() {
      StringBuffer sb = new StringBuffer("minMergeDocs=" + minMergeDocs
          + " docsLeftBeforeMerge=" + docsLeftBeforeMerge + " segsizes:");
      for (int i = 0; i < segmentInfos.size(); i++) {
        sb.append(segmentInfos.info(i).docCount);
        sb.append(",");
      }
      return sb.toString();
    }

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I've committed Doron's version after review and further testing with my testInvariants() (also committed, but commented out). Thanks Doron & Paul!

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Well.... there is a problem in the current patch after all... the counter is not decremented when a merge is triggered by a call to optimize(). It is only decremented when a merge is called from maybeMergeSegments(). Therefore the counter decrement should move to mergeSegments(int,int). I am testing this fix now and will add a new patch for this soon. Doron

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

The attached doron_2_IndexWriter.patch fixes the updating of singleDocSegmentsCount to take place in mergeSegments(minSegment, end) so that it also applies when optimize() is called. The update of the counter now considers the range of the merge (so the counter is not necessarily reset to 0).

The bug in the previous implementation occasionally caused TestIndexModifier.testIndexWithThreads() to fail with an ArrayIndexOutOfBoundsException on the segments array.

I ran this test several times with this fix and now it consistently passes. "ant test" passes as well.

I hope we're done with this bug...

Doron
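As a rough illustration of the bookkeeping this implies (a standalone sketch with made-up names, not the attached doron_2 patch): the single-doc segments always sit at the top of the segment stack, so the number of them removed by a merge over the range [minSegment, end) is just the overlap between that range and the top-of-stack block, and that is the amount to subtract from singleDocSegmentsCount.

    final class CounterUpdateModel {
        // How many trailing single-doc segments a merge over [minSegment, end) removes.
        static int singleDocSegmentsMerged(int totalSegments, int singleDocSegmentsCount,
                                           int minSegment, int end) {
            int firstSingleDoc = totalSegments - singleDocSegmentsCount; // the block starts here
            int overlapStart = Math.max(minSegment, firstSingleDoc);
            return Math.max(0, end - overlapStart);
        }
    }

Doing the subtraction in mergeSegments(minSegment, end) rather than in maybeMergeSegments() is what makes the optimize() path keep the counter correct as well.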

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Thanks Doron, I caught that too and I was just going to set the count to 0 in mergeSegments (mergeSegments is always called with end == size() currently I think). Your fix is better though - gives more flexibility.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

We could also make the following change to flushRamSegments, right?

    private final void flushRamSegments() throws IOException {
      int minSegment = segmentInfos.size() - singleDocSegmentsCount;
      int docCount = singleDocSegmentsCount;

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Right... actually it should be like this:

    int minSegment = segmentInfos.size() - singleDocSegmentsCount - 1;

But since flushRamSegments() is only called by close() and optimize(), no real performance gain is expected here.

So I'm not sure what my preference is between: (a) do not change here, because "why change working code to be perhaps a bit more complex for no performance gain"; or (b) change here too, to be consistent with how this counter is used in maybeMergeSegments().

Anyway, I tested this change and it works - so I am also attaching this version - doron_2b_IndexWriter.patch - in case option (b) is favored.
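For completeness, a standalone sketch of variant (b) (this only models the discussion and is not the attached doron_2b patch; names are invented): the flush-time scan starts just below the block of segments that singleDocSegmentsCount already accounts for, then keeps walking down while it still sees buffered (RAM) segments.

    import java.util.List;

    final class FlushScanModel {
        // isRamSegment: one flag per segment, newest last; true while a segment still
        // lives in the in-memory buffer and needs to be flushed to disk on close().
        static int firstBufferedSegment(List<Boolean> isRamSegment, int singleDocSegmentsCount) {
            int minSegment = isRamSegment.size() - singleDocSegmentsCount - 1;
            while (minSegment >= 0 && isRamSegment.get(minSegment)) {
                minSegment--; // older buffered segments that are not single-doc
            }
            return minSegment + 1; // index of the oldest segment that needs flushing
        }
    }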

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Paul, would you like to re-open this issue for (re)solving it with one of the two recent patches (2 or 2b)? I think that once an issue is resolved it can be re-opened (or closed) by the developer who opened it. (I assume committers can also re-open, but I'm not sure - well, at least I can't.) Thanks, Doron

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

No need to re-open Doron, I committed doron_2_IndexWriter.patch at the same time as my first reply (shortly after you posted it).

asfimport commented 17 years ago

Doron Cohen (migrated from JIRA)

Oh - sorry for the 'noise' - got used to systems where a commit must be attached to an issue/defect - should have checked the commits list or the code itself - I'm learning all the time...

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

No problem... I probably should have put "LUCENE-388" in the commit log (it allows JIRA to link up the related commits automatically)