Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) [LUCENE-565]

asfimport commented 18 years ago

Today, applications have to open/close an IndexWriter and open/close an IndexReader directly or indirectly (via IndexModifier) in order to handle a mix of inserts and deletes. This performs well when inserts and deletes come in fairly large batches. However, the performance can degrade dramatically when inserts and deletes are interleaved in small batches. This is because the ramDirectory is flushed to disk whenever an IndexWriter is closed, causing a lot of small segments to be created on disk, which eventually need to be merged.

We would like to propose a small API change to eliminate this problem. We are aware that this kind change has come up in discusions before. See http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 . The difference this time is that we have implemented the change and tested its performance, as described below.

API Changes

We propose adding a "deleteDocuments(Term term)" method to IndexWriter. Using this method, inserts and deletes can be interleaved using the same IndexWriter.

Note that, with this change it would be very easy to add another method to IndexWriter for updating documents, allowing applications to avoid a separate delete and insert to update a document.

Also note that this change can co-exist with the existing APIs for deleting documents using an IndexReader. But if our proposal is accepted, we think those APIs should probably be deprecated.

Coding Changes

Coding changes are localized to IndexWriter. Internally, the new deleteDocuments() method works by buffering the terms to be deleted. Deletes are deferred until the ramDirectory is flushed to disk, either because it becomes full or because the IndexWriter is closed. Using Java synchronization, care is taken to ensure that an interleaved sequence of inserts and deletes for the same document are properly serialized.

We have attached a modified version of IndexWriter in Release 1.9.1 with these changes. Only a few hundred lines of coding changes are needed. All changes are commented by "CHANGE". We have also attached a modified version of an example from Chapter 2.2 of Lucene in Action.

Performance Results

To test the performance our proposed changes, we ran some experiments using the TREC WT 10G dataset. The experiments were run on a dual 2.4 Ghz Intel Xeon server running Linux. The disk storage was configured as RAID0 array with 5 drives. Before indexes were built, the input documents were parsed to remove the HTML from them (i.e., only the text was indexed). This was done to minimize the impact of parsing on performance. A simple WhitespaceAnalyzer was used during index build.

We experimented with three workloads:

Insert only. 1.6M documents were inserted and the final index size was 2.3GB.
Insert/delete (big batches). The same documents were inserted, but 25% were deleted. 1000 documents were deleted for every 4000 inserted.
Insert/delete (small batches). In this case, 5 documents were deleted for every 20 inserted.
```
                        current       current          new
```
Workload IndexWriter IndexModifier IndexWriter

Insert only 116 min 119 min 116 min Insert/delete (big batches) – 135 min 125 min Insert/delete (small batches) – 338 min 134 min

As the experiments show, with the proposed changes, the performance improved by 60% when inserts and deletes were interleaved in small batches.

Regards, Ning

Ning Li Search Technologies IBM Almaden Research Center 650 Harry Road San Jose, CA 95120

Migrated from LUCENE-565 by Ning Li, 8 votes, resolved Feb 13 2007 Attachments: LUCENE-565.Feb2007.patch, NewIndexModifier.Jan2007.patch, NewIndexModifier.Jan2007.take2.patch, NewIndexModifier.Jan2007.take3.patch, NewIndexModifier.Sept21.patch, perfres.log, perf-test-res.JPG, perf-test-res2.JPG, TestBufferedDeletesPerf.java Linked issues:

1747

asfimport commented 18 years ago

Doug Cutting (@cutting) (migrated from JIRA)

Can you please attach diffs rather than complete files? The diffs should not not contain CHANGE comments. To generate diffs, check Lucene out of Subversion, make your changes, then, from the Lucene trunk, run something like 'svn diff > my.patch'. New files should first be added with 'svn add' so that they're included in the diff. Thanks!

asfimport commented 18 years ago

Ning Li (migrated from JIRA)

Here is the diff file of IndexWriter.java.

asfimport commented 18 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

I took a look at the patch and it looks good to me (anyone else had a look)? Unfortunately, I couldn't get the patch to apply :(

$ patch -F3 < IndexWriter.patch (Stripping trailing CRs from patch.) patching file IndexWriter.java Hunk #1 succeeded at 58 with fuzz 1. Hunk #2 succeeded at 112 (offset 2 lines). Hunk #4 succeeded at 504 (offset 33 lines). Hunk #6 succeeded at 605 with fuzz 2 (offset 57 lines). missing header for unified diff at line 259 of patch (Stripping trailing CRs from patch.) can't find file to patch at input line 259 Perhaps you should have used the -p or --strip option? The text leading up to this was: ... ... ... File to patch: IndexWriter.java patching file IndexWriter.java Hunk #1 FAILED at 802. Hunk #2 succeeded at 745 with fuzz 2 (offset -131 lines). 1 out of 2 hunks FAILED – saving rejects to file IndexWriter.java.rej

Would it be possible for you to regenerate the patch against IndexWriter in HEAD?

Also, I noticed ^Ms in the patch, but I can take care of those easily (dos2unix).

Finally, I noticed in 2-3 places that the simple logging via "infoStream" variable was removed, for example:

if (infoStream != null) infoStream.print("merging segments");

Perhaps this was just an oversight?

Looking forward to the new patch. Thanks!

asfimport commented 18 years ago

Ning Li (migrated from JIRA)

For an overview of my changes, I'll repeat some of what I said in my earlier e-mail (see http://www.gossamer-threads.com/lists/lucene/java-dev/35143), then add more detail about specific coding changes.

Overview

Today, applications have to open/close an IndexWriter and open/close an IndexReader directly or indirectly (via IndexModifier) in order to handle a mix of inserts and deletes. This performs well when inserts and deletes come in fairly large batches. However, the performance can degrade dramatically when inserts and deletes are interleaved in small batches. This is because the ramDirectory is flushed to disk whenever an IndexWriter is closed, causing a lot of small segments to be created on disk, which eventually need to be merged.

API Changes

We propose adding a "deleteDocuments(Term term)" method to IndexWriter. Using this method, inserts and deletes can be interleaved using the same IndexWriter.

Coding Changes

Coding changes are localized to IndexWriter. Internally, the new deleteDocuments() method works by buffering the terms to be deleted. Deletes are deferred until the ramDirectory is flushed to disk, either because it becomes full or because the IndexWriter is closed. Using Java synchronization, care is taken to ensure that an interleaved sequence of inserts and deletes for the same document are properly serialized.

For simplicity of explanation, let's assume the index resides in a disk-based directory.

Changes to the IndexWriter variables:

segmentInfos used to store the info of all segments (on disk or in ram). Now it only stores the info of segments on disk.
ramSegmentInfos is a new variable which stores the info of just ram segments.
bufferedDeleteTerms is a new variable which buffers delete terms before they are applied.
maxBufferedDeleteTerms is similar to maxBufferedDocs. It controls the max number of delete terms that can be buffered before they must be flushed to disk.

Changes to IndexWriter methods:

addDocument() The info of the new ram segment is added to ramSegmentInfos.
deleteDocuments(), batchDeleteDocuments() The terms are added to bufferedDeleteTerms. bufferedDeleteTerms also records the current number of documents buffered in ram, so the delete terms can be applied to ram segments as well as the segments on disk.
flushRamSegments() Step 1: Apply buffered delete terms to all the segments on disk. Step 2: Merge all the ram segments into one segment on disk. Step 3: Apply buffered delete terms to the new segment appropriately, so that a delete term is only applied to the documents buffered before it, but not to those buffered after it. Step 4: Clean up and commit the change to the index (both the new segment and the .del files if it applies).
maybeMergeSegments() Before, a flush would be triggered only if enough documents were buffered. Now a flush is triggered if enough documents are buffered OR if enough delete terms are buffered.

asfimport commented 18 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Thanks for all the information about coding changes, that makes it easier to understand the diff. Ideally this will become comments in the new diff, which I can commit.

Yonik mentioned this in email. It does sound like a better place for this might be in a higher level class. IndexWriter would really not be just a writer/appender once delete functionality is added to it, even if it's the IndexReaders behind the scenes doing the work. So if you are going to be redoing the patch, consider this.

Perhaps IndexModifier methods should be deprecated and it should get a new/your API?

asfimport commented 18 years ago

Ning Li (migrated from JIRA)

Hi Otis,

I've attached two patch files:

IndexWriter.July09.patch is an updated version of the old patch.
NewIndexModifier.July09.patch makes minimal changes to IndexWriter and puts new functionalities in a new class called NewIndexModifier. I didn't name it IndexModifier because the two are unrelated and I don't want a diff of the two.

All unit test succeeded except the following one: [junit] Testcase: testIndex(org.apache.lucene.index.TestIndexModifier): FAILED [junit] expected:<3> but was:<4> [junit] junit.framework.AssertionFailedError: expected:<3> but was:<4> [junit] at org.apache.lucene.index.TestIndexModifier.testIndex(TestIndexModifier.java:67)

However, the unit test has a problem, not the patch: IndexWriter's docCount() does not tell the actual number of documents in an index, only IndexReader's numDocs() does. For example, in a similar test below, where 10 documents are added, then 1 deleted, then 2 added, the last call to docCount() returns 12, not 11, with or without the patch.

public void testIndexSimple() throws IOException { Directory ramDir = new RAMDirectory(); IndexModifier i = new IndexModifier(ramDir, new StandardAnalyzer(), true); // add 10 documents initially for (int count = 0; count < 10; count++) { i.addDocument(getDoc()); } i.flush(); i.optimize(); assertEquals(10, i.docCount()); i.deleteDocument(0); i.flush(); assertEquals(9, i.docCount()); i.addDocument(getDoc()); i.addDocument(getDoc()); i.flush(); assertEquals(12, i.docCount()); }

The reason for the docCount() difference in the unit test (which does not affect the correctness of the patch) is that flushRamSegments() in the patch merges all and only the segments in ram and write to disk, whereas the original flushRamSegments() merges not only the segments in ram but sometimes also one segment from disk (see in that function the comment "// add one FS segment?").

Regards, Ning

asfimport commented 18 years ago

Ning Li (migrated from JIRA)

Hopefully, third time's a charm. :-)

I rewrote IndexWriter in such a way that semantically it's the same as before, but it provides extension points so that delete-by-term, delete-by-query, and more functionalities can be easily supported in a subclass. NewIndexModifier is such a subclass that supports delete-by-term.

Here is an overview of the changes:

Changes to IndexWriter Changes to IndexWriter variables:

segmentInfos used to store the info of all segments (on disk or in ram). Now it only stores the info of segments on disk.
ramSegmentInfos is a new variable which stores the info of just ram segments. Changes to IndexWriter methods:
addDocument() The info of the new ram segment is added to ramSegmentInfos.
maybeMergeSegments() toFlushRamSegments() is called at the beginning to decide whether a flush should take place.
flushRamSegments() doAfterFlushRamSegments() is called after all ram segments are merged and flushed to disk.

NewIndexModifier New variables:

bufferedDeleteTerms is a new variable which buffers delete terms before they are applied.
maxBufferedDeleteTerms is similar to maxBufferedDocs. It controls the max number of delete terms that can be buffered before they must be flushed to disk. Overloaded/new methods:
deleteDocuments(), batchDeleteDocuments() The terms are added to bufferedDeleteTerms. bufferedDeleteTerms also records the current number of documents buffered in ram, so the delete terms can be applied to ram segments as well as the segments on disk.
toFlushRamSegments() In IndexWriter, a flush would be triggered only if enough documents were buffered. Now a flush is triggered if enough documents are buffered OR if enough delete terms are buffered.
doAfterlushRamSegments() Step 1: Apply buffered delete terms to all the segments on disk. Step 2: Apply buffered delete terms to the new segment appropriately, so that a delete term is only applied to the documents buffered before it, but not to those buffered after it. Step 3: Clean up the buffered delete terms.

asfimport commented 18 years ago

Doron Cohen (migrated from JIRA)

I tried out this patch (July18), and have a few comments...

First, it is nice to be able to add/remove documents with no need to care for switching between readers and writers, and without worrying for performance issues as result of that switching. I did not test for performance yet.

This post is quite long, so here is an outline... (1) Compile error in test code (2) Failing tests - is this patch actually fixing a bug in current flushRamSegments()? (3) Additional tests I ran (4) Javadocs remarks (5) deleteDocument(int doc) not implemented (6) flush() not implemented (7) Method name - batchDeleteDocuments(Term[]) (8) Class name and placement + What's Next for this patch

------------ (1) Compile error in test code The new TestWriterDelete does not reflect recent name change from IndexWriter to NewIndexModifier. Easily fixed by renaming accordingly in that file.

------------ (2) Failing tests - does this patch also fix a bug in current flushRamSegments()? "ant test" has one failure: TestIndexModifier.testIndex(). This is the same issue that Ning described above. However I think it exposes a bug in current flushRamSegments(): when an index with 1 segment on disk that has 2 documents, one of which is deleted, and 1 segment in memory, is closed, this method decides to merge - prematurely - the two segments into one. This wrong behavior (if I understand things correctly) is - by "mistake" - causing TestIndexModifier.testIndex() to pass in the current implementation of flushRamSegments(). But this comes with the cost of too many merges. If one is interleaving adds and deletes this bug would become costly. I will post a separate question on this to the dev forum, to discuss if this is indeed a bug.

------------ (3) Additional tests I ran I wanted to verify that all existing functions (well, at least tested ones..) are working with the new class (NewIndexModifier). So I temporarily renamed the existing IndexWriter to IndexWriter0, and renamed NewIndexModifier to IndexWriter (now extending IndexWriter0). For compiling, now, also had to temporarily modify args from IndexWriter to IndexWriter0 in 3 classes - DocumentWriter, SegmentMerger, and also from NewIndexModifier to IndexWriter in the new TestWriteDelete. (Note again: these modifications are temporary, just for the sake of testing this new class as if it was the new IndexWriter, which it is not.) Now all the tests were using the new class instead of the original IndexWriter.

All tests passed, except for TestIndexModifier.testIndex() - this is the same failure as above - so, no problem detected in new class.

------------ (4) Javadocs Remarks Current Javadocs for the new class focus on changes to the implementation. I think this description of implementation changes should be made regular Java comments (for developers), and instead should add a shorter javadoc that describes the API for users, and the implications on behavior as result of buffering deletes.

------------ (5) deleteDocument(int doc) not implemented Original IndexModifier has a delete(int docs), the new class doesn't. At first this seems ok, since internal doc IDs are not accessible through index writer (unlike index reader). But IndexModifier also does not provide access to doc-ids. So why was delete-by-id enabled in IndexModifier? Perhaps there's a good reason for it, that I fail to see - if so, it should probably be added to the new class as well. Adding this is required if the new class would eventually replace the implementation of current index modifier.

------------ (6) flush() not implemented Original IndexModifier has a flush(int docs) method, allowing to commit any pending changes. I think it would be nice to have this feature here as well, for forcing any pending changes (without caller having to modify the max-bufferred value). This would allow more control when using this class. Again, adding this is required if the new class would eventually replace the implementation of current index modifier.

------------ (7) Method name - batchDeleteDocuments(Term[]) I would prefer it to be called deleteDocuments(Term[]), and let Java decide which method to call. Main reason is developers would expect that methods with similar semantics are named similarly, especially when using IDEs like Eclipse, where users type "i.del" and the IDE lets them select from all the methods that start with "del".

------------ (8) Class name and placement + What's Next for this patch Performance test should be added for this new class. Also, I did not code review the actual code changes to IndexWriter and the code of NewIndexModifier itself.

It seems to me that this class would be very useful for users, either as a new class or if it replaces the current implementation of IndexModifier. Latter would be possible only if the 2 missing methods mentioned above are added. In this case, the "immediate delete" behavior of current IndexModifier should be possible to achieve by users, by setting maxBefferedDeleteTerms to 1.

One disadvantage of this class vs. current IndexModifier is the ability to add access to further methods of IndexReader. With current IndexModifier this is very simple (though not always efficient) - just follow code template in existing methods, i.e. close writer/reader and open reader/writer as required. With the new class, exposing further methods of IndexReader would be more of a challenge. Perhaps having a multiReader on all segment readers can do. I am not sure to what extent this should be a consideration, so just bringing it up.

So, If this class replaces IndexModifier - fine. If not, how about calling it BufferredIndexWriter?

Doron

asfimport commented 18 years ago

Jason Rutherglen (migrated from JIRA)

I tested just the IndexWriter from this code base, it does not seem to work. NewIndexModifier does work. I simply used IndexWriter to create several documents and then search for them. Nothing came back even though it seems something was written to disk.

asfimport commented 18 years ago

Ning Li (migrated from JIRA)

> Yes I am including this patch as it is very useful for increasing > the efficiency of updates as you described. I will be conducting > more tests and will post any results. Yes a patch for IndexWriter > will be useful so that the entirety of this build will work. > Thanks!

I've attached a patch that works with the current code. The implementation of IndexWriter and NewIndexModifier is the same as the last patch. I removed the "singleDocSegmentsCount" optimization from this patch since my IndexWriter checks singleDocSegmentsCount by simply calling ramSegmentInfos.size().

This patch had evolved with the help of many good discussions (thanks!) since it came out in May. Here is the current state of the patch:

This patch aims at enabling users to do inserts and general deletes (delete-by-term, and later delete-by-query) without switching between writers and readers.
The goal is achieved by rewritting IndexWriter in such a way that semantically it's the same as before, but it provides extension points so that delete-by-term, delete-by-query, and more functionalities can be easily supported in a subclass.
NewIndexModifier extends IndexWriter and supports delete-by-term by simply overriding two methods: toFlushRamSegment() which decides if a flush should happen, and doAfterFlushRamSegments() which does proper work after a flush is done.

Suggestions are welcome! Especially those that may help it get committed. :-)

asfimport commented 18 years ago

Ning Li (migrated from JIRA)

Doron, thank you very much for the review! I want to briefly comment on one of your comments:

> (5) deleteDocument(int doc) not implemented

I deliberately left that one out. This is because document ids are changing as documents are deleted and segments are merged. Users don't know exactly when segments are merged thus ids are changed when using IndexModifier. Thus I don't think it should be supported in IndexModifier at all.

asfimport commented 18 years ago

Jason Rutherglen (migrated from JIRA)

This IndexWriter seems to work. Thanks. Great work!

asfimport commented 18 years ago

Doron Cohen (migrated from JIRA)

I ran a performance test for interleaved adds and removes - and compared between IndexModifier and NewIndexModifier.

Few setups were tested, with a few combinations of "consecutive adds before a delete takes place", maxBufferredDocs, and "number of total test iterations", where each iteration does the conseutive adds and then does the deletes.

Each setup ran in this order - orig indexModifier, new one, orig, new one, and the best time out of the two runs was used.

Results indicate that NewIndexModifier is far faster for most setups.

Attached is the performance test, the performance results, and the log of the run. The performance test is written as a Junit test, and it fails in case the original IndexModfier is faster than the new one by more than 1 second (smaller than 1 sec difference is considered noise).

Test was run on XP (SP1) with IBM JDK 1.5.

Test was first failing with "access denied" errors due to what seems to be an XP issue. So in order to run this test on XP (and probably other Windows platforms) the patch from #1740 should be applied first.

It is interesting to notice that in addition to preformance gain, NewIndexModifier seems less sensitive to "access denied" XP problems, because it closes/reopens readers and writers less frequently, and indeed, at least in my runs, these errors had to be bypassed (by the "retry" patch) only for the current index-modifier.

Doron

asfimport commented 18 years ago

Jason Rutherglen (migrated from JIRA)

It seems this writer works, but then some mysterious happens to the index and the searcher can no longer read it. I am using this in conjunction with Solr. The index files look ok, however a search will return nothing. I have seen this repeatedly over about 1 weeks time.

asfimport commented 18 years ago

Doron Cohen (migrated from JIRA)

Is it that results that were returned are suddenly (say after updates) not returned anymore (indicating something bad happened to existing index)?

Or is it that the search does not reflect recent changes?

I don't remember how often Solr closes and re-opens the writer/modifier... with this patch a delete does not immediately cause a "flush to disk" - so flushes are controlled by closing the NewIndexModifier (and re-opening, since there no flush() method) and by the limits for max-bufferred-docs and max-bufferred-deletes. If this seems relevant to your case, what limits are in effect?

asfimport commented 18 years ago

Jason Rutherglen (migrated from JIRA)

I started to flush the deletes after making them, which opens a new NewIndexModifier afterwards. I still see the same thing. I am starting off by deleting all documents by matching on a Term that all of them have. Commit (reopen), then perform a batch addDocuments. Then when a search is executed nothing is returned, and after an optimize the index goes down to 1K. Seems like some peculiarity in NewIndexModifier. Seems like the new documents are deleted even after they are added.

asfimport commented 18 years ago

Doron Cohen (migrated from JIRA)

Just to make sure on the scenario - are you - (1) using NewIndexModifier at all, or (2) just letting Solr use this IndexWriter (with the code changes introduced to ebable NewIndexModifier) instead of the Lucene's svn-head (or cetrain release) IndexModifier.

As is, Solr would not use NewIndexModifier or IndexModifier at all.

For case (2) above the bufferred deletes logic is not in effect at all.

I wonder if it possibe to re-create this with a simple Lucene stand-alone (test) program rather than with Solr - it would be easier to analyze.

asfimport commented 18 years ago

Jason Rutherglen (migrated from JIRA)

Good points... I've actually used both NewIndexModifier and the parent. I've tried writing a new UpdateHandler, and incorporating the new IndexWriter into DirectUpdateHandler2. I will create a non-Solr reproduction of the issue. I guess it has something to do with ths doc ids being reused and so the new documents that are added are also marked as deleted as the number of documents would match almost exactly after the rebuild. I am not an expert in regards to that aspect of Lucene.

asfimport commented 18 years ago

Jason Rutherglen (migrated from JIRA)

Having trouble reproducing this. Probably something in the other code. Thanks for the help and the patch, I feel more confident in it now.

asfimport commented 18 years ago

Jason Rutherglen (migrated from JIRA)

I figured out the problem, the Solr DirectUpdateHandler2 expects to delete only a certain number of documents specifically the oldest first in some cases by using TermDocs and deleting by the doc id. NewIndexModifier deletes at the level of the SegmentReader however. Any good way to do this?

asfimport commented 18 years ago

Doron Cohen (migrated from JIRA)

Updated performance test results - perf-test-res2.JPG - in avarage, the new code is 9 times faster!

What have changed? - in previous test I forgot to set max-buffered-deletes.

After fixing so, I removed the test cases with max-buffer of 5,000 and up, because they consumed too much memory, and added more practical (I think) cases of 2000 and 3000.

Here is a textual summary of the data in the attached image:

max buf add/del 10 10 100 1000 2000 3000 iterations 1 10 100 100 200 300 adds/iteration 10 10 10 10 10 10 dels/iteration 5 5 5 5 5 5 orig time (sec) 0.13 0.86 9.57 8.88 22.74 44.01 new time (sec) 0.20 0.95 1.74 1.30 2.16 3.08 Improvement (sec) -0.07 -0.09 7.83 7.58 20.58 40.94 Improvement (%) -55% -11% 82% 85% 90% 93%

Note: for the first two cases new code is slower by 11% and 55%, but this is a very short test case, - the absolute difference here is less than 100ms, comparing to the other cases, where the difference is measured in seconds and 10's of seconds.

asfimport commented 18 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> the new code is 9 times faster!

That's a bit apples-and-oranges :-) I don't think people use IndexModifier when they need performance... they buffer things and do them in batches.

IMO, performance of existing code using IndexWriter is more important, and what I would be interested in. Say indexing 10000 documents to a RamDirectory with default settings (no deletes at all). I haven't had a chance to review the new code, so I don't know if it's something to worry about or not.

asfimport commented 18 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I believe this patch probably also changes the merge behavior. I think we need to discuss what exactly the new merge behavior is, if it's OK, what we think the index invariants should be (no more than x segments of y size, etc), and I'd like to see some code to test those invariants.

Keep in mind the difficulty of getting the last IndexWriter patch correct (and that was a very minor patch to keep track of the number of buffered docs!)

asfimport commented 18 years ago

Doron Cohen (migrated from JIRA)

I agree - I also suspected it might change the merge behavior (and also had reflections from the repeated trials to have that simple Indexwriter buffered-docs patch correct...:-).

Guess I just wanted to get a feeling if there is interest to include this patch before I delve into it too much - and the perf test was meant to see for my self if it really helps. I was a bit surprised that it speeds 9 times in an interleaving add/delete scenario. Guess this by itself now justifies delving into this patch, analyzing merge behavior as you suggest - will do - I think idealy should try this patch not to modify the merge behavior.

About the test - l was trying to test what I thought is a realistic use scenario (max-buf, etc.) - I have a fixed version of the perf test that is easier to modify for different scenarios - can upload it here if there is interest.

asfimport commented 18 years ago

Ning Li (migrated from JIRA)

This patch features the new more robust merge policy. Reference on the new policy is at http://www.gossamer-threads.com/lists/lucene/java-dev/35147

The patch passes all the tests except that one in TestIndexModifier (see an earlier comment on this issue).
Since the test itself has a problem, it is fixed (one line change) and the patch passes the fixed test.
A new test call TestIndexWriterMergePolicy is included which shows the robustness of the new merge policy.

The following is a detailed description of the new merge policy and its properties.

Overview of merge policy:

A flush is triggered either by close() or by the number of ram segments reaching maxBufferedDocs. After a disk segment is created by the flush, further merges may be triggered.

LowerBound and upperBound set the limits on the doc count of a segment which may be merged. Initially, lowerBound is set to 0 and upperBound to maxBufferedDocs. Starting from the rightmost* segment whose doc count

lowerBound and <= upperBound, count the number of consecutive segments whose doc count <= upperBound.

Case 1: number of worthy segments < mergeFactor, no merge, done. Case 2: number of worthy segments == mergeFactor, merge these segments. If the doc count of the merged segment <= upperBound, done. Otherwise, set lowerBound to upperBound, and multiply upperBound by mergeFactor, go through the process again. Case 3: number of worthy segments > mergeFactor (in the case mergeFactor M changes), merge the leftmost* M segments. If the doc count of the merged segment <= upperBound, consider the merged segment for further merges on this same level. Merge the now leftmost* M segments, and so on, until number of worthy segments < mergeFactor. If the doc count of all the merged segments <= upperBound, done. Otherwise, set lowerBound to upperBound, and multiply upperBound by mergeFactor, go through the process again. Note that case 2 can be considerd as a special case of case 3.

This merge policy guarantees two invariants if M does not change and segment doc count is not reaching maxMergeDocs: B for maxBufferedDocs, f(n) defined as ceil(log_M(ceil(n/B))) 1: If i (left) and i+1 (right) are two consecutive segments of doc counts x and y, then f(x) >= f(y). 2: The number of committed segments on the same level (f(n)) <= M.

asfimport commented 18 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Thanks for separating out the new merge policy Ning! I'm reviewing the patch now... Assuming everything looks good (it does so far), I'm inclined to commit it. I'm just giving a heads up to other lucene developers as this is a change in behavior to core lucene.

I think the new merge policy is a positive change because:

flushing all ram segments separately from disk segments allows more efficient implementations of combination reader/writers (like buffered deletes) because docids won't change from the flush alone (a merge is needed to change ids)
flushing all buffered docs together leaves more optimization possibilities... something other than single-doc segments could be used to buffer in-mem docs in the future.
increases indexing performance in the presence of deleted documents or partially full segments (merges are minimized while the number of segments are maximized).
fixes worst-case behavior that can cause the number of segments to grow too large (way more than mergefactor)

Are there any concerns?

asfimport commented 18 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I also did a quick indexing performance test w/ Solr:

maxBufferedDocs=100, mergeFactor=4, did 100K random overwriting adds in batches of 75 (75 docs added, dups deleted). It was 12% faster with this new merge policy.

asfimport commented 18 years ago

Ning Li (migrated from JIRA)

This is to update the delete-support patch after the commit of the new merge policy.

Very few changes to IndexWriter.
The patch passes all tests.
A new test call TestNewIndexModifierDelete is added to show different scenarios when using delete methods in NewIndexModifier.

asfimport commented 18 years ago

Ning Li (migrated from JIRA)

[[ Old comment, sent by email on Thu, 6 Jul 2006 07:53:35 -0700 ]]

Hi Otis,

I will regenerate the patch and add more comments. :-)

Regards, Ning

         "Otis Gospodnetic                                             
         (JIRA)"                                                       
         &lt;jira@apache.org&gt;                                          To 
                                   ningli@almaden.ibm.com              
         07/05/2006 11:25                                           cc 
         PM                                                            
                                                               Subject 
                                   [jira] Commented: (LUCENE-565)      
                                   Supporting deleteDocuments in       
                                   IndexWriter (Code and Performance   
                                   Results Provided)                   

[

http://issues.apache.org/jira/browse/LUCENE-565?page=comments#action_12419396 ]

Otis Gospodnetic commented on LUCENE-565:

I took a look at the patch and it looks good to me (anyone else had a look)? Unfortunately, I couldn't get the patch to apply :(

$ patch -F3 < IndexWriter.patch (Stripping trailing CRs from patch.) patching file IndexWriter.java Hunk #1 succeeded at 58 with fuzz 1. Hunk #2 succeeded at 112 (offset 2 lines). Hunk #4 succeeded at 504 (offset 33 lines). Hunk #6 succeeded at 605 with fuzz 2 (offset 57 lines). missing header for unified diff at line 259 of patch (Stripping trailing CRs from patch.) can't find file to patch at input line 259 Perhaps you should have used the -p or --strip option? The text leading up to this was: ... ... ... File to patch: IndexWriter.java patching file IndexWriter.java Hunk #1 FAILED at 802. Hunk #2 succeeded at 745 with fuzz 2 (offset -131 lines). 1 out of 2 hunks FAILED – saving rejects to file IndexWriter.java.rej

Would it be possible for you to regenerate the patch against IndexWriter in HEAD?

Also, I noticed ^Ms in the patch, but I can take care of those easily (dos2unix).

Finally, I noticed in 2-3 places that the simple logging via "infoStream" variable was removed, for example:

if (infoStream != null) infoStream.print("merging segments");

Perhaps this was just an oversight?

Looking forward to the new patch. Thanks!

Provided)

a IndexWriter http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049

to deleting version using batches.

– This message is automatically generated by JIRA.

If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa

For more information on JIRA, see: http://www.atlassian.com/software/jira

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

With the recent commits to IndexWriter, this patch no longer applies cleanly. The 5 votes for this issue encourages me to submit yet another patch. :-) But before I do that, I'd like to briefly describe the design again and welcome all suggestions that help improve it and help get it committed. :-)

With the new merge policy committed, the change to IndexWriter is minimal: three zero-or-one-line functions are added and used. 1 timeToFlushRam(): return true if number of ram segments >= maxBufferedDocs and used in maybeFlushRamSegments() 2 anythingToFlushRam(): return true if number of ram segments > 0 and used in flushRamSegments() 3 doAfterFlushRamSegments(): do nothing and called in mergeSegments() if the merge is on ram segments

The new IndexModifier is a subclass of IndexWriter and only overwrites the three functions described above. 1 timeToFlushRam(): return true if number of ram segments >= maxBufferedDocs OR if number of buffered deletes >= maxBufferedDeletes 2 anythingToFlushRam(): return true if number of ram segments > 0 OR if number of buffered deletes > 0 3 doAfterFlushRamSegments(): properly flush buffered deletes

The new IndexModifier supports all APIs from the current IndexModifier except one: deleteDocument(int doc). I had commented on this before: "I deliberately left that one out. This is because document ids are changing as documents are deleted and segments are merged. Users don't know exactly when segments are merged thus ids are changed when using IndexModifier."

This behaviour is true for both the new IndexModifier and the current IndexModifier. If this is preventing this patch from getting accepted, I'm willing to add this, but I will detail this in the Java doc so users of this function are aware of this behaviour.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

What are the reasons to not add the NewIndexModifier to Lucene? This issue has already 6 votes, so it seems to be very popular amongst users (there is only one issue that has more votes). I can say that I'm using it for a couple of months already, it works flawlessly and made my life a lot easier ;-)

I think the main objections were that too many changes to IndexWriter were made in the earliest versions of this patch, but with the new merge policy committed, most of the new code is in the new class NewIndexModifier whereas the changes to IndexWriter are minimal.

So I would like to encourage committer(s) to take another look, I think this would be a nice feature for the next Lucene release.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Lack of committer time... I've been busy enough that I've shied away from complexity and gravitated toward issues that I can handle in a single bite. I'm on PTO until the end of the year, so I expect my time to be more compressed.

To minimize the number of reader open/closes on large persistent segments, I think the ability to apply deletes only before a merge is important. That might add a 4th method: doBeforeMerge()

It would be nice to not have to continually open and close readers on segments that aren't involved in a merge. Is there a way to do this?

Ning, please do produce another patch to the latest trunk (but you might want to wait until after #1777 is sorted out.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> It would be nice to not have to continually open and close readers on segments > that aren't involved in a merge. Is there a way to do this?

Hmmm, and what about segments that are involved in a merge? I assume it's a different reader that is used for deleting docs than used for merging, but it doesn't have to be...

If SegmentInfos had a cached reader, that seems like it would solve both problems. I haven't thought about it enough to figure out how doable it is though.

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> If SegmentInfos had a cached reader, that seems like it would solve both problems. > I haven't thought about it enough to figure out how doable it is though.

Good idea! I think this could also be used by reopen (LUCENE-743 ) to re-use readers.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> Good idea! I think this could also be used by reopen (LUCENE-743 ) to re-use readers.

Yes, although reopen() needs more support than what would be needed for this though (namely reference counting). One thing to probably watch out for is to avoid making the single-doc ram segments more expensive.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

On 12/12/06, Ning Li <ning.li.li@gmail.com> wrote: > > To minimize the number of reader open/closes on large persistent segments, I think the ability to apply deletes only before a merge is important. That might add a 4th method: doBeforeMerge() > > I'm not sure I get this. Buffered deletes are only applied(flushed) > during ram flush. No buffered deletes are applied in the merges of > on-disk segments.

What is important is to be able to apply deletes before any ids change. You could do it after every new lowest-level segment is written to the index (the flush), or you could choose to do it before a merge of the lowest level on-disk segments. If none of the lowest level segments have deletes, you could even defer the deletes until after all the lowest-level segments have been merged. This makes the deletes more efficient since it goes from O(mergeFactor * log(maxBufferedDocs)) to O(log(mergeFactor*maxBufferedDocs))

If we can't reuse IndexReaders, this becomes more important.

One could perhaps choose to defer deletes until a segment with deleted docs is involved in a merge.

> > It would be nice to not have to continually open and close readers on segments that aren't involved in a merge. Is there a way to do this? > > If SegmentInfos had a cached reader, that seems like it would solve both problems. > > I haven't thought about it enough to figure out how doable it is though. > > This is a good idea! One concern, however, is that caching readers > will cause a larger memory footprint. Is it acceptable?

As I said, I haven't had time to think about it at all, but at the lowest level of reuse, it wouldn't increase the footprint at all in the event that deletes are deferred until a merge:

The specific scenario I'm thinking of is instead of doAfterFlushRamSegments() open readers delete docs close readers segmentMerger() open readers merge segments close readers

It would be: doAfterFlushRamSegments() open readers delete docs segmentMerger() merge segments close readers

This cutting out an additional open-close cycle. You are right that other forms of reader caching could increase the footprint, but it's nice to have the option of trading some memory for performance.

Yet another strategy a subclass of IndexWriter could choose is to only apply deletes to segments actually involved in a merge. Then the bigger segments in the index wouldn't continually have an reader opened and closed on them.... it could all be deferred until a close, or until there are too many deletes buffered.

Of course NewIndexModifier doesn't have to impliment all these options to start with, but it would be nice if the extension hooks in IndexWriter could support them.

Whew, this is why I was slow to get involved in this again :-)

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

> or you could choose to do it before a merge of the lowest level on-disk > segments. If none of the lowest level segments have deletes, you could > even defer the deletes until after all the lowest-level segments have been > merged. This makes the deletes more efficient since it goes from > O(mergeFactor * log(maxBufferedDocs)) to O(log(mergeFactor*maxBufferedDocs))

I don't think I like this semantics, though. With the semantics in the patch, an update can be easily supported. With this semantics, an insert is flushed yet a delete before the insert may or may not have been flushed.

> You are right that other forms of reader caching could increase the footprint, > but it's nice to have the option of trading some memory for performance.

Agree. It'd be nice to cache all readers... :-)

Thanks again for your comments. Enjoy your PTO!

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Hmmm, I see your point... If deletes are deferred, a different reader could go and open the index and see the additions but not the deletions.

Can the same thing happen with your patch (with a smaller window), or are deletes applied between writing the new segment and writing the new segments file that references it? (hard to tell from current diff in isolation)

Anyway, it's less of a problem if opening a new reader is coordinated with the writing. That still does leave the crash scenario though.

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

> Can the same thing happen with your patch (with a smaller window), or are deletes applied between writing the new segment and writing the new segments file that references it? (hard to tell from current diff in isolation)

No, it does not happen with the patch, no matter what the window size is. This is because results of flushing ram - both inserts and deletes - are committed in the same transaction.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> both inserts and deletes - are committed in the same transaction.

OK, cool. I agree that's the ideal default behavior.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Minor question... in the places that you use Vector, is there a reason you aren't using ArrayList? And in methods that pass a Vector, that could be changed to a List .

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

> Minor question... in the places that you use Vector, is there a reason you aren't using ArrayList? > And in methods that pass a Vector, that could be changed to a List .

ArrayList and List can be used, respectively.

asfimport commented 17 years ago

Paul Elschot (migrated from JIRA)

I'd like to give this a try over the upcoming holidays. Would it be possible to post a single patch? A single patch can be made by locally svn add'ing all new files and then doing an svn diff on all files involved from the top directory.

Regards, Paul Elschot

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

Many versions of the patch were submitted as new code was committed to IndexWriter.java. For each version, all changes made were included in a single patch file.

I removed all but the latest version of the patch. Even this one is outdated by the commit of #1776 (lock-less commits). I was waiting for the commit of #1777 before submitting another patch. #1777 was committed this morning. So I'll submit an up-to-date patch over the holidays.

On 12/18/06, Paul Elschot (JIRA) <jira@apache.org> wrote: > I'd like to give this a try over the upcoming holidays.

That's great! We can discuss/compare the designs then. Or, we can discuss/compare the designs before submitting new patches.

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

Here is the design overview. Minor changes were made because of lock-less commits.

In the current IndexWriter, newly added documents are buffered in ram in the form of one-doc segments. When a flush is triggered, all ram documents are merged into a single segment and written to disk. Further merges of disk segments may be triggered.

NewIndexModifier extends IndexWriter and supports document deletion in addition to document addition. NewIndexModifier not only buffers newly added documents in ram, but also buffers deletes in ram. The following describes what happens when a flush is triggered:

1 merge ram documents into one segment and written to disk do not commit - segmentInfos is updated in memory, but not written to disk

2 for each disk segment to which a delete may apply open reader delete docs*, write new .delN file (* Care is taken to ensure that an interleaved sequence of inserts and deletes for the same document are properly serialized.) close reader, but do not commit - segmentInfos is updated in memory, but not written to disk

3 commit - write new segments_N to disk

Further merges for disk segments work the same as before.

As an option, we can cache readers to minimize the number of reader opens/closes. In other words, we can trade memory for better performance. The design would be modified as follows:

1 same as above

2 for each disk segment to which a delete may apply open reader and cache it if not already opened/cached delete docs*, write new .delN file

3 commit - write new segments_N to disk

The logic for disk segment merge changes accordingly: open reader if not already opened/cached; after a merge is complete, close readers for the segments that have been merged.

asfimport commented 17 years ago

Jeremy F. Kassis (migrated from JIRA)

Happy New Year everyone. I'm personally very excited about this improvement to Lucene. It really begins to open Lucene up to service highly mutable data, important for the application I'm developing. Following the thread, it looks like quite a few people have favorably reviewed the patch. Perhaps it's time for a blessing and commit?

asfimport commented 17 years ago

Ning Li (migrated from JIRA)

The patch is updated because of the code committed to IndexWriter since the last patch. The high-level design is the same as before. See comments on 18/Dec/06.

Care has been taken to make sure if writer/modifier tries to commit but hits disk full that writer/modifier remains consistent and usable. A test case is added to TestNewIndexModifierDelete to test this.

All tests pass.

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks for redoing the patch Ning! I like the added test case for disk full.

I've reviewed this and it looks great. I fixed a few small typos and whitespace issues (looks like a line-wrapper had jumped in at some point) and attached NewIndexModifier.Jan2007.take2.patch

I think this is the only issue holding up a Lucene 2.1 release (so far?). Yonik (or anyone) do you have any objections / questions about this patch? It's basically unchanged from before, just modified to accommodate recent fixes to IndexWriter.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I just reviewed this, and it looks good to me. I like how you managed to enable parallel analysis in updateDocument() too.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

I tried the new patch out and everything looks good to me. One comment though: The public method NewIndexModifier.flush() just calls the public method flushRamSegments(). It might be confusing to have two public methods that do exactly the same?

Besides this minor question I'm all for committing this patch.

apache / lucene

Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) [LUCENE-565] #1643

1747

– This message is automatically generated by JIRA.

If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa