apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) [LUCENE-565] #1643

Closed asfimport closed 17 years ago

asfimport commented 18 years ago

Today, applications have to open/close an IndexWriter and open/close an IndexReader directly or indirectly (via IndexModifier) in order to handle a mix of inserts and deletes. This performs well when inserts and deletes come in fairly large batches. However, the performance can degrade dramatically when inserts and deletes are interleaved in small batches. This is because the ramDirectory is flushed to disk whenever an IndexWriter is closed, causing a lot of small segments to be created on disk, which eventually need to be merged.
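The segment-churn effect described above can be sketched in plain Java (this is illustrative, not Lucene code): every writer close flushes the RAM buffer to a new on-disk segment, so re-opening the writer for each small batch multiplies the number of small segments that later need merging. The class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SegmentChurn {

    // Model: each writer "session" buffers docs in RAM; closing the writer
    // flushes whatever is buffered to a brand-new on-disk segment.
    static int segmentsCreated(int totalDocs, int docsPerWriterSession) {
        List<Integer> segments = new ArrayList<>();
        int buffered = 0;
        for (int i = 0; i < totalDocs; i++) {
            buffered++;
            if (buffered == docsPerWriterSession) { // close() -> flush RAM to disk
                segments.add(buffered);
                buffered = 0;
            }
        }
        if (buffered > 0) segments.add(buffered); // final close()
        return segments.size();
    }

    public static void main(String[] args) {
        // One long-lived writer: a single large flush.
        System.out.println(segmentsCreated(10000, 10000)); // 1 segment
        // Re-opening the writer every 10 docs to interleave deletes:
        System.out.println(segmentsCreated(10000, 10));    // 1000 small segments
    }
}
```

The second case is exactly the interleaved small-batch workload: a thousand small segments instead of one, all of which the merge policy must later fold together.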

We would like to propose a small API change to eliminate this problem. We are aware that this kind of change has come up in discussions before. See http://www.gossamer-threads.com/lists/lucene/java-dev/23049?search_string=indexwriter%20delete;#23049 . The difference this time is that we have implemented the change and tested its performance, as described below.

API Changes


We propose adding a "deleteDocuments(Term term)" method to IndexWriter. Using this method, inserts and deletes can be interleaved using the same IndexWriter.

Note that, with this change, it would be very easy to add another method to IndexWriter for updating documents, allowing applications to avoid issuing a separate delete and insert when updating a document.
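The update method alluded to above would simply be a delete-then-insert under the same writer. A toy plain-Java sketch (not Lucene code; the "index" here is just a map keyed by a term, and all names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class UpdateSketch {
    // Toy "index": key term -> document payload. Purely illustrative.
    private final Map<String, String> index = new LinkedHashMap<>();

    public void addDocument(String keyTerm, String doc) {
        index.put(keyTerm, doc);
    }

    public void deleteDocuments(String keyTerm) {
        index.remove(keyTerm);
    }

    // The proposed convenience: delete-then-insert in one call, on one writer,
    // with no IndexReader open/close in between.
    public void updateDocument(String keyTerm, String doc) {
        deleteDocuments(keyTerm);
        addDocument(keyTerm, doc);
    }

    public String get(String keyTerm) {
        return index.get(keyTerm);
    }

    public static void main(String[] args) {
        UpdateSketch w = new UpdateSketch();
        w.addDocument("id:1", "v1");
        w.updateDocument("id:1", "v2");
        System.out.println(w.get("id:1")); // the replacement document
    }
}
```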

Also note that this change can co-exist with the existing APIs for deleting documents using an IndexReader. But if our proposal is accepted, we think those APIs should probably be deprecated.

Coding Changes


Coding changes are localized to IndexWriter. Internally, the new deleteDocuments() method works by buffering the terms to be deleted. Deletes are deferred until the ramDirectory is flushed to disk, either because it becomes full or because the IndexWriter is closed. Using Java synchronization, care is taken to ensure that an interleaved sequence of inserts and deletes for the same document are properly serialized.
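One way to realize the serialization guarantee described above (a hedged sketch in plain Java, not the attached patch) is to stamp each buffered delete term with the number of documents added so far, so that at flush time the delete applies only to documents inserted before it:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BufferedDeletes {
    private final List<String> docs = new ArrayList<>();   // buffered docs (by key term)
    private final List<Boolean> live = new ArrayList<>();
    // term -> doc count at the moment the delete arrived; the delete applies
    // only to documents added before it, which serializes insert/delete order.
    private final Map<String, Integer> bufferedDeleteTerms = new LinkedHashMap<>();

    public synchronized void addDocument(String keyTerm) {
        docs.add(keyTerm);
        live.add(true);
    }

    public synchronized void deleteDocuments(String term) {
        bufferedDeleteTerms.put(term, docs.size()); // stamp with current doc count
    }

    /** Called when the RAM buffer is flushed: apply deferred deletes, then clear. */
    public synchronized List<String> flush() {
        for (Map.Entry<String, Integer> e : bufferedDeleteTerms.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                if (docs.get(i).equals(e.getKey())) live.set(i, false);
            }
        }
        List<String> flushed = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            if (live.get(i)) flushed.add(docs.get(i));
        }
        docs.clear();
        live.clear();
        bufferedDeleteTerms.clear();
        return flushed;
    }
}
```

With this scheme, the sequence add("a"), delete("a"), add("a") correctly flushes exactly one surviving "a": the delete was stamped before the second insert, so it only kills the first.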

We have attached a modified version of IndexWriter in Release 1.9.1 with these changes. Only a few hundred lines of coding changes are needed. All changes are commented by "CHANGE". We have also attached a modified version of an example from Chapter 2.2 of Lucene in Action.

Performance Results


To test the performance of our proposed changes, we ran some experiments using the TREC WT 10G dataset. The experiments were run on a dual 2.4 GHz Intel Xeon server running Linux. The disk storage was configured as a RAID0 array with 5 drives. Before the indexes were built, the input documents were parsed to remove the HTML from them (i.e., only the text was indexed). This was done to minimize the impact of parsing on performance. A simple WhitespaceAnalyzer was used during the index build.

We experimented with three workloads:


| Workload | IndexWriter (inserts only) | Existing API (open/close cycle) | Proposed IndexWriter |
| --- | --- | --- | --- |
| Insert only | 116 min | 119 min | 116 min |
| Insert/delete (big batches) | – | 135 min | 125 min |
| Insert/delete (small batches) | – | 338 min | 134 min |

As the experiments show, with the proposed changes the small-batch insert/delete workload dropped from 338 minutes to 134 minutes, a 60% improvement, while the other workloads were unaffected or slightly faster.

Regards, Ning

Ning Li Search Technologies IBM Almaden Research Center 650 Harry Road San Jose, CA 95120



Migrated from LUCENE-565 by Ning Li, 8 votes, resolved Feb 13 2007. Attachments: LUCENE-565.Feb2007.patch, NewIndexModifier.Jan2007.patch, NewIndexModifier.Jan2007.take2.patch, NewIndexModifier.Jan2007.take3.patch, NewIndexModifier.Sept21.patch, perfres.log, perf-test-res.JPG, perf-test-res2.JPG, TestBufferedDeletesPerf.java

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The flush() was added to better match the current IndexModifier, based on feedback (bullet 6) above:

https://issues.apache.org/jira/browse/LUCENE-565#action_12428035

Actually, back when that feedback was given, flushRamSegments() was still private. I agree it's awkward now to have two separate methods that do the same thing.

But, I prefer "flush" over "flushRamSegments" because flush() is more generic so it reveals less about how the IndexWriter makes use of its RAM and leaves freedom in the future to have more interesting use of RAM (like KinoSearch as one example).

So I think the right fix would be to add a public IndexWriter.flush() that just calls flushRamSegments, and then make flushRamSegments private again, then remove the flush() method from NewIndexModifier? (The public flushRamSegments() has not yet been released so making it private again before we release 2.1 is OK).

Any objections to this approach? I will re-work the last patch & attach it.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

Thanks for the explanation, Mike. I'd prefer flush() too and the changes you suggest look good to me!

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I've attached NewIndexModifier.Jan2007.take3.patch with that approach.

I plan on committing this in the next day or two if there are no more questions/feedback....

Thank you Ning for this great addition, and for persisting through this long process!

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I just committed this.

Thank you Ning. Keep the patches coming!

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Reopening based on recent discussions on java-dev:

http://www.gossamer-threads.com/lists/lucene/java-dev/45099

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK I moved NewIndexModifier's methods into IndexWriter and did some small refactoring, tightening up protections, fixed javadocs, indentation, etc. NewIndexModifier is now removed.

I like this solution much better!

I also increased the default number of deleted terms before a flush is triggered from 10 to 1000. These buffered terms use very little memory so I think it makes sense to have a larger default?

So, this adds these public methods to IndexWriter:

public void updateDocument(Term term, Document doc, Analyzer analyzer)
public void updateDocument(Term term, Document doc)
public synchronized void deleteDocuments(Term[] terms)
public synchronized void deleteDocuments(Term term)
public void setMaxBufferedDeleteTerms(int maxBufferedDeleteTerms)
public int getMaxBufferedDeleteTerms()

And this public field:

public final static int DEFAULT_MAX_BUFFERED_DELETE_TERMS = 1000;
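The flush trigger implied by setMaxBufferedDeleteTerms can be sketched as follows (plain Java, hypothetical names, not the committed patch): each buffered delete term costs very little memory, so the writer simply counts distinct buffered terms and flushes once the configurable limit is reached.

```java
import java.util.HashSet;
import java.util.Set;

public class DeleteTermFlushTrigger {
    // Mirrors the larger default argued for above; illustrative constant.
    public static final int DEFAULT_MAX_BUFFERED_DELETE_TERMS = 1000;

    private int maxBufferedDeleteTerms = DEFAULT_MAX_BUFFERED_DELETE_TERMS;
    private final Set<String> bufferedDeleteTerms = new HashSet<>();
    int flushCount = 0; // how many times the limit forced a flush

    public void setMaxBufferedDeleteTerms(int max) {
        this.maxBufferedDeleteTerms = max;
    }

    public int getMaxBufferedDeleteTerms() {
        return maxBufferedDeleteTerms;
    }

    public void deleteDocuments(String term) {
        bufferedDeleteTerms.add(term);
        // Once the number of distinct buffered delete terms hits the limit,
        // trigger a flush (which applies the deletes and empties the buffer).
        if (bufferedDeleteTerms.size() >= maxBufferedDeleteTerms) {
            flush();
        }
    }

    void flush() {
        bufferedDeleteTerms.clear();
        flushCount++;
    }
}
```

For example, after setMaxBufferedDeleteTerms(3), the third distinct delete term triggers a flush and empties the buffer.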

On the extensions points, we had previously added these 4:

protected void doAfterFlushRamSegments(boolean flushedRamSegments)
protected boolean timeToFlushRam()
protected boolean anythingToFlushRam()
protected boolean onlyRamDocsToFlush()

I would propose that instead we add only the first one above, but rename it to "doAfterFlush()". This is basically a callback that a subclass could use to do its own thing after a flush but before a commit.
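The proposed doAfterFlush() hook amounts to a template-method callback: the writer drives the flush sequence and gives a subclass one well-defined point to act after the flush but before the commit. A minimal sketch under that assumption (class names are illustrative, not Lucene code):

```java
public class FlushHookDemo {

    static class Writer {
        /** Callback for subclasses; runs after a flush, before commit. Default no-op. */
        protected void doAfterFlush() {
        }

        public final void flush() {
            // ... write buffered RAM segments to the directory ...
            doAfterFlush(); // subclass hook fires here
            // ... commit the new segments file ...
        }
    }

    // Example subclass using the hook, e.g. to track or post-process flushes.
    static class CountingWriter extends Writer {
        int flushes = 0;

        @Override
        protected void doAfterFlush() {
            flushes++;
        }
    }

    public static void main(String[] args) {
        CountingWriter w = new CountingWriter();
        w.flush();
        System.out.println(w.flushes); // hook ran once
    }
}
```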

But then I don't think we should add any of the others. The "timeToFlushRam()" callback isn't really needed now that we have a public "flush()" method. And the other two are very specific to how IndexWriter implements RAM buffering/flushing and so unless/until we can think of a use case that needs these I'm inclined to not include them?

Yonik, is there something in Solr that would need these last 2 callbacks?

I've attached the patch (LUCENE-565.Feb2007.patch) with these changes!

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> OK I moved NewIndexModifier's methods into IndexWriter and did some small refactoring, tightening up protections,

> I would propose that instead we add only the first one above, but rename it to "doAfterFlush()".

Yes, that sounds fine.

The problem is that we wouldn't be able to take advantage of the hook because of the "tightening up protections". Access to the segments is key.

So instead of changing these to private, how about package protected?

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, got it. I will change those 3 to package protection and then commit. Thanks Yonik.

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Closing all issues that were resolved for 2.1.