apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

IndexWriter should immediately resolve deleted docs to docID in near-real-time mode [LUCENE-2047] #3122

Open asfimport opened 15 years ago

asfimport commented 15 years ago

Spinoff from #2600.

When deleteDocuments(Term) is called, we currently always buffer the Term and only later, when it's time to flush deletes, resolve to docIDs. This is necessary because we don't in general hold SegmentReaders open.

But, when IndexWriter is in NRT mode, we pool the readers, and so deleting in the foreground is possible.

It's also beneficial, in that it can reduce the turnaround time when reopening a new NRT reader by taking this resolution off the reopen path. And if multiple threads are used to do the deletion, then we gain concurrency, vs. reopen, which is not concurrent when flushing the deletes.
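
Roughly, the foreground resolution would look something like this sketch (the reader list and helper class are illustrative only; IndexWriter's pooled SegmentReaders are internal, and only writable readers accept deleteDocument):

```java
import java.io.IOException;
import java.util.List;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Illustrative only: resolve a delete Term to docIDs immediately against a set of
// already-open (pooled) readers, rather than buffering the Term until flush time.
// In the real IndexWriter the readers would come from its internal reader pool,
// incRef'd while in use; here they are simply passed in.
final class LiveDeleteSketch {
  static int deleteLive(Term term, List<IndexReader> pooledReaders) throws IOException {
    int deleted = 0;
    for (IndexReader reader : pooledReaders) {
      TermDocs td = reader.termDocs(term);   // term -> docID resolution (may hit the terms dict)
      try {
        while (td.next()) {
          reader.deleteDocument(td.doc());   // mark the doc deleted in RAM
          deleted++;
        }
      } finally {
        td.close();
      }
    }
    return deleted;
  }
}
```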


Migrated from LUCENE-2047 by Michael McCandless (@mikemccand), updated May 09 2016. Attachments: LUCENE-2047.patch (versions: 7)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Is this going to be an option or default to true only when NRT is on?

Also, I can create a patch for this.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think simply default to true.

Yes, please cons up a patch!

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Deletes occur immediately if poolReader is true

I'm not sure updateDocument needs to delete immediately: since it's also writing a document, the cost of doing the deletes later would be lost in the noise.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thanks Jason!

I would think we do want to delete live for updateDocument as well? It's not clear the deletion will be in the noise (if it hits a disk seek, it won't).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Pushing to 3.1...

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

I would think we do want to delete live for updateDocument as well? It's not clear the deletion will be in the noise (if it hits a disk seek, it won't).

The delete shouldn't hit a disk seek because with poolReaders, the deletes are applied and carried over in RAM. I'd assume the overhead of flushing the segment would be the main consumption of CPU and perhaps IO? Which with #2390, is less of an issue.

For consistency with deleteDocument, I'll add live deleting to updateDocument.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

We could very likely hit seeks resolving the term -> docID. EG to consult the terms dict index (eg page fault), terms dict itself, and then to load the posting. (Once we have pulsing, we can save that seek).

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

likely hit seeks resolving the term -> docID

Right, duh! Hopefully we won't get into the delete by doc id discussion again.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

I added deleting live for updateDocument.

TestNRTReaderWithThreads and TestIndexWriterReader pass.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

I'd like to remove the syncing in the deleteLive methods, however this causes other tangential exceptions.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Without syncing on the writer, we run into the SR's files being deleted out from under it while it's being obtained for deletion (I think).

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

After thumbing through the code a bit (this is too complex for me to venture a guess on how), we'd need to prevent deletion of the files of the SR we're deleting from, even if that SR is no longer live. That means possibly interacting with the deletion policy, or some other mechanism. It's a bit hairy given the segment info transaction shuffling that goes on in IW.

I believe applying the live deletes unsynchronized is a good thing, as that way we're taking advantage of concurrency. The possible expense is deleting from segments that are no longer live while the deletion is occurring.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

I think there's a fundamental design issue here. What happens to documents that need to be deleted but are still in the RAM buffer? Because we're not buffering the deletes those wouldn't be applied. Or would we still buffer the deletes, then only apply them only to the new SR created from the RAM buffer when flush or getReader is called?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

we'd need to prevent the deletion of the SR's files we're deleting from, even if that SR is no longer live.

It's strange that anything here is needed, because, when you check a reader out from the pool, it's incRef'd, which should mean the files need no protection. Something strange is up... could it be that when you checkout that reader to do deletions, it wasn't already open, and then on trying to open it, its files were already deleted? (In which case, that segment has been merged away, and, the merge has committed, ie already carried over all deletes, and so you should instead be deleting against that merged segment).

So I think the sync(IW) is in fact necessary? Note that the current approach (deferring resolving term -> docIDs until flush time) is also sync(IW)'d, so we're not really changing that here. Though I agree it would be nice to not have to sync(IW). Really what we need to sync on is "any merge that is merging this segment away and now wants to commit". That's actually a very narrow event, so someday (separate issue) if we could refine the sync'ing to that, it should be a good net throughput improvement for updateDocument.

What happens to documents that need to be deleted but are still in the RAM buffer?

Ahh, yes. We must still buffer for this case, and resolve these deletes against the newly flushed segment. I think we need a separate buffer that tracks pending delete terms only against the RAM buffer?

Also, instead of actually setting the bits in SR's deletedDocs, I think you should buffer the deleted docIDs into DW's deletesInRAM.docIDs? Ie, we do the resolution of Term/Query -> docID, but buffer the docIDs we resolved to. This is necessary for correctness in exceptional situations, eg if you do a bunch of updateDocuments, then DW hits an aborting exception (meaning its RAM buffer may be corrupt) then DW currently discards the RAM buffer, but, leaves previously flushed segments intact, so that if you then commit, you have a consistent index. Ie, in that situation, we don't want the docs deleted by updateDocument calls to be committed to the index, so we need to buffer them.
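
The idea is roughly the following sketch (class and field names are made up for illustration; the real buffer is DW's deletesInRAM, which also tracks Terms and Queries):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.lucene.index.IndexReader;

// Illustrative: park resolved docIDs per segment instead of flipping deletedDocs
// bits right away. A successful flush applies them; an aborting exception discards
// them, so deletes from failed updateDocument calls never reach a commit.
final class PendingDocIdDeletes {
  private final Map<String, List<Integer>> docIdsBySegment = new HashMap<String, List<Integer>>();

  void buffer(String segmentName, int docID) {
    List<Integer> ids = docIdsBySegment.get(segmentName);
    if (ids == null) {
      ids = new ArrayList<Integer>();
      docIdsBySegment.put(segmentName, ids);
    }
    ids.add(Integer.valueOf(docID));
  }

  // successful flush: now it is safe to actually mark the docs deleted
  void applyOnFlush(Map<String, IndexReader> readersBySegment) throws IOException {
    for (Map.Entry<String, List<Integer>> e : docIdsBySegment.entrySet()) {
      IndexReader reader = readersBySegment.get(e.getKey());
      for (Integer docID : e.getValue()) {
        reader.deleteDocument(docID.intValue());
      }
    }
    docIdsBySegment.clear();
  }

  // aborting exception: throw the pending deletes away
  void discardOnAbort() {
    docIdsBySegment.clear();
  }
}
```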

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

It's strange that anything here is needed

I was obtaining the segment infos synced, had a small block of unsynced code, then synced obtaining the sometimes defunct readers. Fixed that part, then the errors went away!

the sync(IW) is in fact necessary?

I'm hoping we can do the deletes unsynced, which will make this patch a net performance gain because we're allowing multiple threads to delete concurrently (whereas today we're performing them synced at flush time, i.e. the current patch is merely shifting the term/query lookup cost from flush to deleteDocument).

buffer the deleted docIDs into DW's deletesInRAM.docIDs

I'll need to step through this, as it's a little strange to me how DW knows the doc id to cache for a particular SR, i.e. how are they mapped to an SR? Oh there's the DW.remapDeletes method? Hmm...

Couldn't we save off a per SR BV for the update doc rollback case, merging the special updated doc BV into the SR's deletes on successful flush, throwing them away on failure? Memory is less of a concern with the paged BV from the pending #2600 patch. On a delete by query with many hits, I'm concerned about storing too many doc id Integers in BufferedDeletes.

Without syncing, new deletes could arrive, and we'd need to queue them, and apply them to new segments, or newly merged segments because we're not locking the segments. Otherwise some deletes could be lost.

A possible solution is, deleteDocument would synchronously add the delete query/term to a queue per SR and return. Asynchronously (i.e. in background threads) the deletes could be applied. Merging would aggregate the incoming SR's queued deletes (as they haven't been applied yet) into the merged reader's delete queue. On flush we'd wait for these queued deletes to be applied. After flush, the queues would be clear and we'd start over. And because the delete queue is per reader, it would be thrown away with the closed reader.
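
A minimal sketch of that per-reader queue idea (not actual IndexWriter code; the class name and wiring are hypothetical):

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import org.apache.lucene.index.Term;

// Each pooled SegmentReader keeps its own queue of not-yet-applied delete Terms.
// deleteDocument() only enqueues and returns; flush drains the queue, and
// commitMerge would concatenate the incoming readers' queues onto the merged
// reader's queue so no deletes are lost.
final class PerReaderDeleteQueue {
  private final Queue<Term> pending = new ConcurrentLinkedQueue<Term>();

  void enqueue(Term term) {                      // called from deleteDocument/updateDocument
    pending.add(term);
  }

  void mergeFrom(PerReaderDeleteQueue other) {   // commitMerge: carry over unapplied deletes
    pending.addAll(other.pending);
  }

  Term poll() {                                  // drained at flush (or by a background applier)
    return pending.poll();
  }
}
```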

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Also, if the per SR delete queue were implemented, we could expose the callback, and allow users to delete by doc id, edit norms (and in the future, update field caches) for a particular IndexReader. We'd pass the reader via a callback that resembles IndexReaderWarmer, then deletes, norms updates, etc, could be performed like they can be today with a non-readonly IR.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm hoping we can do the deletes unsynced, which will make this patch a net performance gain because we're allowing multiple threads to delete concurrently (whereas today we're performing them synced at flush time, i.e. the current patch is merely shifting the term/query lookup cost from flush to deleteDocument).

Merely shifting the cost off the critical reopen path is already a good step forward :)

But I agree, if we can also allow deletes to happen concurrently, that would be fabulous.

Currently the buffered deleted docIDs in DW are only due to exceptions hit while adding a document. EG say for a given doc there was an exception during analysis, after yielding 100 tokens. At this point DW's RAM buffer (postings) has already been changed to hold this docID, and we can't easily undo that. So we instead mark the docID as immediately deleted. These deleted docIDs can then carry in RAM for some time, and get remapped when merges commit.

On a delete by query with many hits, I'm concerned about storing too many doc id Integers in BufferedDeletes.

I agree, that's a risk if we are buffering an Integer for each deleted docID, so we should avoid that.

We really just need an efficient IntSet. Or, it's even possible a data structure that does no deduping (eg a simple growable int[]) would be fine, since apps would tend not to delete the same docID over and over. Since we flush when deletes use too much RAM, we'd be protected from silly apps...

Or maybe start with a growable int[], but cutover to BV once that's too large? This would protect the worst case of deleting by a single query matching many docs in the index.

Whatever data structure it is, it should live under BufferedDeletes, so the exception logic that already discards these deletes on hitting an aborting exception will just continue to work.
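
Something like this sketch of the growable-int[]-with-cutover idea (the cutover threshold and names are made up for illustration):

```java
import java.util.BitSet;

// Buffers resolved deleted docIDs: starts as a simple growable int[] (no deduping),
// and cuts over to a bit set once the array gets large, protecting the worst case
// of a single delete-by-query matching many docs.
final class BufferedDocIdSet {
  private static final int CUTOVER = 1 << 16;    // arbitrary cutover point for the sketch
  private int[] docIDs = new int[16];
  private int count;
  private BitSet bits;                           // non-null once we've cut over

  void add(int docID) {
    if (bits != null) {
      bits.set(docID);
      return;
    }
    if (count == docIDs.length) {
      if (count >= CUTOVER) {                    // too many ids: switch to a bit set
        bits = new BitSet();
        for (int i = 0; i < count; i++) {
          bits.set(docIDs[i]);
        }
        docIDs = null;
        bits.set(docID);
        return;
      }
      int[] grown = new int[docIDs.length * 2];  // simple doubling growth
      System.arraycopy(docIDs, 0, grown, 0, count);
      docIDs = grown;
    }
    docIDs[count++] = docID;
  }
}
```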

Memory is less of a concern with the paged BV from the pending #2600 patch

I'm not sure we even want to switch to a paged BV for NRT in general (just added comment to that effect), though I guess for just this case (buffering deletes in IW in NRT mode) it could make sense?

Couldn't we save off a per SR BV for the update doc rollback case, merging the special updated doc BV into the SR's deletes on successful flush, throwing them away on failure?

I like that approach. Because it means the normal case (no exception is hit) is very fast (no merging of the two BufferedDeletes, on flush, like we do today; and no docID remapping required), and only on exception must we restore the previously saved deletes.

Another alternative would be to redefine the semantics of IW on hitting an error.... right now, if you hit an aborting exception (one that may have corrupted DW's internal RAM postings), you only lose those docs buffered in DW at the time. Any already flushed segments & new deletions within this IW session are preserved. So in theory if you closed your IW, those changes would persist.

We could consider relaxing this, such that the entire session of IW is rolled back, to the last commit/when-IW-was-opened, just like we now do with OOME.

A possible solution is, deleteDocument would synchronously add the delete query/term to a queue per SR and return. Asynchronously (i.e. in background threads) the deletes could be applied.

I'd prefer not to add further BG threads to IW, ie, take the app's thread to perform the deletion. If the app wants concurrency for deletes, it can use multiple threads.

I believe we only have to sync on the merge committing its deletes, right? So we should make a separate lock for that?

And, on commit() we must also wait for all in-flight deletes to finish.

Finally, when a new segment is flushed, we should resolve the buffered Term/Query deletions against it, during the flush?

Also, if the per SR delete queue were implemented, we could expose the callback, and allow users to delete by doc id, edit norms (and in the future, update field caches) for a particular IndexReader. We'd pass the reader via a callback that resembles IndexReaderWarmer, then deletes, norms updates, etc, could be performed like they can be today with a non-readonly IR.

I'd like to strongly decouple this discussion of extensibility, from what we're doing in this issue. I think it's a good idea, but let's handle it separately – maybe under #3101 (refactoring IW, which'd pull out the reader pool). This issue should all be "under the hood" improvements.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Thinking more on this... I'm actually no longer convinced that this change is worthwhile.

Net/net this will not improve the dps/qps throughput on a given fixed hardware, because this is a zero sum game: the deletes must be resolved one way or another.

Whether we do it in batch (as today), or incrementally/concurrently, one at a time as they arrive, the same work must be done. In fact, batch should be less costly in practice since it clearly has temporal locality in resolving terms -> postings, so on a machine whose IO cache can't hold the entire index in RAM, bulk flushing should be a win.

It's true that net latency of reopen will be reduced by being incremental, but Lucene shouldn't aim to be able to reopen 100s of times per second: I think that's a mis-feature (most apps don't need it), and those that really do can and should use an approach like Zoie.

Finally, one can always set the max buffered delete terms/docs to something low, to achieve this same tradeoff. It's true that won't get you concurrent resolving of deleted Terms -> docIDs, but I bet in practice that concurrency isn't necessary (ie the performance of a single thread resolving all buffered deletes is plenty fast).

If the reopen time today is plenty fast, especially if you configure your writer to flush often, then I don't think we need incremental resolving of the deletions?

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

I want to replay how DW handles the updateDoc call to see if my understanding is correct.

1: Analyzing hits an exception for a doc; its doc id has already been allocated, so we mark it for deletion later (on flush?) in BufferedDeletes.

2: RAM Buffer writing hits an exception, we've had updates which marked deletes in current segments, however they haven't been applied yet because they're stored in BufferedDeletes docids. They're applied on successful flush.

Are these two scenarios correct, or am I completely off target? If they are, isn't update doc already deleting in the foreground?

prefer not to add further BG threads

Maybe we can use 1.5's ReentrantReadWriteLock to effectively allow multiple del/update doc calls to concurrently acquire the read lock, and perform the deletes in the foreground. The write lock could be acquired during commitDeletes, commit(), and after a segment is flushed? I'm not sure it would be necessary to acquire this write lock anytime segment infos is changed?
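
Roughly this locking scheme (a sketch only, not actual IndexWriter code; the method names are hypothetical):

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Many delete/update calls share the read lock and resolve their terms to docIDs
// concurrently in the foreground, while flush/commit/commitDeletes take the write
// lock to get a quiet point with no in-flight deletes.
final class DeleteLockSketch {
  private final ReadWriteLock rwLock = new ReentrantReadWriteLock();

  void deleteDocumentLive(Runnable resolveAndMarkDeletes) {
    rwLock.readLock().lock();          // concurrent with other deletes/updates
    try {
      resolveAndMarkDeletes.run();
    } finally {
      rwLock.readLock().unlock();
    }
  }

  void flushOrCommit(Runnable flushWork) {
    rwLock.writeLock().lock();         // excludes all in-flight deletes
    try {
      flushWork.run();
    } finally {
      rwLock.writeLock().unlock();
    }
  }
}
```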

I think it's important to remove unnecessary global locks on unitary operations (like deletes). We've had great results removing such locks before, e.g. for isDeleted and in (NIO)FSDirectory, where we didn't think there'd be an improvement, and there was. I think this patch (or a follow-on one that implements the shared lock solution) could measurably increase throughput for deleting and updating.

Lucene shouldn't aim to be able to reopen 100s of times per second

Reopening after every doc could be a valid case that I suspect will come up again in the future. I don't think it's too hard to support.

It's true that net latency of reopen will be reduced by being incremental, but Lucene shouldn't aim to be able to reopen 100s of times per second:

Perhaps update/del throughput will increase because of the shared lock, which would make the patch(es) worth implementing.

but I bet in practice that concurrency isn't necessary (ie the performance of a single thread resolving all buffered deletes is plenty fast).

We thought the same thing about the sync in FSDirectory, and it turned out that in practice NIOFSDir is an order of magnitude faster on *nix machines. For NRT, every little bit of concurrency will probably increase throughput (i.e. most users will have their indexes in the IO cache and/or a RAM dir, which means we wouldn't be penalizing concurrency as we are today with the global IW lock for del/up docs).

I'm going to go ahead and wrap up this patch, which will shift deletion cost to the del/up methods (still synchronously). Then create a separate patch that implements the shared lock solution.

Exposing SRs for updates by the user can be done today, I'll open a patch for this.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Reopening after every doc could be a valid case that I suspect will come up again in the future.

I suspect the vast majority of apps would be fine with 10 reopens per second.

Those that really must reopen 100s of times per second can use Zoie, or an approach like it.

I don't think it's too hard to support.

Whoa! Merely thinking about and discussing even how to run proper tests for NRT, let alone the possible improvements to Lucene on the table, is sucking up all my time ;)

I think it's important to remove unnecessary global locks on unitary operations (like deletes).

Yeah, I agree we should in general always improve our concurrency. In this case, resolving deletes syncs the entire IW + DW, so that blocks indexing new docs, launching/committing merges, flushing, etc. which we should fix. I just don't think NRT is really a driver for this...

1: Analyzing hits an exception for a doc; its doc id has already been allocated, so we mark it for deletion later (on flush?) in BufferedDeletes.

Analyzing or any other "non-aborting" exception, right.

2: RAM Buffer writing hits an exception, we've had updates which marked deletes in current segments, however they haven't been applied yet because they're stored in BufferedDeletes docids. They're applied on successful flush.

No – the deletes are buffered as Term, Query or docids in the BufferedDeletes. The only case that buffers docids now is your #1 above. On successful flush, these buffered things are moved to the deletesFlushed (but not resolved). They are only resolved when we decide it's time to apply them (just before a new merge starts, or when we've triggered the time-to-flush-deletes limits).
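
A minimal sketch of that deletesInRAM / deletesFlushed hand-off (illustrative names and shapes only; the real BufferedDeletes also tracks Queries, docIDs and per-term limits):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;

// New deletes accumulate in deletesInRAM; a successful flush moves them, still
// unresolved, to deletesFlushed; they are only resolved to docIDs when it is time
// to apply deletes. An aborting exception clears deletesInRAM but not deletesFlushed.
final class BufferedDeletesSketch {
  private final List<Term> deletesInRAM = new ArrayList<Term>();     // current RAM segment
  private final List<Term> deletesFlushed = new ArrayList<Term>();   // survived a flush

  void bufferDelete(Term term) {
    deletesInRAM.add(term);
  }

  void onSuccessfulFlush() {            // move, but do not resolve, the buffered terms
    deletesFlushed.addAll(deletesInRAM);
    deletesInRAM.clear();
  }

  void onAbortingException() {          // the RAM buffer is discarded, and so are its deletes
    deletesInRAM.clear();
  }

  List<Term> takeAllToApply() {         // just before a merge, or when limits trip
    List<Term> toApply = new ArrayList<Term>(deletesFlushed);
    deletesFlushed.clear();
    return toApply;
  }
}
```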

Maybe we can use 1.5's ReentrantReadWriteLock to effectively allow multiple del/update doc calls to concurrently acquire the read lock, and perform the deletes in the foreground.

I think that should work well?

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

resolving deletes syncs the entire IW + DW, so that blocks indexing new docs, launching/committing merges, flushing, etc... I just don't think NRT is really a driver for this...

Right, I think it's a general improvement that's come to light during NRT development and testing. I think NRT is great in this regard because it stresses Lucene in a completely new way, which improves it for the general batch use case (i.e. users can simply start utilizing NRT features when they need to without worry).

1: Analyzing hits an exception for a doc; its doc id has already been allocated, so we mark it for deletion later (on flush?) in BufferedDeletes.

So there's only one use case right now, which is only when analyzing an individual doc fails. The update doc adds the term to the BufferedDeletes for later application. Makes sense. I think we can resolve the update doc term in the foreground. I'm wondering if we need a different doc id queue for these? My hunch is yes, because the other doc ids need to be applied even on IO exception, whereas the update doc ids will not be applied?

2: RAM Buffer writing hits an exception, we've had updates which marked deletes in current segments, however they haven't been applied yet because they're stored in BufferedDeletes docids. They're applied on successful flush.

In essence we need to implement number 2?

Analyzing or any other "non-aborting" exception, right.

What is an example of a non-aborting exception?

use 1.5's ReentrantReadWriteLock

I'll incorporate RRWL into the follow on concurrent updating patch.

Whoa! Merely thinking about and discussing even how to run proper tests for NRT, let alone the possible improvements to Lucene on the table, is sucking up all my time

Yow, I didn't know. Thanks!

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

I'm browsing through the applyDeletes call path, and I'm tempted to rework how we're doing this. For my own thinking I'd still like to have a queue of deletes per SR and for the RAM doc buffer. I think this gives future flexibility and makes it really clear, when debugging, what's happening underneath. I find the doc id remapping confusing, meaning stepping through the code would seem to be difficult. If we're storing doc ids alongside the SR the docs correspond to, there's a lot less to worry about and it just seems clearer. This may make integrating #2390 directly into IW more feasible, as then we're working directly at the SR level and can tie the synchronization process together. Also this could make exposing SRs externally easier and aid in making IW more modular in the future?

I can't find the code that handles aborts.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

1: Analyzing hits an exception for a doc; its doc id has already been allocated, so we mark it for deletion later (on flush?) in BufferedDeletes.

So there's only one use case right now, which is only when analyzing an individual doc fails. The update doc adds the term to the BufferedDeletes for later application.

No, it adds the docID for later application. This is the one case (entirely internal to IW) where we delete by docID in the writer.

I think we can resolve the update doc term in the foreground. I'm wondering if we need a different doc id queue for these? I get the hunch yes, because the other doc ids need to be applied even on IO exception, whereas update doc id will not be applied?

I think we can use the same queue – whether they are applied or not follows exactly the same logic (ie, a successful flush moves all deletesInRAM over to deletesFlushed, and an aborting exception clears the deletesInRAM).

2: RAM Buffer writing hits an exception, we've had updates which marked deletes in current segments, however they haven't been applied yet because they're stored in BufferedDeletes docids. They're applied on successful flush.

In essence we need to implement number 2?

I'm confused – #2 is already what's implemented in IW, today.

The changes on the table now are to:

Otherwise we don't need to change what's done today (ie, keep the deletesInRAM vs deletesFlushed)?

What is an example of a non-aborting exception?

Anything hit during analysis (ie, TokenStream.incrementToken()), or anywhere except DocInverterPerField in the indexing chain (eg if we hit something when writing the stored fields, I don't think it'll abort).

I'm browsing through the applyDeletes call path, I'm tempted to rework how we're doing this.

I think this would be a good improvement – it would mean we don't need to ever remapDeletes, right?

The thing is... this has to work in non-NRT mode, too. Somehow, you have to buffer such a deleted docID against the next segment to be flushed (ie the current RAM buffer). And once it's flushed, we move the docID from deletesInRAM (stored per-SR) to the SR's actual deleted docs BV.

We would still keep the deletes partitioned, into those done during the current RAM segment vs those successfully flushed, right?

I can't find the code that handles aborts.

It's DW's abort() method, and eg in DocInverterPerField we call DW.setAborting on exception to note that this exception is an aborting one.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Mike, you mentioned testing getReader missing deletes etc. (in response to potential file handle leakage). Which test or benchmark did you use for this?

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

In the updateDocument and deleteDocument methods, deletes are buffered per segment reader, synchronized on writer. Immediately after, outside the sync block, they're deleted from the existing SRs. If a new SR is added, it's because of a flush (which has its own buffered deletes) or from a merge, which is synchronized on writer. In other words, we won't lose deletes; they'll always be applied on a flush, and the resolution of deletes to doc ids happens un-synchronized on writer.

Update document adds the term to the queued deletes, then resolves and adds the doc ids to an Integer list (for now). This is where we may want to use a growable int[] or an int set.

Flush applies queued update doc deleted doc ids to the SRs.

commitMerge merges queued deletes from the incoming SRs. Doc ids are mapped to the merged reader.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

When docWriter aborts on the RAM buffer, we clear out the queued updateDoc deletes.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

DocWriter abort created a deadlock noticed in TestIndexWriter.testIOExceptionDuringAbortWithThreads. This is fixed by clearing via the reader pool. Other tests fail in TIW.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

TestIndexWriter passes, mostly due to removing assertions in reader pool release which presumed deletions would be synchronized with closing a reader. They're not anymore. Deletions can come in at almost any time, so a reader may be closed by the pool while still carrying deletes. The releasing thread may not be synchronized on writer because we're allowing deletions to occur un-synchronized.

I suppose we need more tests to ensure the assertions are in fact not needed.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.