apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Integrate IndexReader with IndexWriter [LUCENE-1516] #2590

Closed asfimport closed 15 years ago

asfimport commented 15 years ago

The current problem is that an IndexReader and an IndexWriter cannot be open at the same time and both perform updates, as each requires the write lock on the index. While methods such as IW.deleteDocuments enable deleting from the IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from the IW. This limits the ability to update the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again, a process which can be detrimental to realtime updates.

This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with the IW, in part by sharing the write lock. All methods of IR will be usable, including reopen and clone.
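For orientation, a minimal sketch of how the proposed API could be used (assumes Lucene 2.x-era constructors; the getReader method name follows the proposal above, and exact signatures were still in flux):

IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
writer.addDocument(doc);                   // buffered in the writer; nothing committed yet

IndexReader reader = writer.getReader();   // proposed: point-in-time view including buffered changes
IndexSearcher searcher = new IndexSearcher(reader);
// ... searches here see the uncommitted document ...
searcher.close();
reader.close();                            // the caller owns a reference and must release it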

Migrated from LUCENE-1516 by Jason Rutherglen, resolved Apr 09 2009. Attachments: LUCENE-1516.patch (versions: 28), magnetic.png, ssd.png, ssd2.png

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

It looks like DirectoryIndexReader needs an IndexWriter mode, as unfortunately subclassing won't work. In this context, overriding a method implies the IW mode is in use.

I assume we'll share the segmentInfos object from IW rather than share a clone with the IR?

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

It still seems to me that we should think about what objects need to be shared and how they can be produced/instantiated appropriately instead of adding a Reader onto the Writer, which, IMO, pollutes the Writer API. I know it complicates things, but I think it will be less confusing to users of the API.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

This patch includes the latest patch from #2391 and is unfortunately hard to read because of that. Is there a way to create a patch minus another patch?

The IW internal reader is always the result of the latest external IR.reopen. In this way deletes made in the IW and from the external IR remain in sync. The latest IR always holds the write lock anyway, per IR.clone semantics.

DocumentsWriter.applyDeletes reuses the reader in IW rather than calling SegmentReader.get/.deleteDocument.

In IW.doFlush the internal reader is reopened with the new SegmentInfos.

Included a very basic test case.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

This is an extremely early patch and has a bunch of System.outs for debugging.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Some terminology for this patch: an internal reader is the IndexWriter's reader; an external reader is one given out via the IW.reopenReader method.

If DirectoryIndexReader has a writer, no lock is acquired on updates. IR.clone normally passes the writeLock to the new reader; however, the external IR and the IW's internal IR both need to hold the write lock. For this reason the user must be careful when flushing to ensure the proper instance of the IR's deletes is merged with the writer.

The external IR.flush does not flush the deletes to disk; instead it merges them with the IW's internal IR, which is in RAM. IW.flush causes deletes and new segments to be flushed to the directory.

The test cases from TestIndexWriterReader testIndexWriterDeletes and testIndexWriterReopenSegment fail when the IW is opened again after commit and close. The index files are being deleted during IW.commit. I traced this to IW.finishCommit -> deleter.checkpoint -> deleter.deletePendingFiles.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Added TestIndexWriterReader.testIndexWriterDeletes, testIndexWriterDeletesOptimize, testIndexWriterReopenSegment, testIndexWriterReopenSegmentOptimize. The optimize variants fail while the non-optimize ones pass. The optimize variants delete _0.fdt and _0.fdx and so fail when the writer is created again, because it cannot find those files. It could be a segment infos merging problem or something else.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

The previously mentioned issue in the test case is fixed. Perhaps IW.reopenInternalReader() should be called in IW.checkpoint?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Looks good, Jason. This is a big change, and I expect to go through a number of iterations before settling... plus we still need to figure out how the API is exposed. Comments:

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Mike, good points...

> since you can't predict when IW materializes deletes, your reader will suddenly see a bunch of deletes appear.

The reader would need to be reopened to see the deletes. Isn't that expected behavior?

> Instead of having to merge readers, I think we need a single source to obtain an SR from

I like this; however, how would IR.clone work? I like having the internal reader separate from the external reader. The main reason to expose IR from IW is to allow delete by doc id and norms updates (and eventually column stride fields updates). I don't see how we can grab a reader during a merge and block realtime deletes occurring on the external reader. However, it is difficult to reconcile deletes to an external SR that's been merged away.

It seems like we're getting closer to using a unique long UID for each doc that is carried over between merges. I was going to implement this on top of LUCENE-1516; however, we may want to make UIDs a part of LUCENE-1516 to implement the behavior we're discussing.

If the updates to SR are queued, then it seems like the only way to achieve this is a doc UID. This way merges can happen in the background, and the IR has a mechanism for mapping its queue to the newly merged segments when flushed. Hopefully we aren't wreaking havoc with the IndexReader API?

The scenario I think we're missing is if there are multiple cloned SRs out there. With the IW check-out-an-SR model, how do we allow cloning? A clone's updates will be placed into a central original SR queue? The queue is drained automatically on a merge or IW.flush? What happens when we want the IR deletes to be searchable without flushing to disk? Do a reopen/clone?

> number of iterations before settling

Agreed, if it were simple it wouldn't be fun. ☺

> It's not necessary for IW to write new .del files when it materializes deletes.

Good point, DocumentsWriter.applyDeletes shouldn't be flushing to disk and this sounds like a test case to add to TestIndexWriterReader.

> IW.reopenInternalReader only does a clone, not a reopen; however, does it cover the newly flushed segment?

The segmentinfos is obtained from the Writer. In the test case testIndexWriterReopenSegment it looks like using clone reopens the new segments.

> I think it's better if no deletes appear, ever, until you reopen your reader. Maybe we simply prevent deletion through the IR?

Preventing deletion through the IR would seem to defeat the purpose of the patch unless there's some alternative mechanism for deleting by doc id?

> commitMergedDeletes to decouple computing the new BitVector from writing the .del file to disk.

A hidden method I never noticed. I'll keep it in mind.

> It seems like reader.reopen() (where reader was obtained with IW.getReader()) doesn't do the right thing? (ie it's looking for the most recent segments_N in the Directory, but it should be looking for it @ IW.segmentInfos).

Using the reopen method implementation for a Reader with IW does not seem necessary. It seems like it could call clone underneath?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

>> since you can't predict when IW materializes deletes, your reader will suddenly see a bunch of deletes appear.

> The reader would need to be reopened to see the deletes. Isn't that expected behavior?

Ahh right, so long as we keep internal (private) clone, materializing the deletes won't affect the external reader.

>> Instead of having to merge readers, I think we need a single source to obtain an SR from

> I like this; however, how would IR.clone work?

It should work fine? The single source would only be used internally by IW (for merging, for materializing deletes, for the internal reader).

> I like having the internal reader separate from the external reader.

I think we should keep that separation.

> The main reason to expose IR from IW is to allow delete by doc id and norms updates (and eventually column stride fields updates). I don't see how we can grab a reader during a merge and block realtime deletes occurring on the external reader. However, it is difficult to reconcile deletes to an external SR that's been merged away.
>
> It seems like we're getting closer to using a unique long UID for each doc that is carried over between merges. I was going to implement this on top of LUCENE-1516; however, we may want to make UIDs a part of LUCENE-1516 to implement the behavior we're discussing.
>
> If the updates to SR are queued, then it seems like the only way to achieve this is a doc UID. This way merges can happen in the background, and the IR has a mechanism for mapping its queue to the newly merged segments when flushed. Hopefully we aren't wreaking havoc with the IndexReader API?

But... do we need delete by docID once we have realtime search? I think the last compelling reason to keep IR's delete by docID was immediacy, but realtime search can give us that, from IW, even when deleting by Term or Query?

(Your app can always add that long UID if it doesn't already have something usable).

docIDs are free to change inside IW. I don't see how we can hand out a reader, allow deletes by docID to it, and merge those deletes back in at a later time, unless we track the genealogy of the segments?

> The scenario I think we're missing is if there are multiple cloned SRs out there. With the IW check-out-an-SR model, how do we allow cloning? A clone's updates will be placed into a central original SR queue? The queue is drained automatically on a merge or IW.flush? What happens when we want the IR deletes to be searchable without flushing to disk? Do a reopen/clone?

This is why I think all changes must be done through IW if you've opened a reader from it. In fact, with the addition of realtime search to Lucene, if we also add updating norms/column-stride fields to IW, can't we move away from allowing any changes via IR? (Ie deprecate deleteDocuments/setNorms/etc.)

>> It's not necessary for IW to write new .del files when it materializes deletes.

> Good point, DocumentsWriter.applyDeletes shouldn't be flushing to disk and this sounds like a test case to add to TestIndexWriterReader.

Well, if IW has no persistent reader to hold the deletes, it must keep doing what it does now (flush immediately to disk)?

>> IW.reopenInternalReader only does a clone, not a reopen; however, does it cover the newly flushed segment?

> The segmentinfos is obtained from the Writer. In the test case testIndexWriterReopenSegment it looks like using clone reopens the new segments.

Wait, where is this test? Maybe you need to svn add it?

And, clone should not be reopening segments...?

>> I think it's better if no deletes appear, ever, until you reopen your reader. Maybe we simply prevent deletion through the IR?

> Preventing deletion through the IR would seem to defeat the purpose of the patch unless there's some alternative mechanism for deleting by doc id?

See above.

>> commitMergedDeletes to decouple computing the new BitVector from writing the .del file to disk.

> A hidden method I never noticed. I'll keep it in mind.

It's actually very important. This is how IW allows deletes to materialize to docIDs, while a merge is running – any newly materialized deletes against the just-merged segments are coalesced and carried over to the newly created segment. Any further deletes must be done against the docIDs in the new segment (which is why I don't see how we can allow deletes by docID to happen against a checked out reader).
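Conceptually, the carry-over amounts to remapping any deletes that arrived during the merge through a per-segment docID map. A hedged illustration (all names here are hypothetical, not the actual IndexWriter internals):

// Hypothetical names throughout: docMaps, duringMergeDeletes, mergedDeletes.
// For each segment that went into the merge, remap deletes that materialized
// while the merge ran into the merged segment's docID space.
for (int seg = 0; seg < docMaps.length; seg++) {
  BitVector deletes = duringMergeDeletes[seg];
  for (int oldDoc = 0; oldDoc < deletes.size(); oldDoc++) {
    if (deletes.get(oldDoc)) {
      int newDoc = docMaps[seg][oldDoc];   // -1 if the doc was already deleted before the merge
      if (newDoc != -1) {
        mergedDeletes.set(newDoc);         // further deletes must target these new docIDs
      }
    }
  }
}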

>> It seems like reader.reopen() (where reader was obtained with IW.getReader()) doesn't do the right thing? (ie it's looking for the most recent segments_N in the Directory, but it should be looking for it @ IW.segmentInfos).

> Using the reopen method implementation for a Reader with IW does not seem necessary. It seems like it could call clone underneath?

Well, clone should be very different from reopen. It seems like calling reader.reopen() (on reader obtained from writer) should basically do the same thing as calling writer.getReader(). Ie they are nearly synonyms? (Except for small difference in ref counting – I think writer.getReader() should always incRef, but reopen only incRefs if it returns a new reader).
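In code, that near-synonym relationship (with the ref-counting difference) would look roughly like this sketch:

IndexReader r = writer.getReader();   // always incRefs: the caller owns a reference
// ... search ...
IndexReader r2 = r.reopen();          // incRefs only if it actually returns a new reader
if (r2 != r) {
  r.close();                          // release the old reference
  r = r2;                             // r now reflects writer changes up to the reopen
}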

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Added the test case to the patch.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

The path forward seems to be exposing a cloned readonly reader from IW.getReader. This would be easier than jumping through hoops to track segment genealogy (at least for now ☺).

> can't we move away from allowing any changes via IR? (Ie deprecate deleteDocuments/setNorms/etc.)

This would simplify things; however, as a thought experiment, how would setNorms work if it were a part of IndexWriter?

> And, clone should not be reopening segments...?

DirectoryIndexReader.clone(boolean openReadonly) calls doReopen(SegmentInfos infos, boolean doClone, boolean openReadOnly), an abstract method that in SegmentReader and MultiSegmentReader reopens the segments? The SegmentInfos for a ReaderIW is obtained from the IW, which is how it knows about the new segments. Perhaps not desired behavior?

> do we need delete by docID once we have realtime search? I think the last compelling reason to keep IR's delete by docID was immediacy, but realtime search can give us that, from IW, even when deleting by Term or Query?

Good point! I think we may want to support it, but for now it shouldn't be necessary. I'm thinking of the case where someone is using the field cache (or some variant), performs some sort of query on it, and then needs to delete based on doc id. What do they do? Would we expose a callback mechanism where a deleteFrom(IndexReader ir) method is exposed and deletes occur at a time of the IW's choosing?

> It seems like calling reader.reopen() (on reader obtained from writer) should basically do the same thing as calling writer.getReader(). Ie they are nearly synonyms? (Except for small difference in ref counting - I think writer.getReader() should always incRef, but reopen only incRefs if it returns a new reader).

Perhaps ReaderIW.reopen will call IW.getReader underneath instead of using IR's usual mechanism.

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

This latest patch doesn't seem to apply.

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Is there ever a need for the normal IR construction anymore? Or do we always just ask for it from the IW (or wherever we choose to expose this, as I still don't think it belongs on the IW API wise, but that isn't a big deal right now) every time? I suppose if I know I'm not going to be changing my index, I can still just get a read-only IR, right?

API wise, I think we could do something like (with obvious other variations):

IndexAccessor {
  IndexWriter getWriter(Directory);

  // returns a read-only reader
  IndexReader getReader(Directory);

  // returns the external IR described above
  IndexReader getReader(IndexWriter);
}

Then, everyone has a single point of entry for both writers and readers, all of this stuff can just be done through package-private methods on the IW, it allows us to change things if we decide otherwise, and it means that the IW is not publicly coupled with the IR.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Jason, I think you need to "svn up". Or, tell us which revision you're on and we can downgrade to that revision before applying the patch. (We need "svn patch"!).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> The path forward seems to be exposing a cloned readonly reader from IW.getReader.

+1

>> can't we move away from allowing any changes via IR? (Ie deprecate deleteDocuments/setNorms/etc.)

> This would simplify things; however, as a thought experiment, how would setNorms work if it were a part of IndexWriter?

I think it'd look like this?

IndexWriter.setNorm(Term term, String field, byte norm)

Ie the Term IDs the doc(s) you want to set the norm for.

>> And, clone should not be reopening segments...?

> DirectoryIndexReader.clone(boolean openReadonly) calls doReopen(SegmentInfos infos, boolean doClone, boolean openReadOnly), an abstract method that in SegmentReader and MultiSegmentReader reopens the segments? The SegmentInfos for a ReaderIW is obtained from the IW, which is how it knows about the new segments. Perhaps not desired behavior?

OK, I think it does not reopen existing segments. Meaning, if a segment is in common w/ old and new, it truly clones it (does not reopen norms nor del). But if there is a new segment that did not exist in old, it opens a whole new segment reader? I'll commit an assert that this doesn't happen – if caller passes in "doClone=true" then caller should not have passed in a segmentInfos with changes? Else the reader is on thin ice (mismatch what's in RAM vs what SegmentInfo says).

>> do we need delete by docID once we have realtime search? I think the last compelling reason to keep IR's delete by docID was immediacy, but realtime search can give us that, from IW, even when deleting by Term or Query?

> Good point! I think we may want to support it, but for now it shouldn't be necessary. I'm thinking of the case where someone is using the field cache (or some variant), performs some sort of query on it, and then needs to delete based on doc id. What do they do? Would we expose a callback mechanism where a deleteFrom(IndexReader ir) method is exposed and deletes occur at a time of the IW's choosing?

Wouldn't delete-by-Query cover this? Ie one could always make a Filter implementing the "look @ field cache, do some logic, provide docIDs to delete", wrap as Query, then delete-by-Query?
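A hedged sketch of that pattern (the "price" logic is a made-up example; Filter, ConstantScoreQuery, and IndexWriter.deleteDocuments(Query) are existing APIs of that era):

// Hypothetical app logic: delete every doc whose cached "price" field is negative.
Filter toDelete = new Filter() {
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    int[] prices = FieldCache.DEFAULT.getInts(reader, "price");
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    for (int doc = 0; doc < prices.length; doc++) {
      if (prices[doc] < 0) {
        bits.set(doc);   // app-specific "do some logic" step
      }
    }
    return bits;         // OpenBitSet is a DocIdSet
  }
};
writer.deleteDocuments(new ConstantScoreQuery(toDelete));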

>> It seems like calling reader.reopen() (on reader obtained from writer) should basically do the same thing as calling writer.getReader(). Ie they are nearly synonyms? (Except for small difference in ref counting - I think writer.getReader() should always incRef, but reopen only incRefs if it returns a new reader).

> Perhaps ReaderIW.reopen will call IW.getReader underneath instead of using IR's usual mechanism.

Right, that's what I'm thinking. Once you've obtained reader coupled to a writer, you can then simply reopen it whenever you want to see (materialize) changes done by the writer.

We still need a solution for the "warm the just-merged segment" problem... else we will not be realtime, especially when a big merge finishes. It seems like after a merge finishes, it should immediately 1) open a SegmentReader on the new segment, 2) invoke the method you passed in (or subclassed – not sure which), 3) carry over deletes that materialized during the merge, and 4) commit the merge (replace the old segments w/ the new one).
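A sketch of what such a warming hook could look like; this is roughly the shape the issue eventually settled on (a warmer callback on IW), though the exact API was still open at this point:

// Warm the just-merged segment before it is made visible to getReader.
writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
  public void warm(IndexReader newSegment) throws IOException {
    // App-specific warming: run a representative query so FieldCache,
    // norms, etc. are loaded before real traffic hits this segment.
    IndexSearcher searcher = new IndexSearcher(newSegment);
    searcher.search(new TermQuery(new Term("warm", "up")), 10);
    searcher.close();   // does not close the reader it was constructed from
  }
});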

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I suppose if I know I'm not going to be changing my index, I can still just get a read-only IR, right?

Right, I think we still want to allow opening a standalone (uncoupled) reader.

> Then, everyone has a single point of entry for both writers and readers, all of this stuff can just be done through package-private methods on the IW, it allows us to change things if we decide otherwise, and it means that the IW is not publicly coupled with the IR.

I'm torn... the IndexAccessor would need to expose many variants to carry over all the options we now have (String or File or Directory, IndexCommit or not, IndexDeletionPolicy or not, create or not). It will end up exposing a number of new methods... and, would it try to be "smart" (like IndexModifier, and the LuceneIndexAccessor class in LUCENE-390), keeping track of references to the readers it's handed out, etc.? Or is it simply a pass-through to the underlying open/ctors we have today?

The alternative (as of right now, unless we are missing something further with these changes) is adding one method to IndexWriter, getReader, that returns a readOnly IndexReader, "coupled" to the writer you got it from in that it's able to search un-committed changes and if you reopen it, writer will materialize all changes and make them visible to the reopened reader.

I guess so far I don't really see why this small (one method) API change merits a switch to a whole new accessor API for creating readers & writers on an index? Maybe there is a straw-that-breaks-the-camel's-back argument that I'm missing...

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Good points, Mike, but maybe we don't need all those variants? String, File and Directory are all easily enough collapsed down to just Directory.

new IndexWriter(new Directory(indexFile));

Additionally, there are no more variants than there already are on the IW and IR, right?

As for pass-through or not, I think it would just pass-through, at least initially, but it certainly leaves open the possibility for reference counting in the future if someone wants to implement that.

As someone who teaches people these APIs on a regular basis, I feel pretty confident in saying that adding an IR to the IW as a public API is going to confuse a good chunk of people, just as the delete stuff on the IR currently does. You wouldn't ask FileWriter for a FileReader, would you? I don't see why it would be good to ask an IW for an IR, API-wise (I get why we are doing this, it makes sense).

Likewise, isn't it just as logical to ask for an IW from an IR? If I have an IR already and I want it to be aware of the writes I want to do, why wouldn't we then add IR.getIW()? And then we'd have a total circular dependency.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> maybe we don't need all those variants? String, File and Directory are all easily enough collapsed down to just Directory.
>
> new IndexWriter(new Directory(indexFile));

(You'd presumably need to close that Directory.) But, yeah, we may be able to drop some of them, although I do think they are convenient for new users of Lucene. And forcing users to switch to a totally new yet pass-through API on upgrade, but not giving them a one-to-one carryover, is not very nice.

> Additionally, there are no more variants than there already are on the IW and IR, right?

Right, I'm just saying IndexAccessor will have many methods. And then you're asking every app to make this switch, on upgrade. It's a lot of API swapping/noise vs a single added expert method to IW.

> As for pass-through or not, I think it would just pass-through, at least initially, but it certainly leaves open the possibility for reference counting in the future if someone wants to implement that.

If we think it'll be more than just pass through, we should try to hash out, somewhat, what it will & won't do up front (changing it later is a big change)? And we should start from #1468.

> As someone who teaches people these APIs on a regular basis, I feel pretty confident in saying that adding an IR to the IW as a public API is going to confuse a good chunk of people, just as the delete stuff on the IR currently does.

But this will be an expert/advanced API, a single added method to IW. I wouldn't expect users to be confused: on upgrade I think most users will not even notice its existence!

> You wouldn't ask FileWriter for a FileReader, would you?

I'm not sure that's the right comparison – Lucene's IW does far more than a FileWriter. And the fact that Lucene allows "point in time" searching (which is very useful and rather unique) is a very big difference vs FileReader/Writer.

> Likewise, isn't it just as logical to ask for an IW from an IR?

I don't think so: the functionality is not symmetric, because Lucene allows only one writer open at a time, but many readers (eg on different commits). Since a writer is the one making changes, it makes sense that you'd ask it, right now, to give you a reader reflecting all changes up to that point. And call it again later to get a reader seeing changes after that, etc.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Or, here's an idea: can we do both? Put IndexAccessor in as an optional "convenience" layer that simplifies the ctors and expert methods of IW & IR, but leave public direct access to the ctors & expert methods? This way, on upgrade, nobody is forced to migrate to an entirely new yet simply pass-through API.

Or another idea is to decouple these two discussions: go ahead and add the single expert method to IW, but as a separate discussion/JIRA work out how we can simplify overall access/management of IR/IW instances?

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Ah yes, a patch from the old directory that needed deleting. Here's the correct one. Sorry about that.

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

> Right, I'm just saying IndexAccessor will have many methods. And then you're asking every app to make this switch, on upgrade. It's a lot of API swapping/noise vs a single added expert method to IW.

Sure, but that is already the case w/ IW/IR anyway.

I agree about the short term noise, but in the long run it seems cleaner.

> But this will be an expert/advanced API, a single added method to IW. I wouldn't expect users to be confused: on upgrade I think most users will not even notice its existence!

Hmm, I don't agree, but I guess it depends on the performance hit. If given a choice between the semantics of a reader that sees changes as they are made versus having to do the whole reopen thing, I'm betting most users will say "duh, I want to see my changes right away" and choose the IR that is synced w/ the IW, b/c that is what people think is the logical thing to happen and it is how DBs work, which many devs are used to. As an app developer, if I don't have to think about IR lifecycle management, why would I want to, as long as it performs? What this patch is offering, AFAICT, is the removal of IR lifecycle management from the user.

In other words, my guess is that over time, as the performance proves out, it will be the default choice, not "expert". Now, if you're telling me this is going to be significantly slower even when updates are rare, then maybe I would stick to the current lifecycle, but if there isn't much difference, I'll take the one that pushes the lifecycle complexity down into Lucene instead of in my app.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> a reader that sees changes as they are made versus having to do the whole reopen thing

It's hard keeping up with the current proposal in big issues/threads, but I don't think anyone is proposing a reader that automatically sees changes... i.e. the view of an IndexReader instance will still be fixed.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> It's hard keeping up with the current proposal in big issues/threads, but I don't think anyone is proposing a reader that automatically sees changes... i.e. the view of an IndexReader instance will still be fixed.

That's right. The current proposal is to add one method to IW:

IndexReader getReader()

that returns a point-in-time view of the index plus all changes buffered in IW up until that point. Then you can reopen that reader (or call getReader() again, which does the same thing) to quickly get a new point-in-time reader.

I think the point-in-time semantics is important to keep.

Also, you can't easily emulate point-in-time semantics if we implement the "live" approach, but you can easily do the reverse (assuming we can keep reopen() time fast enough).

EG the IndexAccessor convenience layer could do automatic reopening so that when you ask it for the reader it always reopens it; this would emulate "live updates" and hide the lifecycle management.
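A hedged sketch of that emulation (IndexAccessor here is hypothetical, not an existing Lucene class):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

// Hypothetical convenience layer: emulates "live" semantics by reopening on
// every access. The accessor owns the reader; callers must not close it.
class IndexAccessor {
  private final IndexWriter writer;
  private IndexReader current;

  IndexAccessor(IndexWriter writer) throws IOException {
    this.writer = writer;
    this.current = writer.getReader();
  }

  synchronized IndexReader getReader() throws IOException {
    IndexReader r = current.reopen();   // cheap no-op if nothing has changed
    if (r != current) {
      current.close();                  // release the stale reader
      current = r;
    }
    return current;
  }
}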

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I guess it depends on the performance hit.

It's challenging to implement truly live updates w/ decent performance: I think we'd need to build a reader impl that can search the DocumentsWriter buffer.

Whereas the approach (patch) here is actually quite simple (all the hard work was already done – IndexReader.reopen, collection/sorting/filtering by segment, etc.).

> In other words, my guess is that over time, as the performance proves out, it will be the default choice, not "expert".

I agree: realtime search will likely be a popular feature once we finish it, release it, it proves stable, performant, etc. Eventually (maybe soon) it should be made the default.

I think IndexAccessor makes a lot of sense, but it's a big change and I'd rather not couple it to this issue. There are many questions to be hashed out (under a new issue): is it a simple pass-through? Or does it manage the lifecycle of the readers for you? Does it warm new readers? Should it emulate "live" update semantics? Should getReader get it from the writer if there is one (ie, making realtime search the "default")? Etc.

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

OK, I agree. Let's just mark it as expert/subject to revision and then we're good.

We can revisit IndexAccessor separately.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

There are concurrency issues to work out.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Caused by: java.io.IOException: read past EOF
    [junit]     at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:135)
    [junit]     at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:92)
    [junit]     at org.apache.lucene.store.IndexOutput.copyBytes(IndexOutput.java:172)
    [junit]     at org.apache.lucene.index.TermVectorsWriter.addRawDocuments(TermVectorsWriter.java:185)
    [junit]     at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:447)
    [junit]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:145)
    [junit]     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4823)
    [junit]     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4408)
    [junit]     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:218)
    [junit]     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:274)
asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think your latest patch is missing the new unit test source.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Woops, sorry – I downloaded the wrong patch.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we can simplify the approach here:

Other things:

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

> make a new SegmentReaderPool private class

I'd prefer the SegmentReaderPool model over the patch's existing one, as it is simpler and closer to how the underlying system actually works, meaning it works directly with the segments in a systematized way.

> We would never mergeSegmentInfos, never ask the "internal reader" to commit

Good, merging the segmentInfos is confusing and tricky to debug.

> Since IndexReader will always be readonly, you can simplify the new DirectoryIndexReader.open method

+1

> why do we need merge.segmentReadersClone?

I was modeling after the segmentInfosClone. If it's not necessary I'll happily remove it.

> I think the only reason for reader to hold a reference to writer is so that on reopen, the reader realizes it was created from IW.getReader and simply forwards the request to IW

+1

> Wouldn't delete-by-Query cover this? Ie one could always make a Filter implementing the "look @ field cache, do some logic, provide docIDs to delete", wrap as Query, then delete-by-Query? (from a previous comment)

Yes that will work.

How are these sample SegmentReaderPool method signatures?

private class SegmentReaderPool {
  // returns a pooled SegmentReader for si, incRef'd for the caller
  public SegmentReader get(SegmentInfo si) {}

  // releases a reader previously obtained from get()
  public void decRef(SegmentReader sr) {}
}
asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

In SegmentInfo, its hashCode only takes into account the name and directory; why not the delete generation?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> How are these sample SegmentReaderPool method signatures?

Looks good for starters. decRef would simply pass through to the SR's normal decRef, but would then also decRef its own copy once the ref count has dropped to 1?
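i.e., something like this sketch (the pool's internal collection is assumed):

// Sketch of SegmentReaderPool.decRef under the semantics above.
public synchronized void decRef(SegmentReader sr) throws IOException {
  sr.decRef();                  // pass through to the reader's normal decRef
  if (sr.getRefCount() == 1) {  // only the pool's own reference remains
    pooled.remove(sr);          // 'pooled' is the pool's assumed internal collection
    sr.decRef();                // drop the pool's reference; the reader closes
  }
}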

> In SegmentInfo, its hashCode only takes into account the name and directory; why not the delete generation?

Two SegmentInfo instances are considered equal if the dir is == and the name is .equals, so hashCode must match that (be the same value when equals returns true).
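i.e., roughly (illustrative, not the exact SegmentInfo source):

// equals uses only dir (by identity) and name, so hashCode may use only those too;
// including the delete generation would break "equal objects have equal hash codes".
public boolean equals(Object obj) {
  if (!(obj instanceof SegmentInfo)) return false;
  SegmentInfo other = (SegmentInfo) obj;
  return other.dir == dir && other.name.equals(name);
}

public int hashCode() {
  return dir.hashCode() + name.hashCode();
}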

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Jason could you rebase the patch on current trunk? (Or post the revision you're on?).

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

We'll need a method to check the SegmentReaderPool for readers that may be removed. It seems it would check IW.segmentInfos, pending merges, and current merges for SegmentInfo(s) that are in use; all others in the pool would then be expunged. I'm not sure if this method would need to run in a background thread.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Looks good!

For SegmentReaderPool:

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

> I think we should change commitMergedDeletes to use the already computed doc ID maps, instead of loading the old deletes.

Where do I get the already computed doc ID maps?

> I think getReader does need to flush, else added docs won't be seen in the reopened reader. You should call flush(true, true, true) - because you want to triggerMerges, flushDocStores, and flushDeletes.

I added a flushDeletesToDir flag that defaults to true except for IW.getReader.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Found the doc ID map in the SegmentMerger in IW.commitMerge.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

- uses the docIdMap from the commitMerge SegmentMerger to map new deletes to the newly merged segment

- applyDeletesImmediately

- solved the decRef issue which appears in TestIndexWriterReader.testDeleteFromIndexWriter; it doesn't show up in the debugger:
[junit] java.lang.AssertionError
    [junit]     at org.apache.lucene.index.SegmentReader$Ref.incRef(SegmentReader.java:103)
    [junit]     at org.apache.lucene.index.SegmentReader.incRef(SegmentReader.java:341)
    [junit]     at org.apache.lucene.index.IndexWriter$SegmentReaderPool.get(IndexWriter.java:647)
    [junit]     at org.apache.lucene.index.IndexWriter$SegmentReaderPool.getReadOnlyClone(IndexWriter.java:615)
    [junit]     at org.apache.lucene.index.MultiSegmentReader.<init>(MultiSegmentReader.java:86)
    [junit]     at org.apache.lucene.index.ReadOnlyMultiSegmentReader.<init>(ReadOnlyMultiSegmentReader.java:35)
    [junit]     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:396)
    [junit]     at org.apache.lucene.index.TestIndexWriterReader.testDeleteFromIndexWriter(TestIndexWriterReader.java:159)
    [junit]     at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jeremy Volkman (migrated from JIRA)

I noticed the comments about IW.getReader() flushing the current state of the writer, launching merges, etc. Does this mean that the point of interaction between index writes and reads will still be the Directory? (i.e. the disk)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Does this mean that the point of interaction between index writes and reads will still be the Directory? (i.e. the disk)

For added docs, yes. But deletions will carry straight through in memory.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

[junit] Caused by: java.lang.AssertionError
    [junit]     at org.apache.lucene.index.IndexFileDeleter$RefCount.DecRef(IndexFileDeleter.java:553)
    [junit]     at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:470)
    [junit]     at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:481)
    [junit]     at org.apache.lucene.index.IndexWriter.decrefMergeSegments(IndexWriter.java:4485)
    [junit]     at org.apache.lucene.index.IndexWriter.commitMerge(IndexWriter.java:4473)
    [junit]     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4910)
    [junit]     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4539)
    [junit]     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:218)
    [junit]     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:274)
[junit] java.lang.AssertionError
    [junit]     at org.apache.lucene.index.IndexWriter.finishMerges(IndexWriter.java:3350)
    [junit]     at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2090)
    [junit]     at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2045)
    [junit]     at org.apache.lucene.index.TestConcurrentMergeScheduler.testNoWaitClose(TestConcurrentMergeScheduler.java:211)
    [junit]     at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)