apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Integrate IndexReader with IndexWriter [LUCENE-1516] #2590

Closed asfimport closed 15 years ago

asfimport commented 15 years ago

The current problem is that an IndexReader and an IndexWriter cannot be open at the same time and both perform updates, as each requires the write lock on the index. While methods such as IW.deleteDocuments enable deleting from the IW, methods such as IR.deleteDocument(int doc) and norms updating are not available from the IW. This limits the ability to update the index dynamically or in realtime without closing the IW and opening an IR, deleting or updating norms, flushing, then opening the IW again, a process which can be detrimental to realtime updates.

This patch will expose an IndexWriter.getReader method that returns the currently flushed state of the index as a class that implements IndexReader. The new IR implementation will differ from existing IR implementations such as MultiSegmentReader in that flushing will synchronize updates with the IW, in part by sharing the write lock. All methods of IR will be usable, including reopen and clone.
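For orientation, a minimal sketch of how the proposed API could be used (assumes Lucene 2.x-era constructors; the getReader method name follows the proposal above, and exact signatures were still in flux):

IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
writer.addDocument(doc);                   // buffered in the writer; nothing committed yet

IndexReader reader = writer.getReader();   // proposed: point-in-time view including buffered changes
IndexSearcher searcher = new IndexSearcher(reader);
// ... searches here see the uncommitted document ...
searcher.close();
reader.close();                            // the caller owns a reference and must release it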

Migrated from LUCENE-1516 by Jason Rutherglen, resolved Apr 09 2009. Attachments: LUCENE-1516.patch (versions: 28), magnetic.png, ssd.png, ssd2.png

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

It looks like DirectoryIndexReader needs an IndexWriter mode, as unfortunately subclassing won't work. In this context, overriding a method implies the IW mode is in use.

I assume we'll share the segmentInfos object from IW rather than share a clone with the IR?

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

It still seems to me that we should think about what objects need to be shared and how they can be produced/instantiated appropriately instead of adding a Reader onto the Writer, which, IMO, pollutes the Writer API. I know it complicates things, but I think it will be less confusing to users of the API.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

This patch includes the latest patch from #2391 and is unfortunately hard to read because of that. Is there a way to create a patch minus another patch?

The IW internal reader is always the result of the latest external IR.reopen. In this way deletes made in the IW and from the external IR remain in sync. The latest IR always holds the write lock anyway, per IR.clone semantics.

DocumentsWriter.applyDeletes reuses the reader in IW rather than calling SegmentReader.get/.deleteDocument.

In IW.doFlush the internal reader is reopened with the new SegmentInfos.

Included a very basic test case.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

This is an extremely early patch and has a bunch of System.outs for debugging.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Some terminology for this patch: an internal reader is the IndexWriter's reader; an external reader is one given out via the IW.reopenReader method.

If DirectoryIndexReader has a writer, no lock is acquired on updates. IR.clone normally passes the writeLock to the new reader; however, the external IR and the IW's internal IR both need to hold the write lock. For this reason the user must be careful when flushing to ensure the proper instance of the IR's deletes is merged with the writer.

The external IR.flush does not flush the deletes to disk; instead it merges them with the IW's internal IR, which is in RAM. IW.flush causes deletes and new segments to be flushed to the directory.

The test cases from TestIndexWriterReader testIndexWriterDeletes and testIndexWriterReopenSegment fail when the IW is opened again after commit and close. The index files are being deleted during IW.commit. I traced this to IW.finishCommit -> deleter.checkpoint -> deleter.deletePendingFiles.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Added TestIndexWriterReader.testIndexWriterDeletes, testIndexWriterDeletesOptimize, testIndexWriterReopenSegment, testIndexWriterReopenSegmentOptimize. The optimize variants fail while the non-optimize ones pass. The optimize variants delete _0.fdt and _0.fdx and so fail when the writer is created again, because it cannot find those files. It could be a segment infos merging problem or something else.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

The previously mentioned issue in the test case is fixed. Perhaps IW.reopenInternalReader() should be called in IW.checkpoint?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Looks good, Jason. This is a big change, and I expect to go through a number of iterations before settling... plus we still need to figure out how the API is exposed. Comments:

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Mike, good points...

> since you can't predict when IW materializes deletes, your reader will suddenly see a bunch of deletes appear.

The reader would need to be reopened to see the deletes. Isn't that expected behavior?

> Instead of having to merge readers, I think we need a single source to obtain an SR from

I like this; however, how would IR.clone work? I like having the internal reader separate from the external reader. The main reason to expose IR from IW is to allow delete by doc id and norms updates (and eventually column stride fields updates). I don't see how we can grab a reader during a merge and block realtime deletes occurring on the external reader. However, it is difficult to reconcile deletes to an external SR that's been merged away.

It seems like we're getting closer to using a unique long UID for each doc that is carried over between merges. I was going to implement this on top of LUCENE-1516; however, we may want to make UIDs a part of LUCENE-1516 to implement the behavior we're discussing.

If the updates to SR are queued, then it seems like the only way to achieve this is a doc UID. This way merges can happen in the background, and the IR has a mechanism for mapping its queue to the newly merged segments when flushed. Hopefully we aren't wreaking havoc with the IndexReader API?

The scenario I think we're missing is if there are multiple cloned SRs out there. With the IW check-out-an-SR model, how do we allow cloning? A clone's updates will be placed into a central original SR queue? The queue is drained automatically on a merge or IW.flush? What happens when we want the IR deletes to be searchable without flushing to disk? Do a reopen/clone?

> number of iterations before settling

Agreed, if it were simple it wouldn't be fun. ☺

> It's not necessary for IW to write new .del files when it materializes deletes.

Good point, DocumentsWriter.applyDeletes shouldn't be flushing to disk and this sounds like a test case to add to TestIndexWriterReader.

> IW.reopenInternalReader only does a clone, not a reopen; however, does it cover the newly flushed segment?

The segmentinfos is obtained from the Writer. In the test case testIndexWriterReopenSegment it looks like using clone reopens the new segments.

> I think it's better if no deletes appear, ever, until you reopen your reader. Maybe we simply prevent deletion through the IR?

Preventing deletion through the IR would seem to defeat the purpose of the patch unless there's some alternative mechanism for deleting by doc id?

> commitMergedDeletes to decouple computing the new BitVector from writing the .del file to disk.

A hidden method I never noticed. I'll keep it in mind.

> It seems like reader.reopen() (where reader was obtained with IW.getReader()) doesn't do the right thing? (ie it's looking for the most recent segments_N in the Directory, but it should be looking for it @ IW.segmentInfos).

Using the reopen method implementation for a Reader with IW does not seem necessary. It seems like it could call clone underneath?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

>> since you can't predict when IW materializes deletes, your reader will suddenly see a bunch of deletes appear.

> The reader would need to be reopened to see the deletes. Isn't that expected behavior?

Ahh right, so long as we keep internal (private) clone, materializing the deletes won't affect the external reader.

>> Instead of having to merge readers, I think we need a single source to obtain an SR from

> I like this; however, how would IR.clone work?

It should work fine? The single source would only be used internally by IW (for merging, for materializing deletes, for the internal reader).

> I like having the internal reader separate from the external reader.

I think we should keep that separation.

> The main reason to expose IR from IW is to allow delete by doc id and norms updates (and eventually column stride fields updates). I don't see how we can grab a reader during a merge and block realtime deletes occurring on the external reader. However, it is difficult to reconcile deletes to an external SR that's been merged away.
>
> It seems like we're getting closer to using a unique long UID for each doc that is carried over between merges. I was going to implement this on top of LUCENE-1516; however, we may want to make UIDs a part of LUCENE-1516 to implement the behavior we're discussing.
>
> If the updates to SR are queued, then it seems like the only way to achieve this is a doc UID. This way merges can happen in the background, and the IR has a mechanism for mapping its queue to the newly merged segments when flushed. Hopefully we aren't wreaking havoc with the IndexReader API?

But... do we need delete by docID once we have realtime search? I think the last compelling reason to keep IR's delete by docID was immediacy, but realtime search can give us that, from IW, even when deleting by Term or Query?

(Your app can always add that long UID if it doesn't already have something usable).

docIDs are free to change inside IW. I don't see how we can hand out a reader, allow deletes by docID to it, and merge those deletes back in at a later time, unless we track the genealogy of the segments?

> The scenario I think we're missing is if there are multiple cloned SRs out there. With the IW check-out-an-SR model, how do we allow cloning? A clone's updates will be placed into a central original SR queue? The queue is drained automatically on a merge or IW.flush? What happens when we want the IR deletes to be searchable without flushing to disk? Do a reopen/clone?

This is why I think all changes must be done through IW if you've opened a reader from it. In fact, with the addition of realtime search to Lucene, if we also add updating norms/column-stride fields to IW, can't we move away from allowing any changes via IR? (Ie deprecate deleteDocuments/setNorms/etc.)

>> It's not necessary for IW to write new .del files when it materializes deletes.

> Good point, DocumentsWriter.applyDeletes shouldn't be flushing to disk and this sounds like a test case to add to TestIndexWriterReader.

Well, if IW has no persistent reader to hold the deletes, it must keep doing what it does now (flush immediately to disk)?

>> IW.reopenInternalReader only does a clone, not a reopen; however, does it cover the newly flushed segment?

> The segmentinfos is obtained from the Writer. In the test case testIndexWriterReopenSegment it looks like using clone reopens the new segments.

Wait, where is this test? Maybe you need to svn add it?

And, clone should not be reopening segments...?

>> I think it's better if no deletes appear, ever, until you reopen your reader. Maybe we simply prevent deletion through the IR?

> Preventing deletion through the IR would seem to defeat the purpose of the patch unless there's some alternative mechanism for deleting by doc id?

See above.

>> commitMergedDeletes to decouple computing the new BitVector from writing the .del file to disk.

> A hidden method I never noticed. I'll keep it in mind.

It's actually very important. This is how IW allows deletes to materialize to docIDs, while a merge is running – any newly materialized deletes against the just-merged segments are coalesced and carried over to the newly created segment. Any further deletes must be done against the docIDs in the new segment (which is why I don't see how we can allow deletes by docID to happen against a checked out reader).
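Conceptually, the carry-over amounts to remapping any deletes that arrived during the merge through a per-segment docID map. A hedged illustration (all names here are hypothetical, not the actual IndexWriter internals):

// Hypothetical names throughout: docMaps, duringMergeDeletes, mergedDeletes.
// For each segment that went into the merge, remap deletes that materialized
// while the merge ran into the merged segment's docID space.
for (int seg = 0; seg < docMaps.length; seg++) {
  BitVector deletes = duringMergeDeletes[seg];
  for (int oldDoc = 0; oldDoc < deletes.size(); oldDoc++) {
    if (deletes.get(oldDoc)) {
      int newDoc = docMaps[seg][oldDoc];   // -1 if the doc was already deleted before the merge
      if (newDoc != -1) {
        mergedDeletes.set(newDoc);         // further deletes must target these new docIDs
      }
    }
  }
}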

>> It seems like reader.reopen() (where reader was obtained with IW.getReader()) doesn't do the right thing? (ie it's looking for the most recent segments_N in the Directory, but it should be looking for it @ IW.segmentInfos).

> Using the reopen method implementation for a Reader with IW does not seem necessary. It seems like it could call clone underneath?

Well, clone should be very different from reopen. It seems like calling reader.reopen() (on reader obtained from writer) should basically do the same thing as calling writer.getReader(). Ie they are nearly synonyms? (Except for small difference in ref counting – I think writer.getReader() should always incRef, but reopen only incRefs if it returns a new reader).
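In code, that near-synonym relationship (with the ref-counting difference) would look roughly like this sketch:

IndexReader r = writer.getReader();   // always incRefs: the caller owns a reference
// ... search ...
IndexReader r2 = r.reopen();          // incRefs only if it actually returns a new reader
if (r2 != r) {
  r.close();                          // release the old reference
  r = r2;                             // r now reflects writer changes up to the reopen
}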

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Added the test case to the patch.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

The path forward seems to be exposing a cloned readonly reader from IW.getReader. This would be easier than jumping through hoops to track segment genealogy (at least for now ☺).

> can't we move away from allowing any changes via IR? (Ie deprecate deleteDocuments/setNorms/etc.)

This would simplify things; however, as a thought experiment, how would setNorms work if it were a part of IndexWriter?

> And, clone should not be reopening segments...?

DirectoryIndexReader.clone(boolean openReadonly) calls doReopen(SegmentInfos infos, boolean doClone, boolean openReadOnly), an abstract method that in SegmentReader and MultiSegmentReader reopens the segments? The SegmentInfos for a ReaderIW is obtained from the IW, which is how it knows about the new segments. Perhaps not desired behavior?

> do we need delete by docID once we have realtime search? I think the last compelling reason to keep IR's delete by docID was immediacy, but realtime search can give us that, from IW, even when deleting by Term or Query?

Good point! I think we may want to support it, but for now it shouldn't be necessary. I'm thinking of the case where someone is using the field cache (or some variant), performs some sort of query on it, and then needs to delete based on doc id. What do they do? Would we expose a callback mechanism where a deleteFrom(IndexReader ir) method is exposed and deletes occur at a time of the IW's choosing?

> It seems like calling reader.reopen() (on reader obtained from writer) should basically do the same thing as calling writer.getReader(). Ie they are nearly synonyms? (Except for small difference in ref counting - I think writer.getReader() should always incRef, but reopen only incRefs if it returns a new reader).

Perhaps ReaderIW.reopen will call IW.getReader underneath instead of using IR's usual mechanism.

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

This latest patch doesn't seem to apply.

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Is there ever a need for the normal IR construction anymore? Or do we always just ask for it from the IW (or wherever we choose to expose this, as I still don't think it belongs on the IW API wise, but that isn't a big deal right now) every time? I suppose if I know I'm not going to be changing my index, I can still just get a read-only IR, right?

API wise, I think we could do something like (with obvious other variations):

IndexAccessor {
  IndexWriter getWriter(Directory);

  // returns a read-only reader
  IndexReader getReader(Directory);

  // returns the external IR described above
  IndexReader getReader(IndexWriter);
}

Then, everyone has a single point of entry for both writers and readers, all of this stuff can just be done through package-private methods on the IW, it allows us to change things if we decide otherwise, and it means that the IW is not publicly coupled with the IR.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Jason, I think you need to "svn up". Or, tell us which revision you're on and we can downgrade to that revision before applying the patch. (We need "svn patch"!).

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> The path forward seems to be exposing a cloned readonly reader from IW.getReader.

+1

>> can't we move away from allowing any changes via IR? (Ie deprecate deleteDocuments/setNorms/etc.)

> This would simplify things; however, as a thought experiment, how would setNorms work if it were a part of IndexWriter?

I think it'd look like this?

IndexWriter.setNorm(Term term, String field, byte norm)

Ie the Term IDs the doc(s) you want to set the norm for.

>> And, clone should not be reopening segments...?

> DirectoryIndexReader.clone(boolean openReadonly) calls doReopen(SegmentInfos infos, boolean doClone, boolean openReadOnly), an abstract method that in SegmentReader and MultiSegmentReader reopens the segments? The SegmentInfos for a ReaderIW is obtained from the IW, which is how it knows about the new segments. Perhaps not desired behavior?

OK, I think it does not reopen existing segments. Meaning, if a segment is in common w/ old and new, it truly clones it (does not reopen norms nor del). But if there is a new segment that did not exist in old, it opens a whole new segment reader? I'll commit an assert that this doesn't happen – if caller passes in "doClone=true" then caller should not have passed in a segmentInfos with changes? Else the reader is on thin ice (mismatch what's in RAM vs what SegmentInfo says).

>> do we need delete by docID once we have realtime search? I think the last compelling reason to keep IR's delete by docID was immediacy, but realtime search can give us that, from IW, even when deleting by Term or Query?

> Good point! I think we may want to support it, but for now it shouldn't be necessary. I'm thinking of the case where someone is using the field cache (or some variant), performs some sort of query on it, and then needs to delete based on doc id. What do they do? Would we expose a callback mechanism where a deleteFrom(IndexReader ir) method is exposed and deletes occur at a time of the IW's choosing?

Wouldn't delete-by-Query cover this? Ie one could always make a Filter implementing the "look @ field cache, do some logic, provide docIDs to delete", wrap as Query, then delete-by-Query?
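A hedged sketch of that pattern (the "price" logic is a made-up example; Filter, ConstantScoreQuery, and IndexWriter.deleteDocuments(Query) are existing APIs of that era):

// Hypothetical app logic: delete every doc whose cached "price" field is negative.
Filter toDelete = new Filter() {
  public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
    int[] prices = FieldCache.DEFAULT.getInts(reader, "price");
    OpenBitSet bits = new OpenBitSet(reader.maxDoc());
    for (int doc = 0; doc < prices.length; doc++) {
      if (prices[doc] < 0) {
        bits.set(doc);   // app-specific "do some logic" step
      }
    }
    return bits;         // OpenBitSet is a DocIdSet
  }
};
writer.deleteDocuments(new ConstantScoreQuery(toDelete));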

>> It seems like calling reader.reopen() (on reader obtained from writer) should basically do the same thing as calling writer.getReader(). Ie they are nearly synonyms? (Except for small difference in ref counting - I think writer.getReader() should always incRef, but reopen only incRefs if it returns a new reader).

> Perhaps ReaderIW.reopen will call IW.getReader underneath instead of using IR's usual mechanism.

Right, that's what I'm thinking. Once you've obtained reader coupled to a writer, you can then simply reopen it whenever you want to see (materialize) changes done by the writer.

We still need a solution for the "warm the just-merged segment" problem... else we will not be realtime, especially when a big merge finishes. It seems like after a merge finishes, it should immediately 1) open a SegmentReader on the new segment, 2) invoke the method you passed in (or subclassed – not sure which), 3) carry over deletes that materialized during the merge, and 4) commit the merge (replace the old segments w/ the new one).
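A sketch of what such a warming hook could look like; this is roughly the shape the issue eventually settled on (a warmer callback on IW), though the exact API was still open at this point:

// Warm the just-merged segment before it is made visible to getReader.
writer.setMergedSegmentWarmer(new IndexWriter.IndexReaderWarmer() {
  public void warm(IndexReader newSegment) throws IOException {
    // App-specific warming: run a representative query so FieldCache,
    // norms, etc. are loaded before real traffic hits this segment.
    IndexSearcher searcher = new IndexSearcher(newSegment);
    searcher.search(new TermQuery(new Term("warm", "up")), 10);
    searcher.close();   // does not close the reader it was constructed from
  }
});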

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I suppose if I know I'm not going to be changing my index, I can still just get a read-only IR, right?

Right, I think we still want to allow opening a standalone (uncoupled) reader.

> Then, everyone has a single point of entry for both writers and readers, all of this stuff can just be done through package-private methods on the IW, it allows us to change things if we decide otherwise, and it means that the IW is not publicly coupled with the IR.

I'm torn... the IndexAccessor would need to expose many variants to carry over all the options we now have (String or File or Directory, IndexCommit or not, IndexDeletionPolicy or not, create or not). It will end up exposing a number of new methods... and, would it try to be "smart" (like IndexModifier, and the LuceneIndexAccessor class in LUCENE-390), keeping track of references to the readers it's handed out, etc.? Or is it simply a pass-through to the underlying open/ctors we have today?

The alternative (as of right now, unless we are missing something further with these changes) is adding one method to IndexWriter, getReader, that returns a readOnly IndexReader, "coupled" to the writer you got it from in that it's able to search un-committed changes and if you reopen it, writer will materialize all changes and make them visible to the reopened reader.

I guess so far I don't really see why this small (one method) API change merits a switch to a whole new accessor API for creating readers & writers on an index? Maybe there is a straw-that-breaks-the-camel's-back argument that I'm missing...

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Good points, Mike, but maybe we don't need all those variants? String, File and Directory are all easily enough collapsed down to just Directory.

new IndexWriter(new Directory(indexFile));

Additionally, there are no more variants than there already are on the IW and IR, right?

As for pass-through or not, I think it would just pass-through, at least initially, but it certainly leaves open the possibility for reference counting in the future if someone wants to implement that.

As someone who teaches people these APIs on a regular basis, I feel pretty confident in saying that adding an IR to the IW as a public API is going to confuse a good chunk of people, just as the delete stuff on the IR currently does. You wouldn't ask FileWriter for a FileReader, would you? I don't see why it would be good to ask an IW for an IR, API-wise (I get why we are doing this, it makes sense).

Likewise, isn't it just as logical to ask for an IW from an IR? If I have an IR already and I want it to be aware of the writes I want to do, why wouldn't we then add IR.getIW()? And then we'd have a total circular dependency.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> maybe we don't need all those variants? String, File and Directory are all easily enough collapsed down to just Directory.
>
> new IndexWriter(new Directory(indexFile));

(You'd presumably need to close that Directory.) But, yeah, we may be able to drop some of them, although I do think they are convenient for new users of Lucene. And forcing users to switch to a totally new yet pass-through API on upgrade, but not giving them a one-to-one carryover, is not very nice.

> Additionally, there are no more variants than there already are on the IW and IR, right?

Right, I'm just saying IndexAccessor will have many methods. And then you're asking every app to make this switch, on upgrade. It's a lot of API swapping/noise vs a single added expert method to IW.

> As for pass-through or not, I think it would just pass-through, at least initially, but it certainly leaves open the possibility for reference counting in the future if someone wants to implement that.

If we think it'll be more than just pass through, we should try to hash out, somewhat, what it will & won't do up front (changing it later is a big change)? And we should start from #1468.

> As someone who teaches people these APIs on a regular basis, I feel pretty confident in saying that adding an IR to the IW as a public API is going to confuse a good chunk of people, just as the delete stuff on the IR currently does.

But this will be an expert/advanced API, a single added method to IW. I wouldn't expect users to be confused: on upgrade I think most users will not even notice its existence!

> You wouldn't ask FileWriter for a FileReader, would you?

I'm not sure that's the right comparison – Lucene's IW does far more than a FileWriter. And the fact that Lucene allows "point in time" searching (which is very useful and rather unique) is a very big difference vs FileReader/Writer.

> Likewise, isn't it just as logical to ask for an IW from an IR?

I don't think so: the functionality is not symmetric, because Lucene allows only one writer open at a time, but many readers (eg on different commits). Since a writer is the one making changes, it makes sense that you'd ask it, right now, to give you a reader reflecting all changes up to that point. And call it again later to get a reader seeing changes after that, etc.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Or, here's an idea: can we do both? Put IndexAccessor in as an optional "convenience" layer that simplifies the ctors and expert methods of IW & IR, but leave public direct access to the ctors & expert methods? This way, on upgrade, nobody is forced to migrate to an entirely new yet simply pass-through API.

Or another idea is to decouple these two discussions: go ahead and add the single expert method to IW, but as a separate discussion/JIRA work out how we can simplify overall access/management of IR/IW instances?

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Ah yes, a patch from the old directory that needed deleting. Here's the correct one. Sorry about that.

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

> Right, I'm just saying IndexAccessor will have many methods. And then you're asking every app to make this switch, on upgrade. It's a lot of API swapping/noise vs a single added expert method to IW.

Sure, but that is already the case w/ IW/IR anyway.

I agree about the short term noise, but in the long run it seems cleaner.

> But this will be an expert/advanced API, a single added method to IW. I wouldn't expect users to be confused: on upgrade I think most users will not even notice its existence!

Hmm, I don't agree, but I guess it depends on the performance hit. If given a choice between the semantics of a reader that sees changes as they are made versus having to do the whole reopen thing, I'm betting most users will say "duh, I want to see my changes right away" and choose the IR that is synced w/ the IW, b/c that is what people think is the logical thing to happen and it is how DBs work, which many devs are used to. As an app developer, if I don't have to think about IR lifecycle management, why would I want to, as long as it performs? What this patch is offering, AFAICT, is the removal of IR lifecycle management from the user.

In other words, my guess is that over time, as the performance proves out, it will be the default choice, not "expert". Now, if you're telling me this is going to be significantly slower even when updates are rare, then maybe I would stick to the current lifecycle, but if there isn't much difference, I'll take the one that pushes the lifecycle complexity down into Lucene instead of in my app.

asfimport commented 15 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> a reader that sees changes as they are made versus having to do the whole reopen thing

It's hard keeping up with the current proposal in big issues/threads, but I don't think anyone is proposing a reader that automatically sees changes... i.e. the view of an IndexReader instance will still be fixed.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> It's hard keeping up with the current proposal in big issues/threads, but I don't think anyone is proposing a reader that automatically sees changes... i.e. the view of an IndexReader instance will still be fixed.

That's right. The current proposal is to add one method to IW:

IndexReader getReader()

that returns a point-in-time view of the index plus all changes buffered in IW up until that point. Then you can reopen that reader (or call getReader() again, which does the same thing) to quickly get a new point-in-time reader.

I think the point-in-time semantics is important to keep.

Also, you can't easily emulate point-in-time semantics if we implement the "live" approach, but you can easily do the reverse (assuming we can keep reopen() time fast enough).

EG the IndexAccessor convenience layer could do automatic reopening so that when you ask it for the reader it always reopens it; this would emulate "live updates" and hide the lifecycle management.
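A hedged sketch of that emulation (IndexAccessor here is hypothetical, not an existing Lucene class):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

// Hypothetical convenience layer: emulates "live" semantics by reopening on
// every access. The accessor owns the reader; callers must not close it.
class IndexAccessor {
  private final IndexWriter writer;
  private IndexReader current;

  IndexAccessor(IndexWriter writer) throws IOException {
    this.writer = writer;
    this.current = writer.getReader();
  }

  synchronized IndexReader getReader() throws IOException {
    IndexReader r = current.reopen();   // cheap no-op if nothing has changed
    if (r != current) {
      current.close();                  // release the stale reader
      current = r;
    }
    return current;
  }
}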

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> I guess it depends on the performance hit.

It's challenging to implement truly live updates w/ decent performance: I think we'd need to build a reader impl that can search the DocumentsWriter buffer.

Whereas the approach (patch) here is actually quite simple (all the hard work was already done – IndexReader.reopen, collection/sorting/filtering by segment, etc.).

> In other words, my guess is that over time, as the performance proves out, it will be the default choice, not "expert".

I agree: realtime search will likely be a popular feature once we finish it, release it, it proves stable, performant, etc. Eventually (maybe soon) it should be made the default.

I think IndexAccessor makes a lot of sense, but it's a big change and I'd rather not couple it to this issue. There are many questions to be hashed out (under a new issue): is it a simple pass-through? Or does it manage the lifecycle of the readers for you? Does it warm new readers? Should it emulate "live" update semantics? Should getReader get it from the writer if there is one (ie, making realtime search the "default")? Etc.

asfimport commented 15 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

OK, I agree. Let's just mark it as expert/subject to revision and then we're good.

We can revisit IndexAccessor separately.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

There are concurrency issues to work out.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Caused by: java.io.IOException: read past EOF
    [junit]     at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:135)
    [junit]     at org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:92)
    [junit]     at org.apache.lucene.store.IndexOutput.copyBytes(IndexOutput.java:172)
    [junit]     at org.apache.lucene.index.TermVectorsWriter.addRawDocuments(TermVectorsWriter.java:185)
    [junit]     at org.apache.lucene.index.SegmentMerger.mergeVectors(SegmentMerger.java:447)
    [junit]     at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:145)
    [junit]     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4823)
    [junit]     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4408)
    [junit]     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:218)
    [junit]     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:274)
asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think your latest patch is missing the new unit test source.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Woops, sorry – I downloaded the wrong patch.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we can simplify the approach here:

Other things:

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

> make a new SegmentReaderPool private class

I'd prefer the SegmentReaderPool model over the patch's existing one, as it is simpler and closer to how the underlying system actually works, meaning it works directly with the segments in a systematized way.

> We would never mergeSegmentInfos, never ask the "internal reader" to commit

Good, merging the segmentInfos is confusing and tricky to debug.

> Since IndexReader will always be readonly, you can simplify the new DirectoryIndexReader.open method

+1

> why do we need merge.segmentReadersClone?

I was modeling after the segmentInfosClone. If it's not necessary I'll happily remove it.

> I think the only reason for reader to hold a reference to writer is so that on reopen, the reader realizes it was created from IW.getReader and simply forwards the request to IW

+1

> Wouldn't delete-by-Query cover this? Ie one could always make a Filter implementing the "look @ field cache, do some logic, provide docIDs to delete", wrap as Query, then delete-by-Query? (from a previous comment)

Yes that will work.

How are these sample SegmentReaderPool method signatures?

private class SegmentReaderPool {
  // returns a pooled SegmentReader for si, incRef'd for the caller
  public SegmentReader get(SegmentInfo si) {}

  // releases a reader previously obtained from get()
  public void decRef(SegmentReader sr) {}
}
asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

In SegmentInfo, its hashCode only takes into account the name and directory; why not the delete generation?

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> How are these sample SegmentReaderPool method signatures?

Looks good for starters. decRef would simply pass through to the SR's normal decRef, but would then also decRef its own copy once the ref count has dropped to 1?
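i.e., something like this sketch (the pool's internal collection is assumed):

// Sketch of SegmentReaderPool.decRef under the semantics above.
public synchronized void decRef(SegmentReader sr) throws IOException {
  sr.decRef();                  // pass through to the reader's normal decRef
  if (sr.getRefCount() == 1) {  // only the pool's own reference remains
    pooled.remove(sr);          // 'pooled' is the pool's assumed internal collection
    sr.decRef();                // drop the pool's reference; the reader closes
  }
}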

> In SegmentInfo, its hashCode only takes into account the name and directory; why not the delete generation?

Two SegmentInfo instances are considered equal if the dir is == and the name is .equals, so hashCode must match that (be the same value when equals returns true).
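i.e., roughly (illustrative, not the exact SegmentInfo source):

// equals uses only dir (by identity) and name, so hashCode may use only those too;
// including the delete generation would break "equal objects have equal hash codes".
public boolean equals(Object obj) {
  if (!(obj instanceof SegmentInfo)) return false;
  SegmentInfo other = (SegmentInfo) obj;
  return other.dir == dir && other.name.equals(name);
}

public int hashCode() {
  return dir.hashCode() + name.hashCode();
}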

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Jason could you rebase the patch on current trunk? (Or post the revision you're on?).

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

We'll need a method to check the SegmentReaderPool for readers that may be removed. It seems it would check IW.segmentInfos, pending merges, and current merges for SegmentInfo(s) that are in use; all others in the pool would then be expunged. I'm not sure if this method would need to run in a background thread.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Looks good!

For SegmentReaderPool:

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

> I think we should change commitMergedDeletes to use the already computed doc ID maps, instead of loading the old deletes.

Where do I get the already computed doc ID maps?

> I think getReader does need to flush, else added docs won't be seen in the reopened reader. You should call flush(true, true, true) - because you want to triggerMerges, flushDocStores, and flushDeletes.

I added a flushDeletesToDir flag that defaults to true except for IW.getReader.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

Found the doc ID map in the SegmentMerger in IW.commitMerge.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

- uses the docIdMap from the commitMerge SegmentMerger to map new deletes to the newly merged segment

- applyDeletesImmediately

- solved the decRef issue which appears in TestIndexWriterReader.testDeleteFromIndexWriter; it doesn't show up in the debugger:
[junit] java.lang.AssertionError
    [junit]     at org.apache.lucene.index.SegmentReader$Ref.incRef(SegmentReader.java:103)
    [junit]     at org.apache.lucene.index.SegmentReader.incRef(SegmentReader.java:341)
    [junit]     at org.apache.lucene.index.IndexWriter$SegmentReaderPool.get(IndexWriter.java:647)
    [junit]     at org.apache.lucene.index.IndexWriter$SegmentReaderPool.getReadOnlyClone(IndexWriter.java:615)
    [junit]     at org.apache.lucene.index.MultiSegmentReader.<init>(MultiSegmentReader.java:86)
    [junit]     at org.apache.lucene.index.ReadOnlyMultiSegmentReader.<init>(ReadOnlyMultiSegmentReader.java:35)
    [junit]     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:396)
    [junit]     at org.apache.lucene.index.TestIndexWriterReader.testDeleteFromIndexWriter(TestIndexWriterReader.java:159)
    [junit]     at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)
asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jeremy Volkman (migrated from JIRA)

I noticed the comments about IW.getReader() flushing the current state of the writer, launching merges, etc. Does this mean that the point of interaction between index writes and reads will still be the Directory? (i.e. the disk)

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> Does this mean that the point of interaction between index writes and reads will still be the Directory? (i.e. the disk)

For added docs, yes. But deletions will carry straight through in memory.

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

asfimport commented 15 years ago

Jason Rutherglen (migrated from JIRA)

[junit] Caused by: java.lang.AssertionError
    [junit]     at org.apache.lucene.index.IndexFileDeleter$RefCount.DecRef(IndexFileDeleter.java:553)
    [junit]     at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:470)
    [junit]     at org.apache.lucene.index.IndexFileDeleter.decRef(IndexFileDeleter.java:481)
    [junit]     at org.apache.lucene.index.IndexWriter.decrefMergeSegments(IndexWriter.java:4485)
    [junit]     at org.apache.lucene.index.IndexWriter.commitMerge(IndexWriter.java:4473)
    [junit]     at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4910)
    [junit]     at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4539)
    [junit]     at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:218)
    [junit]     at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:274)
[junit] java.lang.AssertionError
    [junit]     at org.apache.lucene.index.IndexWriter.finishMerges(IndexWriter.java:3350)
    [junit]     at org.apache.lucene.index.IndexWriter.closeInternal(IndexWriter.java:2090)
    [junit]     at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:2045)
    [junit]     at org.apache.lucene.index.TestConcurrentMergeScheduler.testNoWaitClose(TestConcurrentMergeScheduler.java:211)
    [junit]     at org.apache.lucene.util.LuceneTestCase.runTest(LuceneTestCase.java:88)