IndexReader.reopen() [LUCENE-743]

asfimport commented 17 years ago

This is Robert Engels' implementation of IndexReader.reopen() functionality, as a set of 3 new classes (this was easier for him to implement, but should probably be folded into the core, if this looks good).

Migrated from LUCENE-743 by Otis Gospodnetic (@otisg), 3 votes, resolved Nov 17 2007 Attachments: IndexReaderUtils.java, lucene-743.patch (versions: 3), lucene-743-take10.patch, lucene-743-take2.patch, lucene-743-take3.patch, lucene-743-take4.patch, lucene-743-take5.patch, lucene-743-take6.patch, lucene-743-take7.patch, lucene-743-take8.patch, lucene-743-take9.patch, MyMultiReader.java, MySegmentReader.java, varient-no-isCloneSupported.BROKEN.patch Linked issues:

2106
- 1906
- 2062

asfimport commented 17 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

In a direct email to me, Robert said: "All of the files can be prepended with the ASL."

asfimport commented 17 years ago

robert engels (migrated from JIRA)

A generic version probably needs to implement reference counting on the Segments or IndexReader in order to know when they can be safely closed.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

i somehow missed seeing this issues before ... i don't really understand the details, but a few comments that come to mind...

1) this approach seems to assume that when reopening a MyMultiReader, the sub readers will all be MySegmentReaders .. assuming we generalize this to MultiReader/SegmentTeader, this wouldn't work in the case were people are using a MultiReader containing other MultiReaders ... not to mention the possibility of people who have written their own IndexReader implementations. in generally we should probably try to approach reopening a reader as a recursive operation if possible where each type of reader is responsible for checking to see if it's underlying data has changed, if not return itself, if so return a new reader in it's place (much like rewrite works for Queries)

2) there is no more commit lock correct? ... is this approach something that can still be valid using the current backoff/retry mechanism involved with opening segments?

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> i somehow missed seeing this issues before ...

actually, me too... first I came across this thread:

http://www.gossamer-threads.com/lists/lucene/java-dev/31592?search_string=refresh;#31592

in which Doug suggested adding a static method IndexReader.open(IndexReader) which would either return the passed in IndexReader instance in case is is up to date or return a new, refreshed instance.

I started implementing this, using Dougs and Roberts ideas and then realized that there was already this open issue. But the code here is quite outdated.

> in generally we should probably try to approach reopening a reader as a > recursive operation

Yeah we could do that. However, this might not be so easy to implement. For example, if a user creates a MultiReader instance and adds whatever subreaders, we would have to recursively refresh the underlying readers. But if the MultiReader was created automatically by IndexReader.open() just calling refresh on the subreaders is not enough. New SegmentReaders have to be opened for new segments.

Also the recursive walk would have to take place within the FindSegmentsFile logic.

I decided therefore to only allow IndexReaders to be refreshed if they were created by one of the IndexReader.open() methods. I'm going to submit a first version of my patch soon. Do you think this is too limiting?

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

First version of my patch:

Adds the static method IndexReader.open(IndexReader) that returns a new instance of IndexReader in case the reader could be updated. If it was up to date then the passed-in instance is returned. Only IndexReader instances that were created by one of the IndexReader.open() methods can be refreshed.
SegmentsReader.reopen(SegmentInfos) looks in the SegmentInfos for a segment with the same name. If one could be found then either the deleted docs or the norms were updated, otherwise the segment name would have changed. reopen() clones the SegmentReader and either loads the deleted docs, the norms or both. Then the clone is returned and the original SegmentReader is marked as closed.
If the index has only one segment, then IndexReader.open(IndexReader) checks if the passed in IndexReader can be refreshed. This is only possible if it is no MultiReader and if the segment name has not changed. Otherwise a new SegmentReader instance is returned and the old reader is closed.
If the index has multiple segments, then IndexReader.open(IndexReader) creates a new MultiReader and tries to reuse the passed-in reader (in case it's a SegmentReader) or its subreaders (in case it's a MultiReader). For new segments new SegmentReaders are created. Old readers that couldn't be reused are closed.
Adds the new testcase TestIndexReaderReopen. It includes the method assertIndexEquals(IndexReader, IndexReader) that checks whether boths IndexReaders have the same content. The testcase creates an index and performes different modifications on that index. One IndexReader is refreshed after each index modification and compared to a fresh IndexReader which is opened after each modification.

This first version is for review and not ready to commit. I want to add more extensive tests and probably clean up the code.

All tests pass.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

> Yeah we could do that. However, this might not be so easy to implement. > For example, if a user creates a MultiReader instance and adds whatever > subreaders, we would have to recursively refresh the underlying readers. > But if the MultiReader was created automatically by IndexReader.open() just > calling refresh on the subreaders is not enough. New SegmentReaders have to > be opened for new segments.

...this being the curse that is MultiReader – it can serve two very differenet purposes.

You seem to have already solved the multisegments in a single directory approach, the MultiReader over many subreader part actually seems much easier to me (just call your open method on all of the subreaders) the only tricky part is detecting which behavior should be used when. This could be driven by a simple boolean property of MultiReader indicating whether it owns it's directory and we need to look for new segments or not – in which case we just need to refresh the subreaders. (My personal preference would be to change MultiReader so "this.directory" is null if it was open over several other subReaders, right now it's just assigned to the first one arbitrarily, but there may be other consequences of changing that)

Incidentally: I don't think it's crucial that this be done as a recursive method. the same approach i describe could be added to static utility like what you've got, I just think that if it's possible to do it recursively we should so that if someone does write their own MultiReader or SegmentReader subclass they can still benefit from any core reopening logic as long as theey do their part to "reopen" their extensions.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

an extremely hackish refactoring of the previous patch that demonstrates the method working recursively and dealing with MultiReaders constructed over multiple subReaders.

a few notes:

1) no where near enough tests of the subReader situation 2) the refactoring is very very ugly and brutish ... most of the code is still in IndexReader just because it needs so many things that are private – things that really seems like they should be pushed down into SegmentReader (or made protected) 3) test triggered an apparent NPE in MultiReader.isOptimized() when there are subReaders, i hacked arround this in the test, see usages of assertIndexEqualsZZZ vs assertIndexEquals 4) the FilterIndexReader situation is ... interesting. in theory FilterIndexReader should really be abstract (since if you aren't subclassing it anyway, why do you want it?)

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

Now, after #1856, #2046 and #1907 are committed, I updated the latest patch here, which was now easier because MultiReader is now separated into two classes.

Notes:

As Hoss suggested I added the reopen() method to IndexReader non-static.
MultiReader and ParallelReader now overwrite reopen() to reopen the subreaders recursively.
FilteredReader also overwrites reopen(). It checks if the underlying reader has changed, and in that case returns a new instance of FilteredReader.

I think the general contract of reopen() should be to always return a new IndexReader instance if it was successfully refreshed and return the same instance otherwise, because IndexReaders are used as keys in caches. A remaining question here is if the old reader(s) should be closed then or not. This patch closes the old readers for now, if we want to change that we probably have to add some reference counting mechanism, as Robert suggested already. Then I would also have to change the SegmentReader.reopen() implementation to clone resources like the dictionary, norms and delete bits. I think closing the old reader is fine. What do others think? Is keeping the old reader after a reopen() a useful usecase?

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

I ran some quick performance tests with this patch:

1) The test opens an IndexReader, deletes one document by random docid, closes the Reader. So this reader doesn't have to open the dictionary or the norms. 2) Another reader is opened (or alternatively reopened) and one TermQuery is executed, so this reader has to read the norms and the dictionary.

I run these two steps 5000 times in a loop.

First run: Index size: 4.5M, optimized

1) + TermQuery: 103 sec
1) + 2) (open): 806 sec, so open() takes 703 sec
1) + 2) (reopen): 118 sec, so reopen() takes 15 sec ==> Speedup: 46.9 X

Second run: Index size: 3.3M, 24 segments (14x 230.000, 10x 10.000)

1) + TermQuery: 235 sec
1) + 2) (open): 1162 sec, so open() takes 927 sec
1) + 2) (reopen): 321 sec, so reopen() takes 86 sec ==> Speedup: 10.8X

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> I think closing the old reader is fine. What do others think? Is keeping the old > reader after a reopen() a useful usecase?

In a multi-threaded environment, one wants to open a new reader, but needs to wait until all requests finish before closing the old reader. Seems like reference counting is the only way to handle that case.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

(note: i haven't looked at the latest patch in detail, just commenting on the comments)

One key problem i see with automatically closing things in reopen is MultiReader: it's perfectly legal to do something like this psuedocode...

IndexReader a, b, c = ... MultiReader ab = new MultiReader({a, b}) MultiReader bc = new MultiReader({b, c}) ...b changes on disk... ab.reopen(); // this shouldn't affect bc;

one possibility would be for the semantics of reopen to close old readers only if it completely owns them; ie: MultiReader should never close anything in reopen, MultiSegmentReader should close all of the subreaders since it opened them in the first place ... things get into a grey area with SegementReader though.

In general i think the safest thing to do is for reopen to never close. Yonik's comment showcases one of the most compelling reasons why it can be important for clients to be able to keep using an old IndexReader instance, and it's easy enough for clients that want the old one closed to do something like...

IndexReader r = ... ... IndexReader tmp = r.reopen(); if (tmp != r) r.close(); r = tmp; ...

(one question that did jump out at me while greping the patch for the where old readers were being closed: why is the meat of reopen still in "IndexReader" with a "if (reader instanceof SegmentReader)" style logic in it? can't the different reopen mechanisms be refactored down into SegmentReader and MultiSegmentReader respectively? shouldn't the default impl of IndexReader throw an UnsuppportedOperationException?)

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> IndexReader a, b, c = ... > MultiReader ab = new MultiReader({a, b}) > MultiReader bc = new MultiReader({b, c}) > ...b changes on disk... > ab.reopen(); // this shouldn't affect bc; > > one possibility would be for the semantics of reopen to close old readers only > if it completely owns them; ie: MultiReader should never close anything in > reopen, MultiSegmentReader should close all of the subreaders since it opened > them in the first place

So if 'b' in your example is a MultiSegmentReader, then the reopen() call triggered from MultiReader.reopen() would close old readers, because it owns them, thus 'bc' wouldn't work anymore. So it depends on the caller of MultiSegmentReader.reopen() whether or not to close the subreaders. I think this is kind of messy. Well instead of reopen() we could add reopen(boolean closeOldReaders), but then...

> IndexReader r = ... > ... > IndexReader tmp = r.reopen(); > if (tmp != r) r.close(); > r = tmp; > ...

... is actually easy enough as you pointed out, so that the extra complexity is not really worth it IMO.

> In general i think the safest thing to do is for reopen to never close.

So yes, I agree.

> why is the meat of reopen still in "IndexReader" with a "if (reader instanceof > SegmentReader)" style logic in it? can't the different reopen mechanisms be > refactored down into SegmentReader and MultiSegmentReader respectively?

I'm not sure if the code would become cleaner if we did that. Sometimes a SegmentReader would then have to return a MultiSegmentReader instance and vice versa. So we'd probably have to duplicate some of the code in these two classes.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

> I'm not sure if the code would become cleaner if we did that. Sometimes a SegmentReader would then have to > return a MultiSegmentReader instance and vice versa. So we'd probably have to duplicate some of the code in > these two classes.

i don't hink there would be anything wrong with SegmentReader.reopen returning a MultiSegmentReader in some cases (or vice versa) but it definitely seems wrong to me for a parent class to be explicitly casing "this" to one of two know subclasses ... making reopen abstract in the base class (or throw UnsupportOp if for API back compatibility) seems like the only safe way to ensure any future IndexReader subclasses work properly.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

We should first refactor segmentInfos into IndexReader's subclasses.

asfimport commented 17 years ago

Testo Nakada (migrated from JIRA)

Please also consider making an option where the reopen can be automated (i.e. when the index is updated) instead of having to call it explicitly. Thread safety should be taken into account as well.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

I'm attaching a new version of the patch that has a lot of changes compared to the last patch:

I factored most of the reopen logic into the subclasses of IndexReader. Now that we're having the DirectoryIndexReader layer this was possible in a more elegant way.
IndexReader.reopen() now does not close the old readers by default. This was somewhat tricky, because now the IndexReaders must be cloneable. I changed IndexReader to implement the Cloneable interface and implemented clone() for all Lucene built-in IndexReaders. However, there are probably custom IndexReader implementations out there that do not implement clone() and reopen() should throw an exception when an attempt is made to reopen such a reader. But I don't want to throw an exception in IndexReader.clone() itself, because then it would not be possible anymore for subclasses to recursively call the native Object.clone() via super.clone(). To solve this conflict I added the method {code:java} /**
Returns true, if this IndexReader instance supports the clone() operation. */ protected boolean isCloneSupported();


to IndexReader which returns false by default. IndexReader.clone() checks if the actual implementation supports clone() (i. e. the above method returns true). If not, it throws an UnsupportedOperationException, if yes, it returns super.clone().

I was not sure about whether to throw an (unchecked) UnsupportedOperationException or a CloneNotSupportedException in this case. I decided to throw UnsupportedOperationException even though it's not really following the clone() guidelines, because it avoids the necessity to catch the CloneNotSupportedException every time clone() is called (in the reopen() methods of all IndexReader implementations).

As an example for how the clone() method is used let me describe how MultiReader.reopen() works: it tries to reopen every of its subreaders. If at least one subreader could be reopened successfully, then a new MultiReader instance is created and the reopened subreaders are added to it. Every of the old MultiReader's subreaders, that was not reopened (because of no index changes) is now cloned() and added to the new MultiReader.

- I also added the new method 
{code:java}
  /**
   * In addition to {`@link` #reopen()} this methods offers the ability to close
   * the old IndexReader instance. This speeds up the reopening process for
   * certain IndexReader implementations and reduces memory consumption, because
   * resources of the old instance can be reused for the reopened IndexReader
   * as it avoids the need of copying the resources.
   * <p>
   * The reopen performance especially benefits if IndexReader instances returned 
   * by one of the <code>open()</code> methods are reopened with 
   * <code>closeOldReader==true</code>.
   * <p>
   * Certain IndexReader implementations ({`@link` MultiReader}, {`@link` ParallelReader})
   * require that the subreaders support the clone() operation (see {`@link` #isCloneSupported()}
   * in order to perform reopen with <code>closeOldReader==false</code>.  
   */
  public synchronized IndexReader reopen(boolean closeOldReader);

As the javadoc says it has two benefits: 1) it speeds up reopening and reduces ressources, and 2) it allows to reopen readers, that use non-cloneable subreaders.

Please let me know what you think about these changes, especially about the clone() implementation.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

I ran new performance tests with the latest patch similar to the tests I explained in an earlier comment on this issue.

I'm using again a 4.5M index. In each round I delete one document and (re)open the IndexReader thereafter. Here are the numbers for 5000 rounds:

Time	Speedup
open	703s
reopen(closeOldReader==false)	62s	11x
reopen(closeOldReader==true)	16s	44x

Now in each round I delete on document and also set the norm for one random document. Numbers for 1000 rounds:

Time	Speedup
open	166s
reopen(closeOldReader==false)	33s	5.0x
reopen(closeOldReader==true)	29s	5.7x

I think these numbers look pretty good. We get a quite decent speedup even if the old readers are not closed.

I would like to commit this in a couple of days to get ready for Lucene 2.3. It would be nice if someone could review the latest patch! Hoss? Others?

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

I think this looks pretty good Michael! Too bad so much needs to be cloned in the case that closeOldReader==false... maybe someday in the future we can have read-only readers.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> Too bad so much needs to be cloned in the case that closeOldReader==false... maybe someday in the future we can have read-only readers.

Yeah, the cloning part was kind of tedious. Read-only readers would indeed make our life much easier here. I'm wondering how many people are using the IndexReader to alter the norms anyway?

Well, thanks for reviewing the patch, Yonik!

asfimport commented 17 years ago

robert engels (migrated from JIRA)

Nice to see all the good work on this. We are still on a 1.9 derivative.

Hopefully we'll be able to move to stock 2.X release in the future.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

I haven't looked at the patch yet (i really really plan to this weekend, baring catastrophe) but i'm confused as to why you went the cloning route (along with the complexity of the api changes to indicate when it is/isn't supported) ... based on the comments, it seems to boil down to...

> As an example for how the clone() method is used let me describe how MultiReader.reopen() > works: it tries to reopen every of its subreaders. If at least one subreader could be reopened > successfully, then a new MultiReader instance is created and the reopened subreaders are > added to it. Every of the old MultiReader's subreaders, that was not reopened (because of no > index changes) is now cloned() and added to the new MultiReader.

that seems like circular logic: the clone method is used so that the sub readers can be cloned ?

why use clones at all? why not just use the original reader (if the "index" that reader represents hasn't changed, why clone it?

And if (for reasons that aren't clear to me) it is important for MultiReader to use a clone of it's subreaders when their reopen method returns "this" then shouldn't clients do the same thing? ... why not make reopen always return this.clone() if the index hasn't changed (which now that i think about it, would also help by punting on the isCloneSupported issue – each class would already know if it was clonable.

maybe this will make more sense once i read the patch ... i just wanted to through it out there in case someone had a chance to reply before i get a chance.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> why use clones at all? why not just use the original reader (if the "index" that reader represents hasn't changed, why clone it?

Let's say you have a MultiReader with two subreaders:

IndexReader ir1 = IndexReader.open(index1);
IndexReader ir2 = IndexReader.open(index2);
IndexReader mr = new MultiReader(new IndexReader[] {ir1, ir2});

Now index1 changes and you reopen the MultiReader and keep the old one open:

IndexReader mr2 = mr.reopen();

ir1 would now be reopened and let's assume we wouldn't clone ir2. If you use mr2 now to e.g. delete a doc and that doc happens to be in index2, then mr1 would also see the changes because both MultiReaders share the same subreader ir2 and are thus not independent from each other.

> why not make reopen always return this.clone()

clone() might be an expensive operation. We only need to clone if at least one of the subreaders has changed.

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> > Too bad so much needs to be cloned in the case that > > closeOldReader==false... maybe someday in the future we can have > > read-only readers. > > Yeah, the cloning part was kind of tedious. Read-only readers would > indeed make our life much easier here. I'm wondering how many people > are using the IndexReader to alter the norms anyway?

I think the closeOldReader=false case is actually quite important.

Because in production, I think you'd have to use that, so that your old reader stays alive and is used to service incoming queries, up until the point where the re-opened reader is "fully warmed".

Since fully warming could take a long time (minutes?) you need that old reader to stay open.

Can we take a copy-on-write approach? EG, this is how OS's handle the virtual memory pages when forking a process. This would mean whenever a clone has been made of a SegmentReader, they cross-reference one another. Then whenever either needs to alter deleted docs, one of them makes a copy then. Likewise for the norms.

This would mean that "read-only" uses of the cloned reader never pay the cost of copying the deleted docs bit vector nor norms.

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Actually if we went back to the sharing (not cloning) approach, could we insert a check for any writing operation against the re-opened reader that throws an exception if the original reader is not yet closed?

In Michael's example above, on calling mr2.deleteDoc, you would hit an exception because mr2 would check and see that it's "original" reader mr is not yet closed. But once you've closed mr, then the call succeeds.

I think this would let us have our cake and eat it too: re-opening would be very low cost for unchanged readers (no clones created), and, you can still do deletes, etc, after you have closed your prior reader. You control when your prior reader gets closed, to allow for warming time and for queries to finish on the old reader.

Would this work?

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

A couple other questions/points:

Just to verify: if you have a DirectoryIndexReader that is holding the write lock (because it has made changes to deletes/norms) then calling reopen on this reader should always return 'this', right? Because it has the write lock, it must be (better be!) current.

This might be a good place to add an assert: if you are not current, assert that you don't have the write lock, and if you have the write lock, assert that you are current.
I think you should add "ensureOpen()" at the top of MultiReader.reopen(...)?
If an Exception is hit during reopen, what is the resulting state of your original reader? I think, ideally, it is unaffected by the attempt to re-open? EG if you're on the hairy edge of file descriptor limits, and the attempt to re-open failed because you hit the limit, ideally your original reader is unaffected (even if you specified closeOldReader=true).

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

Okay, read the patch. I'm on board with the need for Clonable now ... it's all about isolation. if "r.reopen(false) == r" then the client is responsible for recognizing that it's got the same instance and can make the judgement call about reusing the instance or cloning depending on it's needs ... internally in things like MultiReader we have to assume we need a clone for isolation.

Questions and comments...

CompoundFileReader, FieldsReader, IndexReader, and BitVector all have clone methods where they silently catch and ignore CloneNotSupportedExceptions from super.clone()... if we don't expect the exception to ever happen, we should just let the exception propogate up the stack (there is no down side to declaring that clone() throws CloneNotSupportedException). If we think the exception might happen, but it's not a big deal if it does and we can ignore it, then we should put a comment in the catch block to that effect. i'm not clear which are the cases here.
i don't remember if the old patches had this issue as well, but having "return new FilterIndexReader(reopened);" in FilterIndexReader doesn't really help anyone using FilterIndexReader – they're going to want an instance of their own subclass. to meet the contract of FilterIndexReader, we should implement all "abstract" methods and delegate to the inner reader - so in theory do don't need a new instance of FIlterIndexReader, we could just return in.reopen(closeold) ... my gut says it would be better to leave the method out entirely so that the default impl which throws UnSupOpEx is used — that way subclasses who want to use reopen must define their own (instead of getting confused when their filtering logic stops working after they reopen for the first time)
instead of having an isClonable() method, why don't we just remove the "implements Clonable" declaration from IndexReader and put it on the concrete subclasses – then use "instanceof Cloneable" to determine if something is clonable? ... for that matter, the only place isCloneSupported is really used is in IndexReader.clone where an exception is thrown if Clone is not supported ... subclasses which know they are Clonable don't need this from the base class, subclasses which don't implement Clonable but are used in combination with core classes whose reopen method requires it could be skiped by the caller (since they don't implement Clonable) ..
javadocs frequently refer to "reopened successfully" and "refreshed successfully" when what it really means is "reopen() returned a newer/fresher reader" ... this may confuse people who are use to thinking of "successfully" a "without error"
random thought: are their any synchronization issues we're missing here? I'm wondering about the case where once thread calls reopen while another thread is updating norms or deleting docs. is there any chance for inconsistent state?

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

a rough variation on Michael's latest patch that makes the changes along two of the lines i was suggesting before reagrding FilterIndexReader and ising "instanceof Cloneable" instead of "isCloneSupported()" two important things to note:

1) i didn't change any of the documentation, i was trying to change as little aspossibel so the two patches could be compared side-by-side

2) this patch is currently broken. TestIndexReaderReopen gets an exception i can't understand ... but i have to stop now, and i wanted to post what i had in case people had comments.

...now that i've done this exercise, i'm not convinced that it's a better way to go, but it does seem like isCloneSupported isn't neccessary, that's the whole point of the Cloneable interface.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> I'm wondering about the case where once thread calls reopen while another thread is updating norms or deleting docs.

Hmmm there is cause for concern (and I should have had my mt-safe hat on :-) Reopen is synchronized on the reader, and so are norms access and docs, but from a quick look:

reopen() may be synchronized, but clone() called on sub-readers isn't in a synchronized context that the sub-reader cares about. For example, MultiReader.reopen has the lock on the multireader, but calles subreader.clone() which iterates over the norms in a non thread-safe way.
IndexInput objects that are in use should never be cloned... (like what is done in FieldsReader.clone())

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

Thanks all for the reviews and comments!

There seem to be some issues/open questions concerning the cloning. Before I comment on them I think it would make sense to decide whether we want to stick with the cloning approach or not. Mike suggests this approach:

> Actually if we went back to the sharing (not cloning) approach, could > we insert a check for any writing operation against the re-opened > reader that throws an exception if the original reader is not yet > closed?

Interesting, yes that should work in case we have two readers (the original one and the re-opened one). But what happens if the user calls reopen twice to get two re-opened instances back? Then there would be three instances, and without cloning the two re-opened ones would also share the same resources. Is this a desirable use case or would it be okay to restrict reopen() so that it can only create one new instance?

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> > Actually if we went back to the sharing (not cloning) approach, > > could we insert a check for any writing operation against the > > re-opened reader that throws an exception if the original reader > > is not yet closed? > > Interesting, yes that should work in case we have two readers (the > original one and the re-opened one). But what happens if the user > calls reopen twice to get two re-opened instances back? Then there > would be three instances, and without cloning the two re-opened ones > would also share the same resources. Is this a desirable use case or > would it be okay to restrict reopen() so that it can only create one > new instance?

Hmmm good point.

Actually, we could allow more then one re-open call if we take the following approach: every time a cloned Reader "borrows" a reference to a sub-reader, it increments a counter (call it the "referrer count"). When the Reader is closed, it decrements the count (by 1) for each of the sub-readers.

Then, any reader should refuse to do a writing operation if its "referrer" count is greater than 1, because it's being shared across more than one referrer.

This way if you have a reader X and you did reopen to get Y and did reopen again to get Z then the shared sub-readers between X, Y and Z would not allow any write operations until 2 of the three had been closed. I think that would work?

BTW this would also allow for very efficient "snapshots" during searching: keeping multiple readers "alive", each searching a different point-in-time commit of the index, because they would all share the underlying segment readers that they have in common. Vs cloning which would have to make many copies of each segment reader.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> This way if you have a reader X and you did reopen to get Y and did > reopen again to get Z then the shared sub-readers between X, Y and Z > would not allow any write operations until 2 of the three had been > closed. I think that would work?

Yes I think it would work. However, this approach has two downside IMO:

reopen() becomes more complicated and restricted for the user. With the cloning approach the user doesn't have to care about when index modifications are not allowed. IndexReader instances returned by open() or reopen() can be used exactly the same without any restrictions.
We have to be very careful about cross-referencing multiple readers. If for some reason any references between two or more readers are not cleared after one was closed, then that reader might not become GC'd. I'm not saying it's not possible or even very hard, we just have to make sure those things can't ever happen.

Of course the cloning approach has disadvantages too. For custom readers clone() has to be implemented in order to make reopen() work correctly. Also reopen() is more expensive in case of closeOldReader=false. Well we could certainly consider the lazy clone approach that you suggested, Mike, but we have to be careful about the cross-referencing issue again.

So I'm really not sure which way the better one is. I think I'm slightly in favor for the cloning approach, so that reopen() returns instances that are completely independant from each other, which seems cleaner IMO. What do others think?

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> We have to be very careful about cross-referencing multiple readers. > If for some reason any references between two or more readers are > not cleared after one was closed, then that reader might not become > GC'd. I'm not saying it's not possible or even very hard, we just > have to make sure those things can't ever happen.

One correction: there should be no cross-referencing, right? The only thing that's happening is incrementing & decrementing the "referrer count" for a reader? (But, you're right: the "copy on write" approach to cloning would have cross-referencing).

I think the downside of this proposed "shared readers" (non-cloning) approach is that you can't delete/setNorm on the clone until you close the original. But I think that's fairly minor? Also if you really need a full deep copy of your reader you could just open a new one (ie, not use "reopen")?

I think the big upside of "shared readers" is reopening becomes quite a bit more resource efficient: the cost of re-opening a reader would be in proportion to what has actually changed in the index. EG, if your index has added a few tiny segments since you last opened then the reopen would be extremely fas but with cloning you are forced to make a full deep copy of all other [unchanged] segments.

asfimport commented 17 years ago

Chris M. Hostetter (@hossman) (migrated from JIRA)

in deciding "to clone or not clone" for the purposes of implementing reopen, it may make sense to step back and ask a two broader questions...

1) what was the motivation for approaching reopen using clones in the first place 2) what does it inherently mean to clone an indexreader.

the answer to the first question, as i understand it, relates to the issue of indexreaders not being "read only" object ... operations can be taken which modify the readers state, and those operations can be flushed to disk when the reader is closed. so readerB = readerA.reopen(closeOld=false) and then readerA.delete(...) is called, there is ambiguity as to what should happen in readerB. so the clone route seems pretty straight forward if and only if we have an unambiguous concept of cloning a reader (because then the clone approach to reopen becomes unambiguous as well). alternately, since reopen is a new method, we can resolve the ambiguity anyway we want, as long as it's documented ... in theory we should pick a solution that seems to serve the most benefit ... perhaps that solution is to document reopen with "if reopen(closeOld=false) returns a new reader it will share state with the current reader, attempting to do the following operations on this new reader will result in undefined behavior"

the answer the the second is only important if we want to go the cloning route ... but it is pretty important in a larger sense then just reopening ... f we start to say that any of the IndexReader classes are Clonable we have to be very clear about what that means in all cases where someone might clone it in addition to reopen ... in particular, i worry about what it means to clone a reader which has already had "write" operations performed on it (something the clone based version of reopen will never do because a write operation indicates the reader must be current), but some client code might as soon as we add the Clonable interface to a class.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

> in deciding "to clone or not clone" for the purposes of implementing > reopen, it may make sense to step back and ask a two broader questions...

I agree!

> 1) what was the motivation for approaching reopen using clones in the > first place

Good summarization! You are right, I started the clone approach because IndexReaders are not "read only" objects.

> "if reopen(closeOld=false) returns a new reader it will share state with > the current reader, attempting to do the following operations on this > new reader will result in undefined behavior"

This would mean, that we simply warn the user that performing write operations with the re-opened indexreader will result in undefined behavior, whereas with Mike's approach we would prevent such an undefined behavior by using reference counting.

Hmm, so what are our long-term plans for indexreaders? If our goal is to make them read-only (we can delete via IndexWriter already), then I think adding those warning comments to reopen(), as you suggest Hoss, would be sufficient.

If everybody likes this approach I'll go ahead and submit a new patch.

asfimport commented 17 years ago

Doug Cutting (@cutting) (migrated from JIRA)

Hmm, so what are our long-term plans for indexreaders? If our goal is to make them read-only (we can delete via IndexWriter already), then I think adding those warning comments to reopen(), as you suggest Hoss, would be sufficient.

I think that's a fine direction. Note however that IndexWriter implements delete by calling IndexReader.delete(). That method could be made package-private, so that users cannot call it, but then this makes it impossible for someone to subclass IndexReader from a different package. So perhaps delete() needs to move to a subclass of IndexReader? That gets messy...

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

I think that's a fine direction. Note however that IndexWriter implements delete by calling IndexReader.delete(). That method could be made package-private, so that users cannot call it, but then this makes it impossible for someone to subclass IndexReader from a different package. So perhaps delete() needs to move to a subclass of IndexReader? That gets messy...

Actually all of the lock/commit logic moved from IndexReader to DirectoryIndexReader already, and the delete logic is in SegmentReader, which subclasses DirectoryIndexReader. So we could remove the deleteDocument() API from IndexReader but leave it in DirectoryIndexReader. Then it would still be possible to extend IndexReader from a different package just as today, and IndexWriter could use DirectoryIndexReader for performing deletes. These changes should be trivial.

asfimport commented 17 years ago

Doug Cutting (@cutting) (migrated from JIRA)

Got it. IndexWriter only works with SegmentReaders anyway, not with an arbitrary IndexReader implementation: IndexReader is extensible, but IndexWriter is not. I'd (momentarily) forgotten that. Nevermind.

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

Having a read-only IndexReader would (should?) mean being able to remove "synchronized" from some things like isDeleted()... a nice performance win for multi-processor systems for things that didn't have access to the deleted-docs bitvec.

> If our goal is to make them read-only (we can delete via IndexWriter already)

But you can only delete-by-term. It's more powerful to be able to delete by docid, however I manage to come up with it. So I think deleteDocument(int id) should either be moved to a subclass. same with setNorms?

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

So I think deleteDocument(int id) should either be moved to a subclass. same with setNorms?

Or we could take the approach you suggested in http://www.gossamer-threads.com/lists/lucene/java-dev/52017.

That would mean to add a callback after flush to get a current IndexReader to get the docids and to use the IndexWriter then to perform deleteDocument(docId) or setNorm(). These methods could also take an IndexReader as argument, e. g. deleteDocument(IndexReader reader, int docId), which would throw an IOException if the passed in reader is stale (i. e. docids have changed since the reader was opened). Just as IndexReader does it today. Does this make sense or am I missing something?

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

I just opened #2106 and would like to move discussions related to "read-only" IndexReaders to that issue.

As for reopen() I'd like to go with Hoss' suggestion for now and add warning comments to reopen() saying that using an re-opened IndexReader with closeOldReader==false for write operations will result in an undefined behavior. Any objections?

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

As for reopen() I'd like to go with Hoss' suggestion for now and add warning comments to reopen() saying that using an re-opened IndexReader with closeOldReader==false for write operations will result in an undefined behavior.

How about just defining the behavior such that any pending changes are flushed. That would make it more useful because you could then reopen readers you used for deletes. An alternative would be a method to explicitly flush changes on a reader, giving one the ability to then reopen it, but I like the former better since it avoids adding another API call.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

How about just defining the behavior such that any pending changes are flushed. That would make it more useful because you could then reopen readers you used for deletes.

Hmm, I'm not sure I understand. A reader which is being used for deletes or setting norms is always current (it owns the write lock), so there should never be the need to re-open such a reader.

However, if you re-open an existing reader which was not used for deletes before and use the new instance (b) to perform deletes, it will result in a undefined behavior for the old reader (a):

IndexReader a = .....
....
IndexReader b = a.reopen();
b.deleteDocument(...);

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

A reader which is being used for deletes or setting norms is always current (it owns the write lock), so there should never be the need to re-open such a reader.

I was thinking about the "add some docs" then "delete some docs" scenario: One currently needs to close() the deleting reader to open an IndexWriter. If IndexReader.commit() was public, then one could simply flush changes, and then call reopen() after the IndexWriter was done adding new documents. But perhaps longer term, all deletions should be done via the IndexWriter anyway.

if you re-open an existing reader which was not used for deletes before and use the new instance (b) to perform deletes

Ah, thanks for that clarification. I guess that should remain undefined.

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

Hmm one other thing: how should IndexReader.close() work? If we re-open a reader (a is the old reader, b is the new one), and then someone calls a.close(), then a should not close any resources that it shares with b.

One way to make this work would be reference counting. Since we want to avoid that, we could simply restrict reopen() from being called twice for the same reader. Then there would never be more than 2 readers sharing the same ressources. The old reader a would remember that reopen() was called already and would not close the shared ressources on close(). However, the new reader b would close all ressources, meaning that reader a would not work anymore after b.close() was called. Thoughts?

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think reference counting would solve this issue quite nicely. How come we want to avoid reference counting? It seems like the right solution here.

The implementation seems simple. When a reader is opened, it starts with RC 1. When it is closed, it decrefs the RC and marks itself closed (to make sure double-close does not re-decref the RC). When a MultiReader needs to use the reader, it calls incref. And when that MultiReader is done with it, it calls decref. Whenever the RC hits 0 it's safe to free all resources.

Wouldn't that work?

asfimport commented 17 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

> When it is closed, it decrefs the RC and marks itself closed (to make sure double-close does not re-decref the RC)

But if a reader is shared, how do you tell two real closes from an erronous double-close?
Perhaps have a top level close() that is only invoked once via isClosed, and a projected doClose() that a multi-reader would use that actually does the decref?

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

The implementation seems simple. When a reader is opened, it starts with RC 1. When it is closed, it decrefs the RC and marks itself closed (to make sure double-close does not re-decref the RC). When a MultiReader needs to use the reader, it calls incref. And when that MultiReader is done with it, it calls decref. Whenever the RC hits 0 it's safe to free all resources.

Wouldn't that work?

You're right, for a MultiReader and ParallelReader this would work and wouldn't be hard to implement.

It is quite different when it comes to SegmentReaders. SegmentReader.reopen() checks SegmentInfos if there is a segment with the same name but newer normGen or delGen. If there is one then a new SegmentReader instance is created that reuses resources like e. g. TermInfosReader and loads the new generation of the del or norm file.

Now you can end up having a bunch of SegmentReaders that share the same resources but don't know about each other. The reference counting would have to happen somewhere else, e. g. in the TermInfosReader. Of course this is doable, but it's a special case and more complicated compared to the "restrict reopen() to only be called once"-approach.

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

> But if a reader is shared, how do you tell two real closes from an erronous double-close? > Perhaps have a top level close() that is only invoked once via isClosed, and a projected doClose() that a multi-reader would use that actually does the decref?

Yes I think that's the right approach.

> It is quite different when it comes to SegmentReaders. SegmentReader.reopen() checks SegmentInfos if there is a segment with the same name but newer normGen or delGen. If there is one then a new SegmentReader instance is created that reuses resources like e. g. TermInfosReader and loads the new generation of the del or norm file. > > Now you can end up having a bunch of SegmentReaders that share the same resources but don't know about each other. The reference counting would have to happen somewhere else, e. g. in the TermInfosReader. Of course this is doable, but it's a special case and more complicated compared to the "restrict reopen() to only be called once"-approach.

For SegmentReader, I think on reopen(), everything would be shared except norms and/or deletedDocs right? In which case couldn't all cascaded reopens hold onto the original SegmentReader & call doClose() on it when they are closed? (Ie it is the "owner" of all the shared parts of a SegmentReader). Then, deletedDocs needs no protection (it has no close()), and for Norms we could push the reference counting down into it as well?

We wouldn't need to push reference counting into all the readers that a SegmentReader holds because those are always shared when a SegmentReader is reopened?

asfimport commented 17 years ago

Michael Busch (migrated from JIRA)

Hi Mike,

I'm not sure if I fully understand your comment. Consider the following (quite constructed) example:

IndexReader reader1 = IndexReader.open(index1);  // optimized index, reader1 is a SegmentReader
IndexReader multiReader1 = new MultiReader(new IndexReader[] {reader1, IndexReader.open(index2)});
... // modify index2
IndexReader multiReader2 = multiReader1.reopen();  
// only index2 changed, so multiReader2 uses reader1 and has to increment the refcount of reader1
... // modify index1
IndexReader reader2 = reader1.reopen();
// reader2 is a new instance of SegmentReader that shares resources with reader1
... // modify index1
IndexReader reader3 = reader2.reopen();
// reader3 is a new instance of SegmentReader that shares resources with reader1 and reader2

Now the user closes the readers in this order:

multiReader1.close();
multiReader2.close();
reader2.close();
reader3.close();

reader1 should be marked as closed after 2., right? Because multiReader1.close() and multiReader2.close() have to decrement reader1's refcount. But the underlying files have to remain open until after 4., because reader2 and reader3 use reader1's resources.

So don't we need 2 refcount values in reader1? One that tells us when the reader itself can be marked as closed, and one that tells when the resources can be closed? Then multiReader1 and multiReader2 would decrement the first refCount, whereas reader2 and reader3 both have to "know" reader1, so that they can decrement the second refcount.

I hope I'm just completely confused now and someone tells me that the whole thing is much simpler :-)

asfimport commented 17 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It's not nearly this complex (we don't need two ref counts). If we follow the simple rule that "every time reader X wants to use reader Y, it increfs it" and "whenver reader X is done using reader Y, it decrefs it", all should work correctly.

Also we should think of "close()" as the way that the external user does the decref of their reader. We just special-case this call, by setting isOpen=false, to make sure we don't double decref on a double close call.

Let's walk through your example...

I'm assuming in your example you meant for reader2 and reader3 to also be SegmentReaders? Ie, the changes that are happening to the single-segment index1 are just changes to norms and/or deletes. If not, the example is less interesting because reader1 will be closed sooner :)

Also in your example let's insert missing "reader1.close()" as the very first close? (Else it will never be closed because it's RC never hits 0).

When reader1 is created it has RC 1.

When multiReader1 is created, reader1 now has RC 2.

When multiReader2 is created, reader1 now has RC 3.

When reader2 is created (by reader1.reopen()), it incref's reader1 because it's sharing the sub-readers in reader1. So reader1 now has RC 4.

When reader3 was created (by reader2.reopen()), it incref's reader2 because it's sharing the sub-readers reader2 contains. So reader1 is still at RC 4 and reader2 is now at RC 2.

Now, we close.

After reader1.close() is called, reader1 sets isOpen=false (to prevent double close by the user) and RC drops to 3.

With multiReader1.close(), multiReader1 is not at RC 0, and so it decrefs all readers it was using, and so reader1 RC is now 2.

With multiReader2.close(), likewise it is now at RC 0 and so it decrefs all readers it was using, and so reader1 RC is now 1.

With reader2.close(), it decrefs its own RC, however that brings its RC to 1 (reader3 is still referring to it) and so it does not decref the reader1 that it's referring to.

Finally, with reader3.close(), it is now at RC 0 and so it decrefs the reader2 it refers to. This brings reader2's RC to 0, and so reader2 decrefs the reader1 that it's referring to. Which brings reader1's RC to 0, and so reader1 finally closes all its internal sub-readers.

apache / lucene

IndexReader.reopen() [LUCENE-743] #1818

2106

1906

2062