apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.6k stars 1.01k forks source link

Make it possible to subclass SegmentReader [LUCENE-2345] #3421

Open asfimport opened 14 years ago

asfimport commented 14 years ago

I would like the ability to subclass SegmentReader for numerous reasons:

currently this isn't really possible

I propose adding a SegmentReaderFactory that would allow creating custom subclasses of SegmentReader

default implementation would be something like:

public class SegmentReaderFactory {
  public SegmentReader get(boolean readOnly) {
    return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
  }

  public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
    return newSegmentReader(readOnly);
  }
}

It would then be made possible to pass a SegmentReaderFactory to IndexWriter (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, etc)

I could prepare a patch if others think this has merit

Obviously, this API would be "experimental/advanced/will change in future"


Migrated from LUCENE-2345 by Tim Smith, updated May 09 2016 Attachments: LUCENE-2345_3.0.patch, LUCENE-2345_3.0.plugins.patch

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

+1

asfimport commented 14 years ago

Tim Smith (migrated from JIRA)

that's the reassurance i needed :)

will start working on a patch tomorrow will take a few days as i'll start with a 3.0 patch (which i use), then will create a 3.1 patch once i've got that all flushed out

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

Ahem, that directly clashes with my inflight patch?

I'm going to land create/reopen/close refactoring pretty soon, and then finish plugins. (Which was an arduous task before, as initialization paths for SegmentReaders are really, really insane)

I'm on this patch for quite some time, I understand that, so maybe we can settle on a deadline? Don't like to see the effort vaporate, also merging is gonna be hell with flex branch alone, don't want to double it :)

asfimport commented 14 years ago

Tim Smith (migrated from JIRA)

i'll do my initial work on 3.0 so i can absorb the changes now and will post that patch

at which point, i can wait for you to finish whatever you need, or we can just incorporate the same ability into your patch for the other ticket i would just like to see the ability to subclass SegmentReader's on 3.1 so i don't have to port a patch when i absorb 3.1 (just use the "finalized" apis)

asfimport commented 14 years ago

Tim Smith (migrated from JIRA)

Here's a patch against 3.0 that provides the SegmentReaderFactory ability (not tested yet, but i'll be doing that shortly as i integrate this functionality)

It adds a SegmentReaderFactory.

The IndexWriter now has a getter and setter for setting this

SegmentReader has a new protected method init() which is called after the segment reader has been initialized (to allow subclasses to hook this action and do additional initialization, etc

added 2 new IndexReader.open() calls that allow specifying the SegmentReaderFactory

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

The IndexWriter now has a getter and setter for setting this

If this is not expected to change during the lifetime of IW, I think it should be added to IWC when you upgrade the patch to 3.1.

asfimport commented 14 years ago

Tim Smith (migrated from JIRA)

that was my plan

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I think we should only commit this only on 3.1 (new feature)?

Earwin: will this change really conflict w/ your ongoing refactoring (to have DirReader subclass MultiReader)? It seems somewhat orthogonal?

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

will this change really conflict w/ your ongoing refactoring (to have DirReader subclass MultiReader)? It seems somewhat orthogonal?

Ongoing contains a massive rework of how open/reopen/clone/close is done. Folding copypaste between DirReader and MultiReader is a byproduct. The aim is to have clean initialization, get rid of of init/reinit/moreinit methods, moving the code to constructors. This alone plays bad with factories.

The next step, plugins - conflicts ideologically. Plugins allow extension by composition, which is (in my view) much more clean than subclassing. Landing them without refactoring all this mess beforehand is dangerous, I already hit some places where reopen/clone were done wrong in respect to existing reader fields. Even more - NRT reader lifecycle is really messy - it beginnings as a reader aimed only for merging, and then being upgraded in two separate stages is finally wrapped in DirReader to be used for actual searching. The factory/init() approach ignores this, and each user of this API will be on his own to separate lightweight readers from full-fledged ones.

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Earwin, w/o knowing too much about the details of your work, I wanted to comment on "get rid of of init/reinit/moreinit methods, moving the code to constructors". I work now on Parallel Index and one of the things I do is extend IW. Currently, IW's ctor code performs the initialization, however I'm thinking to move that code to an init method. The reason is to allow easy extensions of IW, such as #3406. There I'm going to add a default ctor to IW, accompanied by an init method the extending class can call if needed. So what I'm trying to say is that init methods are not always bad, and sometimes ctors limit you. Perhaps it would make sense though in what you're trying to do ...

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Non-final/private method calls from ctors is discouraged, as it creates problems:

Because of this, such init() methods should be private or at least final and never public!

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here a good explanation: http://www.informit.com/articles/article.aspx?p=20521

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, I guess for now let's hold off on this issue while we iterate at least with Earwin's first refactoring effort...?

Earwin, can you post a patch w/ your current state, even if it's "rough"? There seems to be alot of interest/opinions on how to "fix" things here :) Both IndexWriter & SegmentReader are "needing" extensibility... but this is a big change so I'm hoping we can first just get your refactoring done before adding extensibility.

Tim, do you think the plugin model ("extension by composition") would be workable for your use case? Ie, instead of a factory enabling subclasses of SegmentReader?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Thanks Uwe, I know that ctor is the preferred way, and in the process of introducing IWC I delete IW.init which all ctors called and pulled all the code to IW ctor. I will make that init() on IW final. But sometimes putting code in init() is not bad (and it's used in Lucene elsewhere too (e.g. PQ and up until recently IW).

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

More often than not init() methods are a sign of bad design. I.e. in your case extending IW is crazy. You should have an interface capturing IW methods and two implementations - one writing to the index and another delegating to its subwriters. You don't do DirectoryReader extends SegmentReader, do you? They both extend lucene-style-interface IndexReader.

Lucene's back compat policy got people used to writing and digesting freaky code, and I'm not going to fight generally against it, that's a lost cause : ) But in this exact case (readers) stuffing everything in constructors, defining all fields I can final, replacing SR.openDocStores/loadTermsIndex with reopen()-like method allows me to tackle (at least somewhat) lifecycle complexity. Javac will force me to either initialize something or explicitly leave it null. Besides some real bugs in this code, I cleaned up cases where a field was inited twice, just in case! Poor developers got lost in init() methods : }

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

Earwin, can you post a patch w/ your current state, even if it's "rough"?

Ahem. Right now it's more or less finished for Multi/Directory/MutableDirectory/WriterBackedDirectory-readers. But the SegmentReader is in shambles (i.e. does not compile yet). Should I post asis?

asfimport commented 14 years ago

Shai Erera (@shaie) (migrated from JIRA)

Earwin, I wholeheartedly agree with what you wrote. If we could refactor IW and extract it to a set of interfaces, then I agree (and Michael B. has an issue open for that). I think though that IW's API is already that interface (give or take few methods). So perhaps this can be an easy refactoring - introduce an Indexer (a la Searcher) class (or interface) w/ all of IW public methods, and then let PW extend/impl that class/interface as well as IW. We can also consider making IW itself final this way (though bw police will prevent it :)).

Then when PW sets up the slices, it can create them as IW or any other IW-like implementation it needs them to impl. If it sounds good enough to become its own issue, I can open one and we can continue discussing it there (and leave that issue focused on extending SR). Then I'll hold off w/ #3406, or simply rename it to reflect that Indexer API.

asfimport commented 14 years ago

Tim Smith (migrated from JIRA)

I think we should only commit this only on 3.1 (new feature)?

3.1 only of course (just posted a 3.0 patch now as that's what i'm using and i need the functionality now)

Tim, do you think the plugin model ("extension by composition") would be workable for your use case? Ie, instead of a factory enabling subclasses of SegmentReader?

As long as the plugin model allows the same capabilities, that could work just fine and could be the final solution for this ticket.

I mainly need the ability to add data structures to a SegmentReader that will be shared for all SegmentReader's for a segment, and then add some extra meta information on a per instance basis

Is there a ticket or wiki page that details the "plugin" architecture/design so i could take a look?

However, would the plugins allow overriding specific IndexReader methods?

I still would see the need to be able to override specific methods for a SegmentReader (in order to track statistics/provide changed/different/faster/more feature rich implementations) I don't have a direct need for this right now, however i could envision needing this in the future

Here's a few requirements i would pose for the plugin model (maybe they are already though of):

asfimport commented 14 years ago

Tim Smith (migrated from JIRA)

Here's a patch (again, against 3.0) showing the minimal API i would like to see from the plugin model

asfimport commented 14 years ago

Tim Smith (migrated from JIRA)

found one issue with the plugins patch

With NRT indexing, if the SegmentReader is opened with no TermInfosReader (for merging), then the plugins will be initialized with a SegmentReader that has no ability to walk the TermsEnum.

I guess SegmentPlugin initialization should wait until after the terms index is loaded or have another method for catching this event to the SegmentPlugin interface

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

I mentioned this issue earlier. My patch removes loadTermsIndex method from SegmentReader and requires you to reopen it. At that moment you can stuff it with a set of plugins without leaving 'final' paradise.

There are still things to consider. Some SR guts could be converted to plugins themselves, so you can override them with your implementation if you wish. If that is so, there should be a way to decide which of the plugins should be loaded for current SR mode. My previous design loaded plugins only for full-fledged readers.

Is there a ticket or wiki page that details the "plugin" architecture/design so i could take a look?

The design itself was never published, rather it was discussed several times on a mailing list, and simmered inside my head for some time. I have a first impl running at my workplace, but it is really fugly :) I will flesh out a proper description after the refactoring, but it has all your points in one or another form. Biggest differences are - they are keyed by Class not by String, they declare multiple interfaces they provide, they declare dependencies, they don't have to implement/extend anything, as all the hooks are on the factory classes. Also, they extend not only SegmentReader, but the whole hierarchy - SR, MR, DR, whatever.

The problem of overriding SR methods is solved by delegating these methods to plugins, which can be either default or user-provided. (But remember the question of which subset should be initialized for which SR mode)

asfimport commented 14 years ago

Marvin Humphrey (migrated from JIRA)

> Is there a ticket or wiki page that details the "plugin" architecture/design > so i could take a look?

FWIW, KinoSearch has a complete prototype implementation of this design, based loosely on the mailing list conversations that Earwin referred to.

http://www.rectangular.com/svn/kinosearch/trunk/core/

asfimport commented 14 years ago

Tim Smith (migrated from JIRA)

My patch removes loadTermsIndex method from SegmentReader and requires you to reopen it.

that's definitely much cleaner and would solve the issue in my current patch (sadly i'm on 3.0 and want to keep my patch there at a minimum until i can port to all the goodness on 3.1).

Also, they extend not only SegmentReader, but the whole hierarchy - SR, MR, DR, whatever.

i just wussed out and just did only the SegmentReader case as thats all i need right now

as all the hooks are on the factory classes

could you post your factory class interface? If i base my 3.0 patch off that i can reduce my 3.1 port overhead.

are there any tickets tracking your reopen refactors or your plugin model? If not, feel free to retool this ticket for your plugin model for Index Readers as that will solve my use cases (and then some)

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

Just created #3431 to track reopen stuff. No issue for plugin model yet, but I'll probably create it, can't edit this one, I'm no committer.

asfimport commented 11 years ago

Steven Rowe (@sarowe) (migrated from JIRA)

Bulk move 4.4 issues to 4.5 and 5.0

asfimport commented 10 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Move issue to Lucene 4.9.