apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.7k stars 1.04k forks source link

Refactoring of FilteredTermsEnum and MultiTermQuery [LUCENE-2110] #3186

Closed asfimport closed 14 years ago

asfimport commented 14 years ago

FilteredTermsEnum is confusing as it is initially positioned to the first term. It should instead work like an uninitialized TermsEnum for a field before the first call to next() or seek(). FilteredTermsEnums cannot implement seek() as eg. NRQ or Automaton are not able to support this. Seeking is also not needed for MTQ at all, so seek can just throw UOE. This issue changes some of the internal behaviour of MTQ and FilteredTermsEnum to allow also seeking in NRQ and Automaton (see comments below).


Migrated from LUCENE-2110 by Uwe Schindler (@uschindler), resolved Dec 06 2009 Attachments: LUCENE-2110.patch (versions: 6), LUCENE-2110-nextSeekTermUpd.patch Linked issues:

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I will work on this tomorrow and provide a patch. I will also update the patch in #2680 to move the initial seek out of ctor (its easy, see below).

The setEnum method should be renamed in something like setInitialTermRef(). So the default impl of next() will seek to the correct term and do not seek by default (iterate all terms of field).

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here my patch for this.

I rewrote the whole FilteredTermsEnum and made it natively support seeking needed for NRQ and Automaton.

This initial patch is for review only, but all tests pass. I will try to modify Robert's patch, as soon as he provided me an updated Patch for Automaton flex branch.

The enum works different than before: It is positioned before the first term (like it should), seeking is no longer supported (as not needed for MTQ) and not implementable for seeking enums like NRQ or Automaton.

In the constructor you give index reader and field name, as TermsEnum can only iterate one field in flex, this is no limitation.

For non-seeking enums you can set the initial term to seek to with setInitialSeekTerm(TermRef) in the ctor. The rest of the enum then behaves as before.

For seeking enums like Automaton/NRQ you override a secondary iterator method nextSeekTerm() that returns the next TermRef the underlying iterator should seek to. This method is called, when accept() returns END (and also on the first next() call, of course). The default impl of this method just returns the initial seek term as explained above one time and then null.

Everything else stands in the javadocs.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, I will look at re-porting automaton to flex so you can test this. (now it has good tests and sort order/unicode crap is fixed and they should all pass).

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

This is a great improvement Uwe... I like it.

Is an MTQ allowed to return nextSeekTerm's out of order? (I know NRQ/automaton don't need to do so, but, if it's fine we should maybe call that out in the javadocs...). Though, FilteredTermsEnum, being a "TermsEnum", is "supposed" to return terms in getTermComparator() order... however its consumers (the rewrite methods for MTQ) usually don't in fact care. Hmm I wonder if it should even subclass TermsEnum? It doesn't seek and it's free to return terms in a different order...

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Updated patch also incorporating the needed changes for SingleTermsEnum to make it work with new API. Now it is at least a 5-code-liner :-)

I also fixed a method call instead of parameter usage in TermRangeTermsEnum. Also added Mike's comment In my opinion, we should keep it as TermsEnum, even when seeking does not work, which is documented. In my code I often use PrefixTerm(s)Enum for autocomplete cases - works good - and for that it is only handles as a Term(s)Enum for iterating making it simplier to reuse code working on Term(s)Enums. Also made some mebers final, I forgot this during restructuring the code.

What I forgot to mention: I made the abstract methods in FilteredTermsEnum also throw IOException, so maybe subclasses, doing strange things, would compile.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

In my opinion, we should keep it as TermsEnum, even when seeking does not work, which is documented

OK, let's keep it as subclassing TermsEnum. Maybe we should relax the docs for TermsEnum to state that each subclass determines order. Nothing in TermsEnum itself requires a particular order.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New patch with the attribute support of #3185.

I will now port Automaton and apply will provide a combined patch there.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Hey Uwe, since your editing this code anyway, wanna add a comment fix for the ref of TermInfo here?

+          // Loading the TermInfo from the terms dict here
+          // should not be costly, because 1) the
+          // query/filter will load the TermInfo when it
+          // runs, and 2) the terms dict has a cache:
asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

After porting Automaton, I realized, that the seeking code should be changed and made a little bit more flexible.

AcceptStatus can now return 5 stati:

nextSeekTerm() should always return a greater term that the last one before seeking. This is asserted by NRQ. It is not bad to do this, but after that the enum is no longer correctly sorted. Also, if the consumer reaches the last term of the underlying enum, call next() will end enumeration and so further terms in the nextSeekTerm() interation will not consulted (the same happens when END is returned in accept, of course).

If nextSeekTerm() returns null, the enumeration is also ended, so it is not required to return AcceptStatus.END instead of X_AND_SEEK.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Mark: I do not know about what you are talking about (sorry, my brain is fuming after automaton).

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

No problem, we can get it after - its not really related, just figured since you were patching here anyway and I happened to notice it will taking a look at the patch:

TermInfo is no longer used in flex, but its referenced in the above comment, in MTQ.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, I really like what you have done here (as commented on LUCENE-1606)

Seeking around in a filteredtermsenum is even simpler here. (in my opinion, this thing is very tricky with trunk and it is good to simplify)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

nextSeekTerm() should always return a greater term that the last one before seeking.

Uwe, why was this constraint needed? What goes wrong if we allow terms to be returned out of order? The consumers of this (MTQ's rewrite methods) don't mind if terms are out of order, right?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

It will work (theoretically) but can fail: if you seek to the last term and accept it, the next call to next() will end the enum, even if there may be more positions to seek. You cannot rely on the fact that all seek terms are visited. Because of that it should be foreward only, if other, you must know what you do

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I have a solution for this problem: If the end of the enum is reached i just asks for a new term is seek==true (that is what iwas before). But nextPrefixTerm() gets the information that the end was already finished and could return null then. This is important for automaton, because it would loop endless else (because it would produce terms and terms and terms... in nextSeekTerm).

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Attached is patch that allows the TermsEnum to go backwards and not break if end of underlying TermsEnum is reached after next() or seek().

The method nextSeekTerm() gets a boolean if the underlying TermsEnum is exhausted. Enums that work in order can the simply return null to break iteration. But they are free to reposition to a term before.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

fixed patch - i have to stop for today.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Stop everything I get a collaps!!!!! Again wrong patch.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Now the final one.

I somehow need a test enum which does very strange things like seeking forward and backwards and returning all strange stati.

Will think about one tomorrow.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Robert and me analyzed the latest change. It is so complex and I am really not sure, if we should do this. It is impossible to maintain this.

We should enforce only seeking forwards (even if MTQ could accept terms out of order). Violating TermsEnums order stupid, so we should use the patch before. NRQ and also Automaton enforce stepping forwards only.

Mike?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

yeah compared to the last patch, the backwards seeking makes the code more complex in my opinion.

i do not understand why a MTQ would need to backwards seek? Can we say instead if you want to do such a thing with flexible indexing, that the way is to instead define custom sort order in your codec?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

+1, I reverted it here completely. It is not senseful to support unordered filtered enums. If somebody wants to implement that he should do it otherwise by overriding next() himself and not use nextSeekTerm() and accept().

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

We should enforce only seeking forwards (even if MTQ could accept terms out of order). Violating TermsEnums order stupid, so we should use the patch before. NRQ and also Automaton enforce stepping forwards only.

OK, I hadn't realized this was adding so much complexity to the code, so I agree let's revert it and require FTE to always move forward according to getTermComparator its actualEnum. We don't have a need for this today, anyway. Design for today ;)

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

From our IRC chat between Mike and me:

[10:19] ThetaPhi: mike: any further comments for 2110? [10:19] mikemccand: am just trying to look at it – waiting on a sloooooow svn up. [10:19] ThetaPhi: this out of order is not trivial and makes code for automaton and NRQ more complicated [10:19] mikemccand: the change is a great step forward [10:19] mikemccand: yeah forget it [10:19] ThetaPhi: yes [10:19] ThetaPhi: you understood the problem [10:19] mikemccand: i had thought it was nearly free [10:19] mikemccand: i don't fully understand the problem [10:19] mikemccand: impossible to keep up with you!! [10:20] ThetaPhi: when the underlying enum is exhausted (I was exhausted yesterday, too), it stops processing, even if there are more nextSeelkTerms [10:20] mikemccand: yes I saw your exhausted pun ;) [10:20] mikemccand: wait – if a TermsEnum becomes exhaused, you're unable to seek it again? is that the issue? [10:20] ThetaPhi: waht you could do is, that after the underlying enum is exhausted, that you call nextSeekTerm and reposition [10:21] ThetaPhi: yes [10:21] ThetaPhi: you could seek it [10:21] ThetaPhi: and this worked at the beginning (first patch without the additional enum constants) [10:21] ThetaPhi: what it did was: when the underlying enum was exhausted it simply called nextSeekTerm and seeked [10:22] mikemccand: so TermsEnum API allows seeks after next()'ing to exhaustion, right? (I think it should but maybe it' doesn't somewhere?) [10:22] ThetaPhi: which was ok [10:22] ThetaPhi: mikemccand: your enums works perfect [10:22] ThetaPhi: they can be seeked everytime even when exhausted [10:22] mikemccand: ok [10:23] ThetaPhi: the problem is logic in filteredtermsenum [10:23] mikemccand: ok [10:23] ThetaPhi: it gets very complicated when you want to support two things: [10:23] ThetaPhi: a) seeking on request (NO_AND_SEEK, YES_ANDSEEK) [10:24] ThetaPhi: b) and want to consume all nextSeekTerms() [10:24] ThetaPhi_: if somebody seeks after _ANDSEEK to e.g. the last term of the enum [10:24] ThetaPhi: in the current impl it will read that term [10:25] ThetaPhi: call termCompare and so on [10:25] ThetaPhi: if this accepts the term or not accepts the term [10:25] ThetaPhi: it will call next() on the underlying enum again [10:25] ThetaPhi: which then returns null [10:26] ThetaPhi: with the current patch it will then also end the filteredtermsenum [10:26] mikemccand: ok – so the termination logic inside FTE.next() got really hairy if on getting null from the actualEnum you had to consider going back for another seek term? [10:26] ThetaPhi: yes [10:26] ThetaPhi: i tried it yesterday [10:26] mikemccand: we certainly don't need to support thist oday [10:26] ThetaPhi: and I had no good test [10:27] ThetaPhi: yes exactly [10:27] mikemccand: if somehow this ever becomes needed then we add it then [10:27] mikemccand: the new FTE looks great [10:27] ThetaPhi: the idea was to provide a param to nextSeekTerm which denotes if the underlying enum ended [10:27] mikemccand: NRQ is soooo much simpler :) [10:28] ThetaPhi: the idea was to call even on exit nextSeekTerm(true), where triue meant exhausted [10:28] mikemccand: i see – so it involved different API [10:28] ThetaPhi: TEnums like NRQ or automaton only going forward then know, not to provide seek terms again [10:28] ThetaPhi: this was important for Automaton not to fall into endless loops [10:29] mikemccand: this is hairy stuff :) [10:29] ThetaPhi: and good for NRQ to not provide any further terms [10:30] ThetaPhi: when I tried to implement there was always a problem and you were not able to correctly define waht should happen then [10:30] mikemccand: how come the switch in FTE.next doesn't have "case NO" on the return from accept(term)? [10:31] ThetaPhi: thats not needed it just falls through and loops to next term [10:31] ThetaPhi: you could add it as NO: break; [10:31] * mikemccand ahh got it – maybe just add comment saying so [10:31] ThetaPhi: good idea, will do [10:32] ThetaPhi_: the NO_ANDseek case is the same it just says doSeek = true [10:32] ThetaPhi: and for yes it returns the term in both cases, but records doSeek for the next time next() is called [10:32] mikemccand: i wonder if we could simply add a seek term to AcceptStatus? [10:33] mikemccand: vs calling the separate nextSeekTerm method [10:33] ThetaPhi: enum's contents are final [10:33] mikemccand: not sure it'd be better... just wondering [10:33] ThetaPhi: because each constant is a singleton, they should be unmodifiable [10:34] mikemccand: ie, fix AcceptStatus to be like TermsEnum.SeekResult [10:34] mikemccand: so it returns status, but also an optional seekTerm which if null means no seeking [10:34] ThetaPhi: and how to do this, not with an enum constant [10:34] ThetaPhi: because they are constants [10:35] ThetaPhi: and createing a new object on each accept call seems not ideal [10:35] mikemccand: sorry, i was confused – SeekStatus used to be a class that had two attrs – a TermRef, and the enum; i simplified that a while ago [10:36] ThetaPhi: aaaah [10:36] ThetaPhi: now its an enum [10:36] ThetaPhi: i do not think its a ood idea to return new objects [10:36] mikemccand: right. for a while I had no TermsEnum.term() [10:36] mikemccand: the enums would reuse their status object [10:36] mikemccand: ie API would require that this is fine [10:37] ThetaPhi: but enums are singleton [10:37] ThetaPhi: for the whole JVM [10:37] mikemccand: if we did that... then we could go back to YES/NO/END, and, and seekTerm is then orthogonal [10:37] ThetaPhi: ok [10:37] ThetaPhi: so a protected setSeekTerm [10:38] mikemccand: we'd make a new "AcceptResult" class. has "AcceptStatus status" member, and also "TermRef seekTerm" [10:38] ThetaPhi: then we are back at the state before, we should ask robert [10:38] mikemccand: ok [10:38] ThetaPhi: for him the separation of accept and nextSeekTerm was good [10:38] mikemccand: ahh i see [10:39] ThetaPhi: but he could do the calculation of the next sring in accept, too [10:39] mikemccand: though, can't he simply emulate? ie, internally call a private nextSeekTerm, stick it on the returned AcceptResult, and return that? [10:39] mikemccand: right [10:39] ThetaPhi: telephone... [10:39] mikemccand: do we ever call nextSeekTerm, unless AcceptStatus was XXX_ANDSEEK? [10:39] mikemccand: hah real world interrupts [10:40] ThetaPhi: thats what is done currentl [10:40] ThetaPhi: brb [10:41] mikemccand: man all of infra seems slow right now... jira giving me intermittent Internal Server Error... svn really slow [10:41] mikemccand: git is tempting [11:04] *** jwtoddii has joined #lucene. [11:15] *** jwtoddii has signed off IRC (). [11:15] ThetaPhi: re [11:15] mikemccand: hello [11:15] ThetaPhi: for me infra works normal [11:16] mikemccand: hmmm [11:16] ThetaPhi: but maybe because i am on the euorpe svn [11:16] mikemccand: you use the eu mirrors? [11:16] mikemccand: ok [11:16] ThetaPhi: but for jira i do not know [11:16] mikemccand: jira is on/off for me now [11:16] ThetaPhi: do not think there is a mirror [11:17] ThetaPhi_: [10:39] mikemccand: do we ever call nextSeekTerm, unless AcceptStatus was XXX_ANDSEEK? [11:17] ThetaPhi: no [11:18] mikemccand: ok [11:18] ThetaPhi: and doSeek is always reset to false whenever a seek occured [11:19] ThetaPhi: the global doSeek is just for the YES_ANDSEEK case [11:19] ThetaPhi: because it exits the loop and returns the term [11:19] ThetaPhi: so on the next call to next() doSeek is true and we seek [11:20] mikemccand: got it [11:21] ThetaPhi: an idea about the nextSeekTerm: [11:21] ThetaPhi: we add a protected setSeekTerm(TermRef) [11:21] ThetaPhi: this sets a private var [11:22] ThetaPhi: accept can set this and return YES or no wahtever [11:22] ThetaPhi: in the case of END it is ignored [11:22] ThetaPhi: but when next() is then called again, if this internal seek varaible is != null it seeks, else it goes forward with next [11:22] ThetaPhi: and after seek it is set to null [11:23] ThetaPhi: and we could add an assert inside this method to check if only seeking forward [11:24] ThetaPhi: to inform implementors of errors [11:24] ThetaPhi: assert termComp.compare(seekTerm, term()) > 0; [11:24] mikemccand: why not simply return the optional seek term along w/ AcceptStatus? it'd require no additional methods, and makes it clear that seeking part of accepting. [11:25] mikemccand: ie the existince of new methods in the api (setSeekTerm, nextSeekTerm) enrich the api – make you wonder when you can call them. eg can ctor call setSeekTerm? [11:25] ThetaPhi: because accept status is an enum constant which is final, singleton and [11:25] ThetaPhi: yes ctor can do it at the moment [11:25] ThetaPhi: its already there its only called setinitialseekterm [11:26] ThetaPhi: (see docs) [11:26] mikemccand: i mean make a new class (AcceptResult), like i described. it contains an AcceptResult.Status (the enum), and a TermRef. [11:26] ThetaPhi: the problem is lots of object creation [11:26] mikemccand: there would be no object creation – we'd reuse? [11:26] ThetaPhi: in my opinion setSeekTerm is simplier to use [11:27] ThetaPhi: it would be the same like the reuse case [11:27] ThetaPhi: and setseekterm can also be used in ctor [11:27] mikemccand: yeah the fact that ctor needs to set initial term does seem to require the extra method [11:28] mikemccand: though i also don't like side-effect methods – you change the internal state of the class, vs returning the seek term [11:28] mikemccand: ie it makes the api stateful, which except for the initial case, is overkill [11:29] ThetaPhi: its an iterator it always has an internal state [11:29] ThetaPhi: i understand your problem [11:29] ThetaPhi: you call in a method that should only accept something a method that changes state [11:30] mikemccand: the internal state is handled by FTE; my subclass is mostly stateless, if i return the seek term [11:30] ThetaPhi: so you cannot call it from outside (which would not wrok, because it is protected) [11:30] ThetaPhi: NRQ also has a state and also automaton [11:31] ThetaPhi: in NRQ it is the linkedlist with seek terms [11:31] ThetaPhi: in Automaton the nextString() [11:31] mikemccand: right, they have their own state because of how they iterate. it's just that seek term need not be stateful. [11:32] ThetaPhi: i think we should ask robert, at the moment i would know how to change his code to support both cases [11:33] mikemccand: in fact, instead of setInitialSeekTerm, could we have getInitialSeekTerm? ie FTE would invoke that once on start [11:33] mikemccand: ok :) [11:33] mikemccand: automaton is scary [11:33] mikemccand: and, powerful [11:33] ThetaPhi: the getInitialSeek term in the current api is hidden behind the iterator [11:34] ThetaPhi: setInitialSeekTerm for the ctor is just a convenience [11:34] ThetaPhi: to prevent subclasses like prefixtermsenum from overrideing and implementing the singleton itertor [11:34] * mikemccand w/ getInitialSeekTerm, there is no sneaky shared state w/ FTE. i mean, FTE keeps track of its state, and each subclass tracks its own state. subclassing wouldn't even be necessary anymore – one could provdie a standonalone TermsEnumFilter, that just has getInitialSeekTerm and accept [11:34] ThetaPhi: (see docs) [11:36] mikemccand: in general I don't like subclassing APIs – it's a bigger surface area (how child works w/ parent, stateful methods, when can i call each method, etc.) [11:36] ThetaPhi: we could remove that also in the current api [11:37] ThetaPhi: only abstract classes should be subclassable [11:37] ThetaPhi: thats asked very often [11:37] mikemccand: gonna be a cold run this AM – 28F out there [11:37] ThetaPhi: calculating [11:37] mikemccand: :) [11:37] mikemccand: invert 32 + 9/5 * C [11:38] ThetaPhi: or ask google [11:38] ThetaPhi: -2.22 °C [11:39] ThetaPhi: (Link: http://www\.google\.de/search?hl=de&safe=off&q=28+fahrenheit+in+celsius&meta=&aq=f&oq=)http://www\.google\.de/search?hl=de&safe=off&q=28+fahrenheit+in+celsius&meta=&aq=f&oq= [11:39] mikemccand: if i jump in and see that, to use FTE i only have to implement to an interface (TermsEnumFilter) that has getInitialTerm/accept... I think that's more approachable than figuring out the relationship (methods, state) to an abstract parent class [11:39] mikemccand: nice [11:40] mikemccand: ugh, zillions of conflicts on backporting thread safe spellchecker [11:40] ThetaPhi: if the abstract parent class makes all other methods final its the same [11:41] ThetaPhi: but ok, you could change MTQ to just return the interface [11:41] mikemccand: not really the same – you have to define when each overridden method is allowed/supposed to invoke the methods from the parent [11:42] mikemccand: once (if) we make the interaction stateless,we suddenly have the freedom to make it a separate interface... [11:43] ThetaPhi: you mean calling next() inside accept() -> loops forever [11:43] mikemccand: yeah :( [11:43] mikemccand: of course... that freedom of the orig FilteredTermEnum is what made NRQ sneakiness possible in the first place ;) [11:44] ThetaPhi: (which was a hack, you have seen the comment: something like: this relys on how setEnum() works, if this changes in the superclass this enum will no longer work) [11:45] ThetaPhi: this was a comment before i removed the recursion in one of this issues [11:45] mikemccand: right – but it was the right hting to do at the time [11:45] mikemccand: sure was tricky to undertsand :) [11:45] ThetaPhi: (ore remove the recursion, like it is now in trunk) [11:45] mikemccand: you were bound by FTE [11:45] mikemccand: yes, that's better [11:45] mikemccand: but i like separate interface best – it removes all shared state – one thing describes what's filtered, the other implements according to that [11:46] mikemccand: then MTQ could almost simply accept a TermsEnum. the only difference is difference() – ha [11:46] mikemccand: it's only fuzzy that uses difference() right? [11:46] ThetaPhi: yes [11:46] ThetaPhi: i was thinking about that [11:47] ThetaPhi: because a empty prefix filter could simply return the termsenum [11:47] mikemccand: ahh yes nice opto [11:47] ThetaPhi: PrefixQuery("") [11:47] ThetaPhi: would match all documents that have at least any term [11:47] mikemccand: actually how come contrib/queries' TermsFilter isn't a query filter wrapper around an MTQ? [11:47] ThetaPhi: in automaton we need this for the catch all .* case [11:47] mikemccand: ahh [11:48] ThetaPhi: at the moment it generates a PrefixTermsEnum("") [11:48] mikemccand: so maybe AcceptResult has a float difference? [11:48] ThetaPhi: which is slower than just returning the TermsEnum of the reader, just because of difference [11:49] ThetaPhi: because it calls startsWith for each term (which in fact does nothing because termref.length==0 [11:49] mikemccand: ugh [11:49] ThetaPhi: so not that bad [11:49] mikemccand: much better to simple return .terms() for field. or, rewrite to MatchAllDocs [11:49] ThetaPhi: or a termrangequery(null,null) (often used in solr) [11:49] ThetaPhi: no [11:49] ThetaPhi: MatchAllDocs is different [11:50] ThetaPhi: because the Prefix("") case only returns document that have at least one term of that field [11:50] mikemccand: ahhh yes [11:50] ThetaPhi: this is needed for facetting i think in Solr [11:50] ThetaPhi: not sure [11:50] mikemccand: got it [11:50] ThetaPhi: in NRQ its fast, it just enumerates all low-prec terms [11:51] ThetaPhi: 16 for precStep=4 [11:51] mikemccand: nice [11:51] mikemccand: NRQ is a great step forward for lucene [11:51] mikemccand: would be nice if MTQ could simply accept a TermsEnum [11:52] mikemccand: problem is difference() [11:52] ThetaPhi: yes [11:52] ThetaPhi: and is only used by fuzzy [11:52] mikemccand: yes, annoying for just that one case [11:52] mikemccand: contrib's TermsFilter really ought to be an MTQ [11:52] ThetaPhi: because robert asked for a alltermsenum, which is stupied [11:53] mikemccand: yeah [11:53] ThetaPhi: i only said, user prefix("") whoich is ok, the overhead is minimal, but in this case it would be zero [11:53] mikemccand: we could allow MTQ to accept either a TersmsEnum, or, a TermsEnumFilter (this new stateless interface, with AcceptResult also having a float difference field) [11:53] ThetaPhi: aaaah [11:54] ThetaPhi: but that looks like attributesource [11:54] ThetaPhi: differenceattribute [11:54] mikemccand: yeah that's true [11:54] ThetaPhi: termsenums are already with attributes [11:54] mikemccand: curious. so then MTQ could accept only TermsEnum, but, ask for its attrs, and if that's non-null, as for differenceattr, and if htat's non null, use it [11:55] ThetaPhi: fuzzyenum just add this attribute [11:55] mikemccand: i like that! [11:55] ThetaPhi: and fuzzyquery requests it [11:55] mikemccand: and MTQ respects it too [11:55] ThetaPhi: and the default of this attribute is 1.0 [11:55] mikemccand: ye [11:55] mikemccand: yes [11:55] ThetaPhi: so mtq just asks always for this attribute in scoring boolean rewrite [11:56] mikemccand: yes [11:56] mikemccand: and, MTQ is much simplified to accept a TermsEnum for its terms [11:56] ThetaPhi: yes [11:56] mikemccand: and we can make a stateless API for filtering a TermsEnum [11:57] ThetaPhi: by the way, I did not check ever<ything in flex [11:57] mikemccand: that's ok – it's an immense number of changes! [11:57] ThetaPhi: but if you wrap an enum, you must override attributes() to return the attributes of the delegate [11:57] mikemccand: i still keep finding bugs in the emulation layers. [11:57] mikemccand: hmm you're right [11:58] ThetaPhi: so UnionTermsEnum or like so must override attributes() and call delegate.attributes() [11:58] mikemccand: or, fuzzy query could be the only one that does htis [11:58] ThetaPhi: yes [11:58] ThetaPhi: no [11:58] ThetaPhi: let it in MTQ [11:58] ThetaPhi: we yesterday talked, normally rewrite should be final in MTQ [11:58] mikemccand: that sounds good [11:58] ThetaPhi: you should only change behaviour in getEnum [11:58] mikemccand: ok [11:59] ThetaPhi: in trunk we currently rewrite WildCardQuery to PrefixQuery which is itsself an MTQ [11:59] ThetaPhi: totally useless [11:59] ThetaPhi: just return PrefixEnum in getEnum [11:59] mikemccand: ahh right [12:00] ThetaPhi: in flex i changed [12:00] ThetaPhi: the problem are bw tests in trunk, because they check the rewritten thing, but we could simply remove the tests [12:01] mikemccand: tests shouldn't check internals like that [12:01] ThetaPhi: yes [12:01] ThetaPhi: or these checks should be marked [12:01] mikemccand: yes [12:01] ThetaPhi: in Junit4 with an annotation [12:01] mikemccand: ahh what annotation? [12:01] ThetaPhi: and the bw tests only run tests without that annotation [12:02] mikemccand: that sounds great [12:02] ThetaPhi: the problem may then still compilation [12:03] ThetaPhi: if it checks package protected fields and so on [12:03] ThetaPhi: brrr [12:03] ThetaPhi: we have such tests [12:03] mikemccand: sigh. [12:03] mikemccand: yes [12:03] mikemccand: i remember committing them ;) [12:03] *** mikemccand has left #lucene. [12:03] ThetaPhi: you always have to remove them [12:22] *** mikemccand has joined #lucene. [12:22] mikemccand: sorry, dropped off [12:25] ThetaPhi: ok i said [12:25] ThetaPhi: i will try out a little bit [12:25] ThetaPhi: and post in the automaton [12:25] mikemccand: ok [12:25] ThetaPhi: its hard to always generate both pacthes [12:26] ThetaPhi: when we are happy, i will post a patch to 2110 and then we commit [12:26] ThetaPhi_: in automaton there is more special cases to test

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hey guys, in my opinion it would make it easier if we could somehow iterate in the flex branch under this issue.

Do we really need a monster patch right now that is 100% perfect or can we exploit having this branch available to make discussions and review easier?

For example, it seems everyone agrees the current patch here is a good "step forward".

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New patch that implements the Attributes implementation to power scoring BooleanQuery searches. It changes:

I think thats all. Happy reviewing.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch looks good Uwe – just TestWildcard.testPrefixTerm, on back-compat tests, that's failing. This is a nice simplification of MTQs...

I'm wondering if we can further simplify the "seek" case in FilteredTermsEnum so that it's not a 2 step process (FTE.accept returns YES|NO_AND_SEEK, then, FTE.next calls nextSeekTerm), ie instead accept would return the new seek term in the returned status. This would also make the interaction between FTE and its subclass stateless. But let's take that up under a separate issue. I think this one is ready to go into flex branch?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New patch:

I think this is committable :-)

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I'll commit this soon after I added some javadocs to BoostAttribute.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Committed revision: 887779

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I forgot: Thank you all, Mike, Robert, Mark for all your suggestions in chat!

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here a small update that changed nextSeekTerm to also provide the current TermRef. It will be null on first call. This removes the ugly if (unitialized) code in Automaton.

Also the delegate enum is now private, as its state should be unchangeable by subclasses.

I will now work on a better approach with only accept using an extended AcceptStatus with next seek term). The current patch will now be committed.