Automaton Query/Filter (scalable regex) [LUCENE-1606]

asfimport commented 15 years ago

Attached is a patch for an AutomatonQuery/Filter (name can change if it's not suitable).

Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.

Some use cases I envision:

- lexicography/etc on large text corpora
- looking for things such as URLs where the prefix is not constant (http:// or ftp://)

The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:

 The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:

 1. Look at the portion that is OK (did not enter a reject state in the DFA)
 2. Generate the next possible String and seek to that.

The Query simply wraps the Filter with ConstantScoreQuery.
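
As a rough illustration of the two steps above, here is a self-contained Java sketch. It is not the patch's code: the term dictionary is a plain `TreeSet`, the DFA for `(http|ftp)://[a-z]+` is built by hand, and the class and method names (`SeekingEnumSketch`, `nextViable`, `matchingTerms`) are made up for this example. It also assumes every DFA state can still reach an accept state.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.NavigableSet;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy DFA-driven term enumeration; state 0 is the initial state. */
final class SeekingEnumSketch {
    // transitions.get(s) maps a char to the next state; a missing key means "reject".
    private final List<TreeMap<Character, Integer>> transitions = new ArrayList<>();
    private final Set<Integer> accept = new HashSet<>();
    int visited; // number of dictionary terms the last enumeration looked at

    int newState() { transitions.add(new TreeMap<>()); return transitions.size() - 1; }

    /** Follows or creates a chain of states spelling word; returns the last state. */
    int addWord(int from, String word) {
        int s = from;
        for (char c : word.toCharArray()) {
            Integer next = transitions.get(s).get(c);
            if (next == null) { next = newState(); transitions.get(s).put(c, next); }
            s = next;
        }
        return s;
    }

    /** Hand-built DFA for (http|ftp)://[a-z]+ */
    static SeekingEnumSketch urlDfa() {
        SeekingEnumSketch d = new SeekingEnumSketch();
        int start = d.newState();
        int http = d.addWord(start, "http");
        int ftp = d.addWord(start, "ftp");
        int colon = d.addWord(http, ":");
        d.transitions.get(ftp).put(':', colon); // both branches share the "://" tail
        int slashes = d.addWord(colon, "//");
        int host = d.newState();
        for (char c = 'a'; c <= 'z'; c++) {
            d.transitions.get(slashes).put(c, host);
            d.transitions.get(host).put(c, host);
        }
        d.accept.add(host);
        return d;
    }

    boolean accepts(String term) {
        int s = 0;
        for (char c : term.toCharArray()) {
            Integer next = transitions.get(s).get(c);
            if (next == null) return false;
            s = next;
        }
        return accept.contains(s);
    }

    /** Smallest string greater than term whose every char follows a DFA transition, or null. */
    String nextViable(String term) {
        int[] states = new int[term.length() + 1];
        int k = 0; // step 1: how much of the term is OK (did not hit a reject)
        while (k < term.length()) {
            Integer next = transitions.get(states[k]).get(term.charAt(k));
            if (next == null) break;
            states[++k] = next;
        }
        if (k == term.length() && !transitions.get(states[k]).isEmpty()) {
            return term + transitions.get(states[k]).firstKey(); // extend the OK term
        }
        // step 2: back up until some position can take a larger char
        for (int i = Math.min(k, term.length() - 1); i >= 0; i--) {
            Character c = transitions.get(states[i]).higherKey(term.charAt(i));
            if (c != null) return term.substring(0, i) + c;
        }
        return null; // nothing greater can match
    }

    List<String> matchingTerms(NavigableSet<String> dict) {
        List<String> hits = new ArrayList<>();
        visited = 0;
        String term = dict.isEmpty() ? null : dict.first();
        while (term != null) {
            visited++;
            if (accepts(term)) {
                hits.add(term);
                term = dict.higher(term);
            } else {
                String target = nextViable(term);
                term = (target == null) ? null : dict.ceiling(target); // the "seek"
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> dict = new TreeSet<>(List.of("apple", "ftp://abc", "ftp:x",
                "gopher://x", "http://lucene", "https://x", "zebra"));
        SeekingEnumSketch dfa = urlDfa();
        System.out.println(dfa.matchingTerms(dict) + " after visiting " + dfa.visited + " terms");
    }
}
```

The seek target is always strictly greater than the rejected term, so the loop makes progress, and any accepted term is itself a viable string, so no match can be skipped.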

I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

Migrated from LUCENE-1606 by Robert Muir (@rmuir), resolved Dec 09 2009

Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606_nodep.patch, LUCENE-1606.patch (versions: 15), LUCENE-1606-flex.patch (versions: 12)

Linked issues:

- 3186
- 3187
- 3166

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, sure. I will bring this patch up to speed (java 5, etc)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

updated patch to trunk:

- add support for optional regex features
- remove recursion
- improve performance for worst-case regexp/wildcard/FSM
- improved docs & test
- remove the fuzzy impl, NFA->DFA too slow for this, maybe a later addition.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

if anyone can spare a sec to take a glance/review before 3.0, i think it's ok...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

if no one objects, i'd like to commit this in a few days. Can someone help out and commit the update to NOTICE?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

No prob! I will help you, I am on heavy committing :-)

asfimport commented 14 years ago

Grant Ingersoll (@gsingers) (migrated from JIRA)

Why are new features going into 3.0? I was under the impression that 3.0 was just supposed to be cleanup plus Java 1.5

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Grant, I thought it was ok from Uwe's comment:

I move this to 3.0 (and not 3.1), because it can be released together with 3.0 (contrib modules do not need to wait until 3.1).

I guess now I am a little confused about what should happen for 3.0 with contrib in general? No problem moving this to 3.1, let me know!

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

3.0 is just the switch to 1.5 and generics. So this is a typical java 1.5 issue and can go into 3.0 even if it is a new feature. Contrib is not core and may have own rules.

In my opinion, this would be a nice addition to the regex contrib and should also have been in 2.9, but the underlying library is Java 5 only, so we had to wait until 3.0.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it :)

You should also post the info about that Fuzzy possibility you were mentioning - perhaps a math head will come along and take care of that for us with the proper setup.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it

Mark, some notes on size. jarring up the full source code (no paring) is 81KB. in practice, the jar file is larger because it contains some 'precompiled DFAs' for certain things like Unicode blocks, XML types... are these really needed?

see here for a list of what I mean: http://www.brics.dk/automaton/doc/dk/brics/automaton/Datatypes.html I enabled these in the patch (they could be easily disabled): an example of how they are used in a regexp is like this: <Arabic>* (match 0 or more arabic characters)

if a user really wanted them, they can load them themselves. you can also create custom ones and use a DatatypesAutomatonProvider to register them for some name. Example: your users want to be able to use <make> or <model> inside their regexps, you can register <make> and <model> to match to some DFA you make yourself. it's really a nice mechanism, but I don't think we need all the precompiled ones?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

You should also post the info about that Fuzzy possibility you were mentioning - perhaps a math head will come along and take care of that for us with the proper setup.

Right, i created a FuzzyQuery that builds in the 'naive' method. The problem is that for large strings this exponential-time naive mechanism creates a rather large NFA, and the NFA->DFA conversion is very slow. Once the DFA is built, actually running it on a term dictionary is fast :) So the slow part has nothing to do with lucene at all.

So we just need to build these DFAs in an efficient way. From the paper: "We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W." http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
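
For readers unfamiliar with the construction, here is a small Java sketch of one common textbook formulation of the Levenshtein NFA, with states (i, e) meaning "i characters of the word consumed, e edits used". It is an assumption that the naive FuzzyQuery mentioned above looked anything like this, and the class name is invented. The sketch simulates the NFA directly on a candidate string instead of determinizing it, which sidesteps the NFA->DFA step but would have to be run once per term.

```java
final class LevenshteinNfaSketch {
    /** Epsilon moves: deleting word[i] takes (i, e) to (i + 1, e + 1) without reading input. */
    private static void closeOverDeletions(boolean[][] active) {
        for (int i = 0; i + 1 < active.length; i++) {
            for (int e = 0; e + 1 < active[i].length; e++) {
                if (active[i][e]) active[i + 1][e + 1] = true;
            }
        }
    }

    /** True if candidate is within maxEdits insertions, deletions or substitutions of word. */
    static boolean within(String word, String candidate, int maxEdits) {
        int n = word.length();
        boolean[][] active = new boolean[n + 1][maxEdits + 1]; // (n+1)*(maxEdits+1) NFA states
        active[0][0] = true;
        closeOverDeletions(active);
        for (char c : candidate.toCharArray()) {
            boolean[][] next = new boolean[n + 1][maxEdits + 1];
            for (int i = 0; i <= n; i++) {
                for (int e = 0; e <= maxEdits; e++) {
                    if (!active[i][e]) continue;
                    if (i < n && word.charAt(i) == c) next[i + 1][e] = true; // match
                    if (e < maxEdits) {
                        next[i][e + 1] = true;                // insertion: read c, stay at i
                        if (i < n) next[i + 1][e + 1] = true; // substitution
                    }
                }
            }
            closeOverDeletions(next);
            active = next;
        }
        for (boolean reached : active[n]) {
            if (reached) return true; // whole word consumed within the edit budget
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(within("lucene", "lucne", 1));
        System.out.println(within("lucene", "lcuene", 1));
    }
}
```

Transpositions are not treated as a single edit here, so a swapped pair of letters costs two edits.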

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times. So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow in the same manner, regex of .abacadaba is fast, regex of .*abacadaba is still slow.

but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading *. I haven't implemented this here yet, though.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

By the way Mark, in case you are interested, the TermEnum here still has problems with 'kleene star' as I have mentioned many times. So wildcard of ?abacadaba is fast, wildcard of *abacadaba is still slow in the same manner, regex of .abacadaba is fast, regex of .*abacadaba is still slow.

No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ with this?

but there are algorithms to reverse an entire dfa, so you could use ReverseStringFilter and support wildcards AND regexps with leading *. I haven't implemented this here yet, though.

Now that sounds interesting - not sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

No problem in my mind - nothing the current WildcardQuery doesn't face. Any reason we wouldn't want to replace the current WCQ with this?

I don't think there is any issue. By implementing WildcardQuery with the DFA, a leading ? is no longer a problem. Of course, depending on your term dictionary, if you do something stupid like ???????abacadaba it probably won't be that fast.

I spent a lot of time with the worst-case regex, wildcards to ensure performance is at least as good as the other alternatives. There is only one exception, the leading * wildcard is a bit slower with a DFA than if you ran it with actual WildcardQuery (less than 5% in my tests). Because of this, the patch currently rewrites this very special case to a standard WildcardQuery.

Now that sounds interesting - not sure I fully understand you though - are you saying we can do a prefix match, but without having to index terms reversed in the index? That would be very cool.

No, what I am saying is that you still have to index the terms in reversed order for the leading * or .* case, except then this reversing buys you faster wildcard AND regex queries :)

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Okay - still not an issue I don't think - leading wildcards are already an issue - 5% is worth the other speedups I think - though you've taken care of that anyway - so sounds like gold to me. I didn't expect this to solve leading wildcard issues, so no loss to me.

No, what I am saying is that you still have to index the terms in reversed order for the leading * or .* case, except then this reversing buys you faster wildcard AND regex queries

bummer :) Does it make sense to implement here though? Isn't the ReverseStringFilter enough if a user wants to go this route? Solr's support for this is fairly good, but I don't think it needs to be as 'built in' for Lucene?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Does it make sense to implement here though?

I do not think so. I tested another solution where users wanted leading * wildcards on 100M+ term dictionary. I found out what was acceptable (clarification: to these specific users/system) was for * to actually match .{0,3} (between 0 and 3 of anything), and rewrote it to an equivalent regex like this. This performed very well, because it can still avoid comparing many terms.
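
A minimal sketch of that kind of rewrite, in Java. The helper name `toRegex` and the use of `java.util.regex` syntax are assumptions made for illustration (the patch uses the BRICS regex syntax, and the actual rewrite code is not shown in this thread). Only a `*` in first position is bounded; any other `*` stays unbounded.

```java
import java.util.regex.Pattern;

final class BoundedLeadingStar {
    /** Translates a wildcard to a regex, turning a leading '*' into .{0,maxLead}. */
    static String toRegex(String wildcard, int maxLead) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < wildcard.length(); i++) {
            char c = wildcard.charAt(i);
            if (c == '*') {
                sb.append(i == 0 ? ".{0," + maxLead + "}" : ".*");
            } else if (c == '?') {
                sb.append('.');
            } else if (Character.isLetterOrDigit(c)) {
                sb.append(c);
            } else {
                sb.append('\\').append(c); // escape punctuation so it stays literal
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String regex = toRegex("*abacadaba", 3);
        System.out.println(regex);
        System.out.println(Pattern.matches(regex, "xyzabacadaba"));
        System.out.println(Pattern.matches(regex, "wxyzabacadaba"));
    }
}
```

The bounded form changes what the query matches (terms with more than three leading characters are excluded), which is why it was only acceptable for those specific users.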

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

That is a cool tradeoff to be able to make.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

That is a cool tradeoff to be able to make.

Mark, yes. I guess someone could implement the DFA-reversing if they wanted to, to enable leading .* regex support with ReverseStringFilter. you can still use this Wildcard impl with ReverseStringFilter just like the core Wildcard impl, because it's just so easy to reverse a wildcard string.

but you don't want to try to reverse a regular expression! that would be hairy. easier to reverse a DFA.
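
To illustrate why the wildcard case is easy, here is a tiny Java sketch. It assumes a wildcard syntax with only `*` and `?` and no escape character, and the names are invented for this example. Because both operators are symmetric, reversing the pattern character by character is enough, and a pattern with a leading `*` turns into one with a long constant prefix that can be run against reversed terms.

```java
final class ReversedWildcardSketch {
    /** Reverses a term or a wildcard pattern; '*' and '?' need no rewriting. */
    static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }

    /** Number of literal characters before the first wildcard operator. */
    static int constantPrefixLength(String wildcard) {
        int i = 0;
        while (i < wildcard.length() && wildcard.charAt(i) != '*' && wildcard.charAt(i) != '?') {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        String pattern = "*abacadaba";
        String reversed = reverse(pattern);
        System.out.println(pattern + " has prefix length " + constantPrefixLength(pattern));
        System.out.println(reversed + " has prefix length " + constantPrefixLength(reversed));
    }
}
```

A regular expression has nested, order-sensitive syntax (groups, repetition suffixes, character classes), so the same character-level trick does not work there.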

but even without this, there are tons of workarounds, like the tradeoff i mentioned earlier. also, another one that might not be apparent is that it's only the leading .* that is a problem, depending on corpus of course.

[a-z].*abacadaba will avoid visiting terms that start with 1,2,3 or are in chinese, etc, which might be a nice improvement. of course if all your terms start with a-z, then it's gonna be the same as entering .*abacadaba, and be bad.

all depends on how selective the regular expression is wrt your terms.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

So Robert - what do you think about paring down the automaton lib, and shoving all this in core? I want it, I want, I want it

I think trying this out around in contrib (after 3.0 is released) would be best in the short term?

Separately, my quickly 'pared' automaton library is now 53KB jar (14 java files, some just simple POJO) Do you have a target size I should shoot for?

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think trying this out around in contrib (after 3.0 is released) would be best in the short term?

What are your concerns? If it passes the current wildcard tests and survives in trunk for a dev cycle, isn't that likely enough?

Do you have a target size I should shoot for?

As small as possible ;) But I don't personally have any issue adding 53k to the core jar for this goodness. Guess we will have to see what others say - but it's a low percentage of the current 1.1 MB, and pretty sweet functionality.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

What are your concerns? If it passes the current wildcard tests and survives in trunk for a dev cycle, isn't that likely enough?

i don't really have any, except that I don't necessarily trust the current wildcard tests. Shouldn't they have detected the 2.9.0 scorer bug? :)

As small as possible

ok, i will work at this some more. obviously i could pare it down to just what we are using, but i am trying to preserve 'reasonable' functionality that might be handy elsewhere.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

i don't really have any, except that I don't necessarily trust the current wildcard tests. Shouldn't they have detected the 2.9.0 scorer bug?

If they caught that, they wouldn't catch another :) How do you want to improve them? I'll help test and write tests - we can make something much more intensive if you'd like, and then just put a flag to tone it down for normal test running.

ok, i will work at this some more. obviously i could pare it down to just what we are using, but i am trying to preserve 'reasonable' functionality that might be handy elsewhere.

Right - don't go further than makes sense - even 53k -> 20k - I don't think it really matters that much. So really I meant, as small as makes reasonable sense.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

How do you want to improve them?

well for one, i test all the rewrite methods and boosts here. ok these are also now fixed as of 3.0 in core wildcard tests also (#3026), but those were two 'buglets', just an example.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Point taken - the tests are not perfect. They never are. But it doesn't stop us chugging along :) We can always write more tests and trunk tends to get quite a work out if you put changes in towards the beginning of a dev cycle. Bugs are inevitable, in trunk or contrib. But I don't think the wildcard impl will get much exposure in contrib anyway - it's not wired into the queryparser, and it won't come with a sign saying check this out. Users will still use the standard wildcardquery - and I want to see it improved. We can work out the patch, work out the tests, and then decide it's not good enough - or perhaps another committer will look and decide that. I'd still love to put it in core myself.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mark, ok. I will supply a new patch with no lib dependency, instead it includes the pared Automaton code in one pkg. this compiles to about a 48KB jar right now. Reducing it more would involve sacrificing readability or useful stuff.

Example: keeping the "Matcher" would be useful if you want to use this for a really fast 'PatternTokenizer', but not needed here.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attached is an alternate patch with no library dependency (LUCENE-1606_nodep.patch) instead it imports 'pared-down' automaton source code (compiles to 48KB jar) it is still setup in contrib regex because...

Mark: some practical questions, I'd like to create a patch that integrates it nicely into core, just so we can see what it would look like. Thoughts on class names and pkg names?

I assume we should nuke the old WildcardQuery, rename AutomatonWildcardQuery to WildcardQuery?
but then what should AutomatonRegexQuery be called, we already have RegexQuery :)

Thoughts on the automaton src code? Should I reformat to our style... (I did not do this). should we rename the pkg?

sorry the patch is monster, if it makes it any easier i could split the automaton library itself away from the lucene integration (queries, etc)? also, i did not remove any tests, for example, TestWildcardQuery already exists, so the test here is just duplication, i just might add a test or 2 to the existing TestWildcardQuery

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I assume we should nuke the old WildcardQuery, rename AutomatonWildcardQuery to WildcardQuery?

Yes - I think so - but how to handle the fact that you fall back to it? We might either rename it or incorporate it into the new WildcardQuery?

but then what should AutomatonRegexQuery be called, we already have RegexQuery?

Shouldn't this one eventually make the old obsolete? I say we name it RegexQuery.

Thoughts on the automaton src code? Should I reformat to our style... (I did not do this).

Yup - I think we should reformat and drop the author tags. We can mention that type of info in the NOTICE file.

should we rename the pkg?

I think so - perhaps util.brics? No need for dk I don't think.

sorry the patch is monster, if it makes it any easier i could split the automaton library itself away from the lucene integration (queries, etc)?

One large patch is fine with me - my IDE will make short work of grokking it :)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yes - I think so - but how to handle the fact that you fall back to it? We might either rename it or incorporate it into the new WildcardQuery?

We could just remove the .rewrite(). it is only that very special case, for leading *, where the existing WildcardQuery logic is slightly faster (< 5%). I was actually surprised the wildcardquery logic beats a DFA, i guess something to be said for that hairy logic :)

Shouldn't this one eventually make the old obsolete? I say we name it RegexQuery.

I do not know, all regex is not created equal. This one has different syntax and stuff from the other impl's. Any other ideas? Obviously the name RegexpQuery, with a p, is available

Yup - I think we should reformat and drop the author tags. We can mention that type of info in the NOTICE file.

ok, this is easy, i already have NOTICE in the patch. i was sure all files from brics have their license header also.

I think so - perhaps util.brics? No need for dk I don't think.

o.a.l.util.brics?

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

We could just remove the .rewrite(). it is only that very special case, for leading *, where the existing WildcardQuery logic is slightly faster (< 5%).

Agreed - not worth the extra code for speeding up such a horrible case by 5%.

Any other ideas?

I'd rather change the contrib names and let core have the good name :) We can start with RegexpQuery I think.

o.a.l.util.brics?

That's my best thought at the moment.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

OK we have the start of a plan, only one final nit I am worried about.

I pared away the 'built-in named automata':

example <Lu> (uppercase letter, from Unicode)
example <QName> (from XML)

if we keep the original pkg name, a user can have these just by adding brics.jar into their path. They would just pass new DatatypesAutomatonProvider() to the constructor of RegexpQuery, done.

if we rename the pkg, this will not work because the DatatypesAutomatonProvider from the jar file implements dk.brics.automaton.AutomatonProvider, not o.a.l.util.brics.AutomatonProvider.

alternatively, we could rename the pkg, but I could restore perhaps a subset of these datatypes, maybe without all the xml ones, just the basic unicode categories and stuff? This would cost a little space though. Here is the list: http://www.brics.dk/automaton/doc/dk/brics/automaton/Datatypes.html#get%28java.lang.String%29

i ask this question because personally i don't use any of these built-ins, but users might want them? what do you think?

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

On the one hand I'd say, well let's not rename the package then - it's not that important. But these things could get out of sync anyway, so I'm not sure it's worth it to try and maintain some sort of compatibility. If these features are useful enough, we could end up adding them later. Your call though. Personally, I'd think we start just by adding the essentials and build from there as makes sense.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mark, ok. In that case I will not include these, and rename the pkg as you suggest. These default named automata are not enabled by default in the library anyway if you use the RegExp() default constructor. the (renamed and pared) api is still extensible, if you want to create named automata to use in your regular expressions, you just implement the very simple interface AutomatonProvider. then you pass this to the constructor of RegexpQuery

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Okay - still not an issue I don't think - leading wildcards are already an issue - 5% is worth the other speedups I think

Mark, the old WildcardTermEnum is public, so we must keep it around for a while anyway. I can use it for this case, so we don't lose this 5% in the special case :)

Might be worth deprecating this old WildcardTermEnum still though, just because it's code to be maintained, hardly used except for this purpose.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mark, I think this patch is ok, all tests pass etc. Can you take a look and let me know your thoughts?

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Nice! Resulting jar is still just 1.0 MB. Looks great on a quick look through. I'll go over more thoroughly when I get a chance.

As far as testing, one of the simple things we can try is generating random wildcard strings against a large corpus and auto comparing the results of the old and the new.
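
A toy version of that idea, as a Java sketch: generate random wildcard patterns and random terms from a fixed seed, run two independent matchers, and count disagreements. The two matchers here (a plain recursive one and a `java.util.regex` translation) are stand-ins chosen for this example, not the old WildcardQuery and the automaton version, and the alphabet is kept to `a` and `b` so no regex escaping is needed.

```java
import java.util.Random;
import java.util.regex.Pattern;

final class RandomWildcardCheck {
    /** Reference matcher: straightforward recursion; '?' is one char, '*' is zero or more. */
    static boolean recursiveMatch(String p, int pi, String s, int si) {
        if (pi == p.length()) return si == s.length();
        char c = p.charAt(pi);
        if (c == '*') {
            for (int k = si; k <= s.length(); k++) {
                if (recursiveMatch(p, pi + 1, s, k)) return true;
            }
            return false;
        }
        return si < s.length()
                && (c == '?' || c == s.charAt(si))
                && recursiveMatch(p, pi + 1, s, si + 1);
    }

    /** Second implementation: translate the wildcard to a regex and let Pattern decide. */
    static boolean regexMatch(String p, String s) {
        return Pattern.matches(p.replace("?", ".").replace("*", ".*"), s);
    }

    static String randomString(Random r, String alphabet, int maxLen) {
        int len = r.nextInt(maxLen + 1);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            sb.append(alphabet.charAt(r.nextInt(alphabet.length())));
        }
        return sb.toString();
    }

    /** Runs the comparison and returns how many (pattern, term) pairs disagreed. */
    static int countMismatches(long seed, int rounds) {
        Random r = new Random(seed);
        int mismatches = 0;
        for (int i = 0; i < rounds; i++) {
            String pattern = randomString(r, "ab*?", 6);
            String term = randomString(r, "ab", 8);
            if (recursiveMatch(pattern, 0, term, 0) != regexMatch(pattern, term)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        System.out.println("mismatches: " + countMismatches(42L, 2000));
    }
}
```

Against a real index the same loop would compare the document sets returned by the two query implementations rather than per-term booleans.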

+1 on the automaton name for the util package.

I'd almost still prefer RegexQuery - it's contrib vs core and different packages - I hate to lose out on the better name. Though that's a bit subjective ;)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mark, thanks, let me know if you have the chance to look more thoroughly.

I agree, let's consider some ideas for testing wildcards, yours sounds good. One problem I had is trying to figure out: "what is the average/common case" for wildcards/regex :) It's important also when considering some additional optimizations i haven't yet implemented.

also, I think i might have some additional wildcards tests from the contrib patch. I left TestWildCard completely as-is for now tho, b/c i thought it would be nice to show it passes unchanged.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I like it, too, some thoughts:

- Maybe make AutomatonTermEnum public instead of package-private (it may be extended and be useful for custom subclasses, like a future FuzzyQuery that returns a difference()).
- The code in WildcardTermEnum is deprecated but still there, and therefore duplicates functionality. Maybe we could make this class a subclass of AutomatonTermEnum that simply initializes itself as a wildcard. The (deprecated) TermEnum no longer has a test, so we may need to add one.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

As far as testing, one of the simple things we can try is generating random wildcard strings against a large corpus and auto comparing the results of the old and the new.

An idea i have here is ORP-2 corpus, it has approx 417K unique terms and 160K docs. and its open, so anyone could participate. :)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, both your ideas are great. thank you for looking. I will take a stab at those.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, i looked at the WildcardTermEnum and it was easy to make it a subclass, with no logic, just a ctor.

  public WildcardTermEnum(IndexReader reader, Term term) throws IOException {
    super(WildcardQuery.toAutomaton(term), term, reader);
  }

The problem is that this hardly removes any duplicated code, because we must keep this available:

 public static final boolean wildcardEquals(String pattern, int patternIdx,
    String string, int stringIdx)
  {

This is where all the logic really is anyway. So I think i would rather leave this one alone? But I will add a test for it, to ensure it doesn't break since we are not using it. What do you think?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Uwe, i looked at the WildcardTermEnum and it was easy to make it a subclass, with no logic, just a ctor.

That was my idea!

This is where all the logic really is anyway.

We should simply add a test for this method and everything else is the WildCardEnum. The good thing of subclassing it is, that one has a more performant class if it uses common prefixes and so on than the version we currently have. The wildcardEquals method must stay, but it is not used, so explicitly mark it as "dead code".

The good thing: the method is final (this is what I see from your fragment) - so nobody was able to override it to change the behaviour of the enum, so nothing can break.

I would go this way.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We should simply add a test for this method and everything else is the WildCardEnum. The good thing of subclassing it is, that one has a more performant class if it uses common prefixes and so on than the version we currently have. The wildcardEquals method must stay, but it is not used, so explicitly mark it as "dead code".

if we do this, there are lots of cases where it will perform better, yes (virtually anything involving ? operator) but if we do this, there are also some cases where it won't perform quite as well, really bad wildcards where it is better to just do linear scan than skip around many many times. This is why i have detection for these cases, in the getEnum() instead I return "DumbTermEnum" aka LinearTermEnum in AutomatonQuery. if you think this is no problem, we can subclass it anyway. excerpt below:

    /*
     * If the DFA has a leading kleene star, or something similar, it will
     * need to run against the entire term dictionary. In this case its much
     * better to do just that than to use fancy enumeration.
     * 
     * this heuristic looks for an initial loop, with a range of at least 1/3
     * of the unicode BMP.
     */
    State state = automaton.getInitialState();
    for (Transition transition : state.getTransitions())
      if (transition.getDest() == state
          && (transition.getMax() - transition.getMin()) > (Character.MAX_VALUE / 3))
        return new LinearTermEnum(reader);

    return new AutomatonTermEnum(automaton, term, reader);
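
For anyone who wants to try the heuristic in isolation, here is a standalone Java sketch of the same check. The `Transition` class below is a hypothetical stand-in with just the three fields the excerpt reads (min, max, destination); it is not the BRICS class, and states are plain ints.

```java
import java.util.List;

final class LinearScanHeuristic {
    /** Stand-in for a DFA transition: chars in [min, max] lead to state dest. */
    static final class Transition {
        final char min;
        final char max;
        final int dest;

        Transition(char min, char max, int dest) {
            this.min = min;
            this.max = max;
            this.dest = dest;
        }
    }

    /** True if the initial state loops back to itself over more than 1/3 of the BMP. */
    static boolean preferLinearScan(int initialState, List<Transition> fromInitial) {
        for (Transition t : fromInitial) {
            if (t.dest == initialState && (t.max - t.min) > (Character.MAX_VALUE / 3)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Initial state of a DFA for .*abacadaba: everything except 'a' loops back to state 0.
        List<Transition> leadingStar = List.of(
                new Transition((char) 0, '`', 0),
                new Transition('a', 'a', 1),
                new Transition('b', Character.MAX_VALUE, 0));
        // Initial state of a DFA for [a-z].*abacadaba: one narrow transition to state 1.
        List<Transition> selective = List.of(new Transition('a', 'z', 1));
        System.out.println(preferLinearScan(0, leadingStar));
        System.out.println(preferLinearScan(0, selective));
    }
}
```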

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

by the way Uwe, I do not particularly like how this AutomatonQuery "decides to use smart or dumb termenum" in getEnum() works. I wish instead the AutomatonTermEnum would always be fast, instead of relying on the query to decide. I think this would be cleaner, and make subclassing easier.

but on the other hand, having these two separate, it makes things easy to understand, as the two methods work in two completely different ways. i wonder if you have any ideas on this.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

see #3151, why it is not so fast (the setEnum calls use seeking and this is not optimized by the TermCache). Yonik has pointed us to that.

If the dumb enumeration were included inside AutomatonTermEnum, one could use it without thinking. I would like to move your posted code into AutomatonTermEnum and have two modes, dumb and intelligent. This would need an if switch on each next() call and a delegation to super.next(). That would make the enum ugly... But would work. So just fold the LinearTermEnum into it and make a switch: if (linearMode) return super.next(); But you have to remove the assert inside endEnum() and change it. In the intelligent case, the endEnum method is never called (because super.next() is never called). So the assert must be assert linearMode; termCompare looks identical in both enums, for the intelligent case the commonPrefix is "".

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

That would make the enum ugly... But would work.

This is why i did not do it (i tried and it was ugly), i did not want to make a complicated enum ugly! I'll try to think of how this can be done without it being so ugly.

edit, by the way Uwe, if you are bored and want to take a stab at this :) You know your way around multitermquery better than me.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Hi Robert,

here is my patch. The WildCard and RegExp test queries still pass. I also added a test for the deprecated TermEnum (just a simple MTQ that returns it is used, and it should produce the same results as WildcardQuery).

The AutomatonTermEnum now switches between smart(R) and non-smart mode using your detection algorithm. termCompare now handles both cases. next() just calls super in the linear case (so it behaves like a normal FilteredTermEnum) and uses the smart(R) code in all other cases.

I will go to bed now, tell me if you like it.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, thank you. This is much nicer!

I think now it will be easier for some subclass to extend this enum, for example to override difference() or whatever for fuzzy.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I'm not following this very closely, but, it looks really really cool!

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Some cleanups and a more consistent endEnum handling. Also added Javadocs explaining smart and linear mode.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Again some updates, moved the '*' and '?' constants also to WildcardQuery and use them in switch. Also added a better deprecation message to WildcardTermEnum.
