apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Automaton Query/Filter (scalable regex) [LUCENE-1606] #2680

Closed asfimport closed 14 years ago

asfimport commented 15 years ago

Attached is a patch for an AutomatonQuery/Filter (name can change if it's not suitable).

Whereas the out-of-the-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow: 2 minutes, etc. Additionally, all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend on a constant prefix, and runs the same query in 640ms.

Some use cases I envision:

  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// or ftp://)

The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:

 The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:

 1. Look at the portion that is OK (did not enter a reject state in the DFA)
 2. Generate the next possible String and seek to that.

The Query simply wraps the Filter with ConstantScoreQuery.
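
The enumeration strategy above can be sketched in a few lines. This is a hypothetical, self-contained illustration (not code from the patch): the DFA is hard-coded for the pattern "[dl]og?", a TreeSet stands in for Lucene's sorted term dictionary, and the "next string" computation is deliberately simplified to bumping the first rejected character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

public class AutomatonEnumSketch {
    static final int REJECT = -1;

    // Toy DFA for "[dl]og?": accepts do, dog, lo, log.
    static int step(int state, char c) {
        switch (state) {
            case 0: return (c == 'd' || c == 'l') ? 1 : REJECT;
            case 1: return c == 'o' ? 2 : REJECT;
            case 2: return c == 'g' ? 3 : REJECT;
            default: return REJECT;
        }
    }

    static boolean accepts(String s) {
        int state = 0;
        for (int i = 0; i < s.length() && state != REJECT; i++) state = step(state, s.charAt(i));
        return state == 2 || state == 3;
    }

    // Length of the longest prefix that never enters the reject state.
    static int okPrefixLen(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            state = step(state, s.charAt(i));
            if (state == REJECT) return i;
        }
        return s.length();
    }

    // Enumerate matching terms, seeking past rejected regions instead of
    // testing every term (a crude stand-in for the real nextString() logic).
    static List<String> scan(TreeSet<String> terms) {
        List<String> hits = new ArrayList<>();
        String t = terms.isEmpty() ? null : terms.first();
        while (t != null) {
            if (accepts(t)) {
                hits.add(t);
                t = terms.higher(t); // match: keep reading sequentially
            } else {
                int ok = okPrefixLen(t);
                // Seek to the smallest term after the rejected portion: keep the
                // OK prefix and bump the first bad character. (Simplified: the
                // real patch derives the true next accepted string from the DFA.)
                String seekTo = ok < t.length()
                        ? t.substring(0, ok) + (char) (t.charAt(ok) + 1)
                        : t + '\u0000';
                t = terms.ceiling(seekTo);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
                "cat", "do", "dodge", "dog", "dot", "lo", "log", "zebra"));
        System.out.println(scan(terms)); // [do, dog, lo, log]
    }
}
```

On the toy dictionary in main, rejected terms such as "cat", "dodge", and "dot" trigger a seek to the next plausible term rather than a per-term accept/reject pass over the whole dictionary.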

I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.


Migrated from LUCENE-1606 by Robert Muir (@rmuir), resolved Dec 09 2009 Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606_nodep.patch, LUCENE-1606.patch (versions: 15), LUCENE-1606-flex.patch (versions: 12) Linked issues:

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, look out for that ø in the NOTICE.txt... I will fix it, thanks for your cleanups :)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Fix the ø in NOTICE, clean up some unused imports, etc.

Now that Uwe fixed a performance bug in the dumb enum (it would never set endEnum=true, only false), I will dig up my old performance tests and see how we are looking.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I reran my tests; Uwe's fix removes the '5%' problem I mentioned before for leading *. Now WildcardQuery is always faster than before (previously it was comparing terms from another field due to the endEnum bug).

This makes sense, because RunAutomaton.run() is just array access, instead of all the conditionals/branching in the old wildcardEquals. But I could not figure out, for the life of me, why it was slower before!

I will create a better benchmark now that generates lots of random numeric wildcards with lots of patterns, and post the results and code.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

This patch fixes a bug I introduced when I removed recursion. The wildcard tests do not detect it... told you I didn't trust them :)

I will add a test for this, although it was an obvious mistake on my part.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Attached is a benchmark that generates random wildcard queries. It builds an index of 10 million docs, each with a term from 0 to 10 million. It fills a pattern such as N?N?N?N? with random digits, substituting a random digit for each N.

Pattern   Iter   AvgHits     AvgMS (old)   AvgMS (new)
N?N?N?N   10     1000.0      288.6         38.5
?NNNNNN   10     10.0        2453.1        6.4
??NNNNN   10     100.0       2484.2        10.1
???NNNN   10     1000.0      2821.3        47.8
????NNN   10     10000.0     2346.9        299.8
NN??NNN   10     100.0       34.8          6.3
NN?N*     10     10000.0     26.5          9.4
?NN*      10     100000.0    2009.0        73.5
*N        10     1000000.0   6837.4        6087.9
NNNNN??   10     100.0       1.9           2.3

I would like to incorporate part of this logic into the JUnit tests, maybe on a smaller index, because it's how I found the recursion bug.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Those are impressive gains!

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Mike, it is not that impressive really, until you look at regex performance :)

The current regexp implementations will scan the entire term dictionary for an expression like "[dl]og?", because there is no 'constant prefix'. The idea here is that Lucene should be smart enough to look for do, dog, lo, and log.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

You are the man Robert. This is going to be great.

Next on my wish list is getting the scalable fuzzy done :) We should start a new issue for that, seeding it with the info you have here. If you don't get to it, I'll be happy to.

Still on my list to help with review on this patch too. Thanks Uwe as well! Love seeing this stuff make its way into core.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mark, yeah, let's create a separate issue for fuzzy. I found that someone implemented that algorithm in Python or some other language; we should look at / contact them to see what they did.

Currently, I am trying to check issue #3151, to see if this caching will help cases where the enum must seek a lot, for example the pattern ????NNN. It is still better than the current wildcard case, but you can see it gets a lot worse when there are many more seeks.

I think, though, this means I have to cut over to the new FilteredTermsEnum API for the flex branch... which looks interesting, btw, but this is a complicated enum.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mark, one last comment. I want to mention that this impl is largely unoptimized (from a code perspective; the algorithm is better). I think you can see that from NNNNN?? being 2.3ms on average versus 1.9, though I am not sure that isn't just a random hiccup.

So I want to incorporate some of this benchmark's logic into the tests, so that we can improve the actual code impl to speed up cases like that. While I focus on the scalability, I know a lot of people have small indexes and maybe lots of QPS, and I don't want to slow them down.

Some of this is easy; for example, we can make State.getSortedTransitionArray public, so we don't have to convert from arrays to lists to arrays and such, for no good reason.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Are we going to deprecate contrib/regex with this?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Are we going to deprecate contrib/regex with this?

I would argue against that, only because the other regex impls have different features and syntax, even if they are slow.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Are we going to deprecate contrib/regex with this?

I would argue against that, only because the other regex impls have different features and syntax, even if they are slow.

Ahh OK, I agree then. I didn't realize this new query doesn't subsume contrib's. Would be good to call out what's different in the javadocs somewhere...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Would be good to call out what's different in the javadocs somewhere...

Good idea, let me know if you have some suggested wording. It's really a general issue: the supported features and syntax of even the existing regex implementations in contrib are different, I think? (i.e. they are not compatible: you cannot just swap impls around without testing that the new impl supports the syntax and features you are using)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I don't have any wording – I don't really know the differences :)

If it's "only" that the syntax is different, that's one thing... but if eg certain functionality isn't possible w/ new or old, that's another.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

If it's "only" that the syntax is different, that's one thing... but if eg certain functionality isn't possible w/ new or old, that's another.

From a glance, it appears to me that both the syntax and functionality of our two contrib impls (java.util and Jakarta) are very different.

Here is one example: java.util supports reluctant {m,n} closures; Jakarta does not, and says so right in its javadocs (http://jakarta.apache.org/regexp/apidocs/org/apache/regexp/RE.html): "Should RE support reluctant {m,n} closures (does anyone care)?" But it does support reluctant versus greedy for other operators.

In automaton, this concept of reluctant versus greedy does not even exist, as spelled out in their FAQ (http://www.brics.dk/automaton/faq.html): "The * operator is mathematically the Kleene star operator (i.e. we don't have greedy/reluctant/possessive variants)."

This is an example where all 3 are different... I guess I kinda assumed everyone was aware that all these regex packages are very different.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We call this out nicely in the current RegexQuery: "The expressions supported depend on the regular expression implementation used by way of the RegexCapabilities interface."

What should I say for the automaton implementation? It already has a javadoc link to the precise syntax supported, so in my opinion it's actually less ambiguous than contrib RegexQuery.

But maybe improve this. Instead of:

The supported syntax is documented in the {`@link` RegExp} class.

maybe:

The supported syntax is documented in the {`@link` RegExp} class.
warning: this might not be the syntax you are used to!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

OK, that warning seems good. Maybe also reference contrib/regex as another alternative, noting that syntax/capabilities are different?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

First cut @ cutting over to flex API attached – note that this applies to the flex branch, not trunk!

I made some small changes to the benchmarker: use constant score filter mode, and print the min (not avg) time (less noise).

Also, I ported the AutomatonTermEnum to the flex API, so this is now a better measure ("flex on flex") of what future perf will be. It's possible there's a bug here, though TestWildcard passes.

I still need to investigate why "non-flex on non-flex" and "non-flex on flex" perform worse.

I ran like this:

java -server -Xmx1g -Xms1g BenchWildcard

Java is 1.6.0_14, 64-bit, on OpenSolaris.

Results (msec is min of 10 runs each):

Pattern   Trunk (min msec)   Flex (min msec)
N?N?N?N   13                 18
?NNNNNN   1                  3
??NNNNN   4                  6
???NNNN   23                 28
????NNN   210                170
NN??NNN   3                  3
NN?N*     7                  4
?NN*      62                 30
*N        4332               2576
NNNNN??   1                  1

Looks like flex API is faster for the slow queries. Once I fix caching on trunk we should retest...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Looks like flex API is faster for the slow queries. Once I fix caching on trunk we should retest...

Mike, this is cool. I like the results. It appears, tentatively, that the flex API is faster for both "dumb" (brute-force linear reading) and "fast" (lots of seeking) modes, at least looking at ????NNN and *N, which are the worst cases of each here. So it would seem it's faster in every case.

I'll look at what you did to port this to the TermsEnum api!

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mike, I think your port to TermsEnum is correct, and it's definitely faster here.

One question: is it possible to speed this up further by using UnicodeUtil/char[] conversion from TermRef instead of String? Because it's trivial to use char[] with the Automaton API (even though that is not exposed, it's no problem).

I used String only because of the old TermEnum API.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

One question: is it possible to speed this up further by using UnicodeUtil/char[] conversion from TermRef instead of String? Because it's trivial to use char[] with the Automaton API (even though that is not exposed, it's no problem).

I used String only because of the old TermEnum API.

Oh, that'd be great! It would be faster. I was bummed at how many new String()'s I was doing...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Oh, that'd be great! It would be faster. I was bummed at how many new String()'s I was doing...

It would be nice, I think, if TermRef provided a helper method to make the char[] available? I.e., I don't think I should do Unicode conversion in a MultiTermQuery?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

It would be nice, I think, if TermRef provided a helper method to make the char[] available?

I agree... though, this requires state (UnicodeUtil.UTF16Result). We could lazily set such state on the TermRef, but that's making TermRef kinda heavy (it's nice and lightweight now). Hmmm.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I agree... though, this requires state (UnicodeUtil.UTF16Result). We could lazily set such state on the TermRef, but that's making TermRef kinda heavy (it's nice and lightweight now). Hmmm.

I guess the state could be in the TermsEnum, but that doesn't make for general use of TermRef. What else uses TermRef?

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Besides TermsEnum.. TermRef is used by the terms dict, when doing the binary search + scan to find a term. And also by TermsConsumer (implemented by the codec, and used when writing a segment to the index).

Maybe MTQ holds the state, or FilteredTermsEnum? Other consumers of TermsEnum don't need to convert to char[].

We can discuss this under a new, separate "optimization" issue for MTQs?

Also, remember that the current API is doing not only new String() but also new Term() when it enums the terms, so having to do new String() for MTQs on flex API is OK for starters.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

We can discuss this under a new, separate "optimization" issue for MTQs?

is there a jira issue for this??

Also, remember that the current API is doing not only new String() but also new Term() when it enums the terms, so having to do new String() for MTQs on flex API is OK for starters.

Oh yeah, it's clear from the benchmarks that the flex API is already better. I am just trying to think of ways to make it both faster and, at the same time, easy too.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

is there a jira issue for this??

I thought you were about to open one!

I am just trying to think of ways to make it both faster, at the same time easy too.

Which is great: keep it up!

Actually... wouldn't we need to convert to int[] (for Unicode 4) not char[], to be most convenient for "higher up" APIs like automaton? If we did char[] you'd still have to handle surrogate processing (and then it's not unlike doing byte[]).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I thought you were about to open one!

I opened one for Automaton specifically; should I change it to cover all MTQs?

Actually... wouldn't we need to convert to int[] (for Unicode 4) not char[], to be most convenient for "higher up" APIs like automaton? If we did char[] you'd still have to handle surrogate processing (and then it's not unlike doing byte[]).

Nope, because Unicode and Java are optimized for UTF-16, not UTF-32. So we should use char[], but use the codePoint APIs, which are designed so that you can process text in UTF-16 (char[]) efficiently, yet also handle the rare case of supplementary characters. char[] is correct; it's just that we have to be careful to use the right APIs for processing it. With String, a lot of APIs such as String.toLowerCase do this automatically for you, so most applications have no issues.
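
As a small illustration of that (a hypothetical helper, not code from the patch), the java.lang.Character code point APIs let you walk a char[] correctly even when it contains surrogate pairs:

```java
public class CodePointCount {
    // Walk UTF-16 code units but advance by code point, so a surrogate
    // pair (one supplementary character) is counted once, not twice.
    static int countCodePoints(char[] text) {
        int n = 0;
        for (int i = 0; i < text.length; ) {
            int cp = Character.codePointAt(text, i);
            i += Character.charCount(cp); // 2 for supplementary, 1 otherwise
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        char[] bmp = "dog".toCharArray();             // 3 chars, 3 code points
        char[] supp = "a\uD800\uDC00b".toCharArray(); // U+10000 as a surrogate pair
        System.out.println(countCodePoints(bmp));     // 3
        System.out.println(countCodePoints(supp));    // 3 (but supp.length == 4)
    }
}
```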

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Actually... wouldn't we need to convert to int[] (for Unicode 4) not char[], to be most convenient for "higher up" APIs like automaton? If we did char[] you'd still have to handle surrogate processing (and then it's not unlike doing byte[]).

I wanted to make another comment here. I agree that this is somewhat like byte[]. But there are some major differences:

  1. the Java API provides mechanisms in Character, etc. for processing text this way.
  2. lots of stuff is unaffected. For example, .startsWith() is not broken for supplementary characters: it does not have to use code points anywhere; it can just compare chars (some of which are surrogates), and this is OK. So lots of char[]-based processing is already compatible, and completely unaware of this issue. This is not true for byte[].
  3. it will perform the best overall; it's only needed in very few places, and we can be very careful where we add these checks, so we don't slow anything down or increase RAM usage, etc.

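
Point 2 can be demonstrated with a plain code-unit prefix check (an illustrative sketch, not patch code): surrogates compare like any other chars, so the comparison needs no code point awareness. Here U+10400 is written as the surrogate pair \uD801\uDC00:

```java
public class SurrogatePrefix {
    // Plain char-by-char comparison; no code point logic required.
    static boolean startsWith(char[] text, char[] prefix) {
        if (prefix.length > text.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (text[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A supplementary character, stored as a high+low surrogate pair.
        char[] text = "a\uD801\uDC00z".toCharArray();
        char[] prefix = "a\uD801\uDC00".toCharArray();
        System.out.println(startsWith(text, prefix)); // true
    }
}
```
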
asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

I opened one for Automaton specifically; should I change it to cover all MTQs?

Oh, sorry no, just automaton.

Nope, because Unicode and Java are optimized for UTF-16, not UTF-32.

OK char[] it is!

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mike, here is an update to your flex patch. I restored two tests that disappeared (TestRegexp, etc). Also, I converted the enum to use char[] as an experiment; I posted the results on #3166. This is just a hack: it stores the UTF16Result in AutomatonEnum. I figured I would pass it back, just in case you wanted to play more.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Mike, here is an update to your flex patch. I restored two tests that disappeared (TestRegexp, etc).

Woops, thanks! Need svn patch, badly...

Also, I converted the enum to use char[] as an experiment; I posted the results on LUCENE-2090. This is just a hack: it stores the UTF16Result in AutomatonEnum. I figured I would pass it back, just in case you wanted to play more.

Wow, not "new String()"ing all over gave a sizable gain on the full linear scan query...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Michael, the problem is that this code (automaton itself), like much other code, is unaware of supplementary characters. It uses a symbolic interval range of 'char' for state transitions. But this is OK! When matching an input string with supplementary characters, things work just fine.

This is one reason why I am concerned about the change to byte[] in the flex branch. I would have to rewrite this DFA library for this enum to work!

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mike, just one comment here.

I am definitely willing to do refactoring here to support this byte[] scheme if necessary; I don't want to give the wrong impression. I think I already have an issue here related to UTF-16 binary order vs UTF-8 binary order that I need to fix, although I think this is just a matter of writing a Comparator.

Edit: pretty sure this exists. If someone has, say, both data from the Arabic Presentation Forms block and Chinese text outside the BMP in the index, the "smart" enumerator will unknowingly skip right past the Arabic Presentation Forms block, because it sorts after the lead surrogate in UTF-16 order, but before the entire codepoint in UTF-8/UTF-32 order. I have not experienced this in practice, because I normalize my text so I don't have stuff in the Arabic Presentation Forms block :) I can fix this, but I would like to see what the approach is for the flex branch, as it's sufficiently complex that I would rather not fix it twice.

I am just concerned about other similar applications outside of Lucene, or some already in Lucene core itself!

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think I have a workaround for this enum that will not hurt performance either. There are two problems: one exists with the current API, and one will become a problem with the flex API if we move to byte[] TermRef, which, from the performance numbers, it seems we almost certainly should.

I'll fix these problems, by providing a new "codepoint-order" comparator for transitions behind the scenes in automaton, along with a getSortedTransitionsCodepointOrder() or something similar to make the whole thing work.

It might seem at a glance that using 'int' (UTF-32 intervals) instead is a better fix, but this is not true, because it would cause a RunAutomaton to use 1MB of memory where it currently uses 64KB, only for these stupid rare cases.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I spent a while with this, thinking I would be slick and create a version of Automaton that works correctly with both trunk and the flex branch. Finally, I figured out that this is not possible.

There is no bug with the current version, because in trunk, IndexReader.terms() uses UTF-16 binary order. In the flex branch, it uses UTF-8 binary order.

I can emulate UTF-8 binary order in the enum, but then it won't work correctly on trunk, though it will work on the flex branch!

This enum is sensitive to the order of terms coming in...

doh!

asfimport commented 14 years ago

Yonik Seeley (@yonik) (migrated from JIRA)

geeze... maybe we should have just stuck with CESU-8 ;-)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yonik, maybe we can use this trick?

UTF-8 in UTF-16 Order

The following comparison function for UTF-8 yields the same results as UTF-16 binary comparison. In the code, notice that it is necessary to do extra work only once per string, not once per byte. That work can consist of simply remapping through a small array; there are no extra conditional branches that could slow down the processing.

int strcmp8like16(unsigned char* a, unsigned char* b) {
  while (true) {
    int ac = *a++;
    int bc = *b++;
    if (ac != bc) return rotate[ac] - rotate[bc];
    if (ac == 0) return 0;
  }
}

static char rotate[256] =
  {0x00, ..., 0x0F,
   0x10, ..., 0x1F,
   ...
   0xD0, ..., 0xDF,
   0xE0, ..., 0xED, 0xF0, 0xF1,
   0xF2, 0xF3, 0xF4, 0xEE, 0xEF, 0xF5, ..., 0xFF};

The rotate array is formed by taking an array of 256 bytes from 0x00 to 0xFF, and rotating 0xEE and 0xEF to a position after the bytes 0xF0..0xF4. These rotated values are shown in boldface. When this rotation is performed on the initial bytes of UTF-8, it has the effect of making code points U+10000..U+10FFFF sort below U+E000..U+FFFF, thus mimicking the ordering of UTF-16.
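
Here is a runnable Java sketch of that trick (my own illustration, not code from the patch). The key table below is built as the inverse permutation of the rotated byte sequence, i.e. byte value to sort key, so that comparing UTF-8 bytes through it agrees with UTF-16 binary order:

```java
import java.nio.charset.StandardCharsets;

public class Utf8InUtf16Order {
    static final int[] KEY = new int[256];
    static {
        // Desired byte order: 0x00..0xED, then 0xF0..0xF4 (supplementary-plane
        // lead bytes), then 0xEE, 0xEF (lead bytes of U+E000..U+FFFF), then 0xF5..0xFF.
        int[] order = new int[256];
        int p = 0;
        for (int b = 0x00; b <= 0xED; b++) order[p++] = b;
        for (int b = 0xF0; b <= 0xF4; b++) order[p++] = b;
        order[p++] = 0xEE;
        order[p++] = 0xEF;
        for (int b = 0xF5; b <= 0xFF; b++) order[p++] = b;
        // KEY is the inverse permutation: byte value -> sort key.
        for (int i = 0; i < 256; i++) KEY[order[i]] = i;
    }

    static int compare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int ac = a[i] & 0xFF, bc = b[i] & 0xFF;
            if (ac != bc) return KEY[ac] - KEY[bc];
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        String bmp = "\uE000";                                // U+E000
        String supp = new String(Character.toChars(0x10000)); // U+10000 (surrogate pair)
        // UTF-16 binary order: the supplementary character sorts BELOW U+E000.
        System.out.println(bmp.compareTo(supp) > 0);          // true
        // Plain UTF-8 byte order disagrees (lead 0xEE < lead 0xF0), but the
        // remapped comparison restores UTF-16 order:
        byte[] a = bmp.getBytes(StandardCharsets.UTF_8);
        byte[] b = supp.getBytes(StandardCharsets.UTF_8);
        System.out.println(compare(a, b) > 0);                // true
    }
}
```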

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

updated patch:

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Sorry, my IDE added an @author tag. I need to see where to turn this @author generation off in Eclipse.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

what is UTF-38? :-) I think you mean UTF-32, if such exists.

Else it looks good!

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think there is one last problem with this for the flex branch, where you have abacadaba\uFFFC, abacadaba\uFFFD, and abacadaba\uFFFE in the term dictionary, but a regex that matches, say, abacadaba[\uFFFC\uFFFF]. In this case, the match on abacadaba\uFFFD will fail; it will try to seek to the "next" string, which is abacadaba\uFFFF, but the FFFF will get replaced by FFFD by the byte conversion, and we will loop.

Mike, I don't think this should be any back-compat concern, unlike the high surrogate case, which I think many CJK applications are probably doing too...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, where do you see UTF-38 :)

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Uwe, where do you see UTF-38

Patch line 6025.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

About the cleanupPrefix method: it is only used in the linear case to initially set the term enum. What happens if the nextString() method returns such a string used to seek the next enum?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

About the cleanupPrefix method: it is only used in the linear case to initially set the term enum. What happens if the nextString() method returns such a string used to seek the next enum?

Look at the code of nextString() itself: it uses cleanupPosition(), which works differently.

When seeking, we can append \uDC00 to achieve the same thing as seeking to a high surrogate. When using a prefix, we have to truncate the high surrogate, because we cannot use it with TermRef.startsWith() etc.: it cannot be converted into UTF-8 bytes. (And we can't use the \uDC00 trick there, obviously, or startsWith() would return false when it should not.)
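
The "cannot be converted into UTF-8 bytes" point is easy to see (an illustrative snippet, not patch code): Java's UTF-8 encoder substitutes its replacement byte for an unpaired high surrogate, while appending \uDC00 completes the pair and produces a valid 4-byte sequence:

```java
import java.nio.charset.StandardCharsets;

public class UnpairedSurrogate {
    public static void main(String[] args) {
        // An unpaired high surrogate is malformed UTF-16; String.getBytes
        // replaces it with the UTF-8 encoder's default replacement, '?' (0x3F).
        byte[] truncated = "ab\uD801".getBytes(StandardCharsets.UTF_8);
        System.out.println(truncated.length); // 3: 'a', 'b', '?'
        System.out.println(truncated[2]);     // 63 (0x3F)

        // Appending the low surrogate completes the pair (U+10400), which
        // encodes as a valid 4-byte UTF-8 sequence.
        byte[] paired = "ab\uD801\uDC00".getBytes(StandardCharsets.UTF_8);
        System.out.println(paired.length);    // 6: 'a', 'b', plus 4 bytes
    }
}
```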

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Patch line 6025.

Thanks for reviewing the patch and catching this. I'm working on finalizing it. It already works fine for trunk, but I don't want it to suddenly break with the flex branch, so I'm adding a lot of tests and improvements in that regard. The current wildcard tests aren't sufficient anyway to tell if it's really working. Also, when Mike ported it to the flex branch, he reorganized some code in a way that I think is better, so I want to tie that in too.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Did he change the FilteredTermEnum.next() loops? If yes, maybe the better approach also works for NRQ. I am just interested, but have had no time to thoroughly look into the latest changes.

I am still thinking about an extension of FilteredTermEnum that handles this repositioning out of the box, but I have no good idea yet. The work in FilteredTerm*s*Enum is a good start, but it may be extended to also support something like a return value "JUMP_TO_NEXT_ENUM" and an abstract method "nextEnum()" that returns null by default (no further enum).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

No, the main thing he did here that I like better is that instead of caching the last comparison in termCompare(), he uses a boolean 'first'.

This still gives the optimization of "don't seek in the term dictionary unless you get a mismatch; as long as you have matches, read sequentially". But in my opinion, it's cleaner.