Automaton Query/Filter (scalable regex) [LUCENE-1606]

asfimport commented 15 years ago

Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).

Whereas the out-of-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow, 2 minutes, etc. Additionally all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon constant prefix, and runs the same query in 640ms.

Some use cases I envision:

lexicography/etc on large text corpora
looking for things such as urls where the prefix is not constant (http:// or ftp://)

The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:

 The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:

 1. Look at the portion that is OK (did not enter a reject state in the DFA)
 2. Generate the next possible String and seek to that.

the Query simply wraps the filter with ConstantScoreQuery.

I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.

Migrated from LUCENE-1606 by Robert Muir (@rmuir), resolved Dec 09 2009 Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606_nodep.patch, LUCENE-1606.patch (versions: 15), LUCENE-1606-flex.patch (versions: 12) Linked issues:

3186
- 3187
- 3166

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

OK, so doesn't affect NRQ, as it uses a different algo

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Yeah, but in general I think I already agree that FilteredTerm*s*Enum is easier for stuff like this.

Either way its still tricky to make enums like this, so I am glad you are looking into it.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I think the approach with nextEnum() would work for Automaton and NRQ, because both use this iteration approach. You have nextString() for repositioning, and I have a LinkedList (a stack) of pre-sorted range bounds.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

And I could still use this with "dumb mode"?, just one enum, right?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

yes.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

this patch removes constant prefix, as its only used in dumb mode, and in dumb mode there is no constant prefix. instead its replaced with constant suffix, which speeds up comparisons.

constant suffix is implemented naively as reversing the DFA, taking its constant prefix, then reversing that.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Responding from #3151...

So I guess if there's a non-empty common suffix you should just always seek?

the suffix is just for quick comparison, not used at all in seeking.

I think I'm confused – if the query is ???1234, the common suffix is 1234, and so shouldn't the DFA tell us the next XXX1234 term to try to seek to (and we should never use next() on the enum)?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think I'm confused - if the query is ???1234, the common suffix is 1234, and so shouldn't the DFA tell us the next XXX1234 term to try to seek to (and we should never use next() on the enum)?

not really the suffix, but more general, the structure of the dfa itself tells us that for ???1234, if you are evaluating 5551234, that the next term should really be 5561234 you can tell this by basically 'walking the graph'

but this knowledge is not always available, sometimes we only have 'partial' knowledge of where to go next. The culprit here is the * operator, because it eats up anything :) So when you walk the graph, the * operator is this giant monster in your way.

so sometimes, depending on the dfa (which depends on the wildcard or regex used to construct it), automaton knows enough to seek to all the exact locations, if its finite (like ???1234)

sometimes, it only knows partial information. when a loop is encountered walking the graph (the giant monster), it has to stop and only use the path information it knows so far. for example a wildcard of abcd*1234, the best it can do is seek to abcd.

your description of ping-pong is right, this is how these situations are handled.

right now, the way the enum works, is i don't even bother seeking until i hit a mismatch. you can see this in the comments 'as long as there is a match, keep reading. this is an optimization when many terms are matched sequentially like ab*'

i tested this along time ago, perhaps we should re-test to see if its appropriate?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

mike, here is a more complex example of the ping-pong:

a wildcard of abcd*12?4

it has to seek to abcd first because of the * (the loop stops me) the ping-pong from the term dictionary, returns say abcdk1000 it moves me past the giant monster. now i know enough to seek to abcdk12\u00004

in this case its nice, the TermEnum moved me past it in one seek. sometimes its not so nice, if the TermEnum gave me abcda, i'm not past the monster.

all i can do is generate the next possible term that won't put me into a DFA reject state, which is abcda\u0000... forcing the enum to move me forwards. maybe i seek to abcda\u0000, and it gives me abcdaa back, ill do the same thing again.

the reason is, somewhere down the line there could be abcdaaaaaaaaaaaaaaaaaaaaaaaaaa1234 :)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

But, take the abcd*1234 case – you first seek to abcd, the terms enum finds (say) abcdX1234 – don't you (DFA) know at this point that next possible candidate is abcdY1234? Ie, you should seek to that term? (Doing next() at that point is most likely a waste, and anyway the enum will turn your seek into a next if it's "close").

That said, seeking on trunk is alot more costly than seeking on flex, because trunk has to make a new [cloned] SegmentTermEnum for each seek.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But, take the abcd*1234 case - you first seek to abcd, the terms enum finds (say) abcdX1234 - don't you (DFA) know at this point that next possible candidate is abcdY1234? Ie, you should seek to that term? (Doing next() at that point is most likely a waste, and anyway the enum will turn your seek into a next if it's "close").

pretty sure I know this information, but in general i only seek on mismatches, I think for the reason that there can be a lot of seeks for AB* (say 1 million terms). so instead i wait for a mismatch until i seek again, I think tests showed this due to what you mentioned below?

That said, seeking on trunk is alot more costly than seeking on flex, because trunk has to make a new [cloned] SegmentTermEnum for each seek.

I think that might be whats killing me, when i ran tests on lucene 2.9 or whatever. We should retest performance on flex.

This is why i said, significant rework of this maybe should take place in flex (although I still think this is an improvement for trunk already), to fully take advantage of it.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I guess here is the big question Mike, pretend ab* isn't rewritten to a prefixquery (it is, but there are more complex examples like this that cannot be)

is it faster to seek 1M times and get the 1M terms, or just read them sequentially? furthermore to "seek" is not just lucene seek, I have to walk the transitions and compute the next place to go... (and create a few objects along the way)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

pretend ab* isn't rewritten to a prefixquery (it is, but there are more complex examples like this that cannot be) is it faster to seek 1M times and get the 1M terms, or just read them sequentially?

This case should definitely be done sequentially.

But the fixed trailing prefix case I think should be all seeks.

That said, seeking on trunk is alot more costly than seeking on flex, because trunk has to make a new [cloned] SegmentTermEnum for each seek.

I think that might be whats killing me, when i ran tests on lucene 2.9 or whatever. We should retest performance on flex.

Well, the seeks need to be done anyway... so you can't work around that. The only question is if a wasted next() was done before each, I guess...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But the fixed trailing prefix case I think should be all seeks.

I'll use regular expressions here, just to elaborate on this.

what if its ab.[ab]? but the ab.[a-z]? ... where do you draw the line :)

its also worth mentioning that the automaton "seek", nextString() is a lot more heavyweight right now than its "compare", which is extremely fast. its the DFA in tableized (array) form, just as an array lookup. This is why it beats even the hairy wildcard code we had before, after Uwe fixed my bug of course :)

I think there are heuristics like you say we can do, and there's a lot of knowledge in the DFA we can use to implement these for optimal behavior. I think we can also improve the code itself, so that "seek", the nextString() method itself, is more lightweight.

on the other hand the big unknown is the distribution of the term dictionary itself.

I did a very basic implementation here, I'm hoping we can come up with better ideas that work well on average. One problem is, what is an "average" regular expression or wildcard query :)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

benchmark results from mike's idea. I don't use any heuristic, just remove the extra 'next' to show the tradeoffs.

disclaimer: against trunk with #3151

Pattern	Iter	AvgHits	AvgMS	AvgMS (noNext)
N?N?N?N	10	1000.0	37.5	28.4
?NNNNNN	10	10.0	6.4	6.1
??NNNNN	10	100.0	9.6	9.2
???NNNN	10	1000.0	52.7	40.9
????NNN	10	10000.0	300.7	262.3
NN??NNN	10	100.0	4.9	4.1
NN?N*	10	10000.0	9.6	28.9
?NN*	10	100000.0	80.4	235.4
*N	10	1000000.0	3811.6	3747.5
*NNNNNN	10	10.0	2098.3	2221.9
NNNNN??	10	100.0	2.2	2.4

Mike my gut feeling, which will require a lot more testing, is that if the automaton accepts a finite language (in the wildcard case, no *), we should not do the next() call. but more benchmarking is needed, with more patterns, especially on flex branch to determine if this heuristic is best.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

in this patch, if the automaton is finite, always seek. if its infinite, keep reading terms sequentially until a term fails (then seek)

it seems to be the best of both worlds, and makes perfect sense to me.

only thing that has me nervous is that SpecialOperations.isFinite() is defined recursively... will have to look into maybe trying to write this method iteratively, in case someone builds some monster automaton from a 2 page regexp or something like that.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

sorry, wrong file. getting lost in iterations of this patch.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

(Doing next() at that point is most likely a waste, and anyway the enum will turn your seek into a next if it's "close")

Actually – I just remembered – flex branch is failing to do this optimization (there's already a nocommit reminding us to do it). Ie, it's always doing the binary search through the indexed terms... and not doing a scan when it determines the term you're seeking to is within the same index block.

But I don't think this'll impact your tests with a large suffix since each seek will jump way ahead to a new index block.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

But I don't think this'll impact your tests with a large suffix since each seek will jump way ahead to a new index block.

Mike, if you add that optimization, that takes care of lucene itself, its smart enough to turn a seek into a read when it should, so I you might say I should simplify my code and just always seek. But if I were to do this, then that would kill the TermRef comparison speedup, because then no matter how much i optimize "my seek" nextString(), it needs to do the unicode conversion, which we have seen is expensive across many terms.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

But if I were to do this, then that would kill the TermRef comparison speedup, because then no matter how much i optimize "my seek" nextString(), it needs to do the unicode conversion, which we have seen is expensive across many terms.

Right, I think ATE must still pick & choose when to seek itself vs seek Lucene, based on how costly nextString() is...

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Make that "when to seek itself vs next() Lucene"

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mike by the way, I profiled the seeking on trunk, right at the top with 20% in hprof is the SegmentTermEnum clone... this is why at least for now, and on flex for different reasons, I think we should keep this stupid heuristic.

But its improved now, because at least its using some knowledge of the DFA (whether or not it contains loops) to make this determination, thanks for the idea!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Mike by the way, I profiled the seeking on trunk, right at the top with 20% in hprof is the SegmentTermEnum clone... this is why at least for now, and on flex for different reasons, I think we should keep this stupid heuristic.

OK let's keep it for now? But somehow we need to remember when this gets onto flex branch to put it back...

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

OK let's keep it for now? But somehow we need to remember when this gets onto flex branch to put it back...

I guess what I am saying is I think the latest patch, which uses the isFinite() property of the DFA to determine whether or not to seek itself versus trying next(), is the best for both trunk and flex, but for different reasons?

edit: the different reasons being: seek is expensive in trunk, because of the SegmentTermEnum clone() seek is "expensive" in flex, because doing a seek when we are in a loop entails unicode conversion, but next() avoids this with TermRef comparison.

It gives the best of all the scores from https://issues.apache.org/jira/browse/LUCENE-1606?focusedCommentId=12782198&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12782198

In the future this could be refined, such that whether to try the extra next() or go ahead and seek() could instead be driven by whether or not we are 'ping-ponging against a loop', i.e. actually pingponging against a wildcard *, rather than computed up-front for the entire query. this can be determined from the state/transitions of the path being evaluated, but its not a one-liner!

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mike,

This is ok for the trunk, but I have a question about \uFFFF in flex (I guess we do not need to figure it out now, just think about it). My understanding is that now \uFFFF can be in the index, and I can seek to it (it won't get replaced with \uFFFD). From your comment this seems undefined at the moment, but for this enum I need to know, otherwise it will either skip \uFFFF terms, or go into a loop.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

My understanding is that now \uFFFF can be in the index, and I can seek to it (it won't get replaced with \uFFFD).

Yes, \uFFFF should be untouched now (though I haven't verified – actually I'll go add it to the test we already have for \uFFFF).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Thanks Mike, I will change the enum to reflect this. Currently I cheat and take advantage of this property (in trunk) to make the code simpler.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

here is an updated patch. In my opinion all that is needed is to add the random testing, and code reorganization (but no algorithmic/feature changes), maybe better docs too.

This patch has:

more tests, especially surrogate handling
some more automaton paring
the seek positions and common suffix are always valid UTF-8 strings
backtracking is aware of U+FFFF, so terms can have it.

This patch works correctly on both trunk and flex, although I don't have an included test for U+FFFF since I can't put it in a trunk index.

I'm not too terribly happy about the way the unicode handling works here, its correct but could be coded better. The code I wrote is not the easiest to read and suggestions welcome :) But it is all localized to one method, cleanupPosition(). This is defined as

* if the seek position cannot be converted to valid UTF-8,
* then return the next valid String (in UTF-16 sort order) that
* can be converted to valid UTF-8.

if you have ideas on how to make this nicer I am happy to hear them.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

edit: edit out my chinese chars and replaced with <chineseCharOutsideBMP> as there are some problems indexing this comment.

btw the unicode complexity i mention here is not just me being anal, its an impedence mismatch between the automaton library using UTF-16 code unit representation and the new flex api requiring valid UTF-8.

I am not trying to introduce complexity for no reason, here is an example:

RegExp re = new RegExp("(<chineseCharOutsideBMP>|<chineseCharOutsideBMP>)");
System.out.println(re.toAutomaton());

initial state: 0
state 0 [reject]:
  \ud866 -> 2
state 1 [accept]:
state 2 [reject]:
  \udf05-\udf06 -> 1

as you can see, the automaton library handles these characters correctly, but as code units. so its important not to seek to invalid locations when walking thru the DFA, because these will be replaced by U+FFFD, and terms could be skipped, or we go backwards, creating a loop. Thats why i spent so much time on this.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I added random testing for wildcards and regexps. Don't know what else needs to be done here, please review if you can.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I'll play with it some for one. Fantastic commenting man - this whole patch is pretty darn thorough.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attached is a port of the latest trunk patch to flex branch, for experimenting or whatever.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

btw this patch is a bit different than the last port to flex in one way. Like the trunk patch the commonSuffix is only computed for "linear mode" aka slow queries. but computing this for flex will be a win even for faster queries in "smart mode", because it can dodge more unicode conversion with TermRef byte[] comparison.

the problem is my implementation of getCommonSuffix() is a little crappy, reverse the entire automaton, redeterminize, take its common prefix, then reverse that. instead, improving reverse() so that it keeps determinism when its already a DFA (DFA->DFA) will allow this commonSuffix to be used in both modes without any concern that it will ever hurt performance.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here a flex patch for automaton. It contains #3186, as soon as 2110 is committed I will upload a new patch. But its hard to differentiate between all modified files.

Robert: Can you do performance tests with the old and new flex patch, I do not want to commit 2110 before.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Robert: Can you do performance tests with the old and new flex patch, I do not want to commit 2110 before.

Uwe I will run a benchmark on both versions!

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

New patch, there was a lost private field. Also changed the nextSeekTerm method to be more straigtForward.

Robert: Sorry, it would be better to test this one g

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hi Uwe, I ran my benchmarks, and with your patch the performance is the same.

But the code is much simpler and easier to read... great work.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

An update with the changed nextSeekTerm() semantics from #3186.

Robert: Can you test performance again and compare with old?

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

There was a bug in the patch before, sorry. I will finish work for today, I am exhausted like the enums.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Stop everything I get a collaps!!!!! Again wrong patch.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Now the final one.

I somehow need a test enum which does very strange things like seeking forward and backwards and returning all strange stati.

Will think about one tomorrow.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

The new WildcardQuery is holding up very well under random testing -

I'm comparing the results of the old WildcardQuery impl with the new WildcardQuery impl.

I'm using a 2 million doc english and 2 million doc french index. (wikipedia dumps)

Generating random queries - both random short strings built up from random unicode chars mixed with some random wildcards, and random english/french words from dictionaries, randomly chopped or not, with random wildcards injected. A whole lot of crazy randomness.

They have always produced the same number of results so far (a few hours of running).

The new impl is generally either a bit faster in these cases, or about the same - at worst (in general), I've seen it about .01s slower. When its faster, its offten > .1s faster (or more when a few '?' are involved).

On avg, I'd say the perf is about the same - where the new impl shines appears to be when '?' is used (as I think Robert has mentioned).

So far I haven't seen any anomalies in time taken or anything of that nature.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mark, thanks for testing!

Yes, the new wildcard should really only help for ? with trunk (especially leading ?) With flex it should help a lot more, even leading * gets the benefit of "common suffix" and byte[] comparison and things like that. This code is in the trunk patch but does not really help yet because trunk enum works on String.

btw how many uniq terms is the field you are testing... this is where it starts to help with ?, when you have a ton of unique terms. But I am glad you are testing with hopefully a smaller # of uniq terms, this is probably more common.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Here is the patch with the getEnum/getTermsEnum changes instead of rewrite but with reverted #3186, which was stupid.

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

how many uniq terms is the field you are testing

I'm not sure at the moment - but its wikipedia dumps, so I'd guess its rather high actually. It is hitting the standard analyzer going in (mainly because I didn't think about changing it on building the indexes). And the queries are getting hit with the lowercase filter (stole the code anyway).

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

again - krr to the hell with the AM/PM bug in JIRA! It is ****xxx**

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not sure at the moment - but its wikipedia dumps, so I'd guess its rather high actually.

See the description, I created this for working mainly regexp on indexes with 100M+ unique terms. Wildcard doesn't get as much benefit, except ? operator and the comparisons being faster (array-based DFA)

I'm pleased to hear its doing so well on such a "small" index as wikipedia, as I would think automata overhead would make it slower (although this can probably be optimized away)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I'm not sure at the moment - but its wikipedia dumps, so I'd guess its rather high actually.

I looked at the wikipedia dump in benchmark (when indexed with standardanalyzer), body only has 65k terms... I think thats pretty small :) I do not think automaton will help much with such a small number of terms, its definitely a worst case benchmark you are performing. I think very little time is probably spent here in term enumeration so scalability does not matter for that corpus.

More interesting to see the benefits would be something like indexing geonames data (lots of terms), or even that (much smaller) persian corpus i mentioned with nearly 500k terms...

asfimport commented 14 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I think thats pretty small

Okay, fair enough ;) Guess it depends on your idea of small - though I would have guess (wrongly it appears), that it would be more. One diff is that I think the bechmark uses a 200mb (zipped) or so dump by default? I'm using a 5 gig dump - though that prob doesn't add too many more in the scheme of things.

More interesting to see the benefits...

Right, but I'm not really testing for benefits - more for correctness and no loss of performance (on a fairly standard corpus). I think the benches you have already done are probably plenty good for benefits testing.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Right, but I'm not really testing for benefits - more for correctness and no loss of performance (on a fairly standard corpus). I think the benches you have already done are probably plenty good for benefits testing.

oh ok, I didnt know. Because my benchmark as Mike said, is definitely very "contrived".

But its kind of realistic, there are situations where the number of terms compared to the number of docs is much higher (maybe even 1-1 for unique product ids and things like that).

I am glad you did this test, because I was concerned about the "small index" case too. And definitely correctness....

I think you are right about the partial dump. I am indexing the full dump now (at least I think). I will look at it too, at least for curiousity sake.

apache / lucene

Automaton Query/Filter (scalable regex) [LUCENE-1606] #2680

3186

3187

3166