apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0

Automaton Query/Filter (scalable regex) [LUCENE-1606] #2680

Closed asfimport closed 14 years ago

asfimport commented 15 years ago

Attached is a patch for an AutomatonQuery/Filter (the name can change if it's not suitable).

While the out-of-the-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow (2 minutes, etc.). Additionally, all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend on a constant prefix, and runs the same query in 640ms.

Some use cases I envision:

  1. lexicography/etc on large text corpora
  2. looking for things such as urls where the prefix is not constant (http:// or ftp://)

The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:

 The algorithm here is pretty basic. Enumerate terms but instead of a binary accept/reject do:

 1. Look at the portion that is OK (did not enter a reject state in the DFA)
 2. Generate the next possible String and seek to that.

The Query simply wraps the filter with ConstantScoreQuery.
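
For readers who want to see the two steps above in motion, here is a minimal sketch. It is not the code from the patch: the DFA is hand-built for `(a|b)cd*efg` instead of coming from BRICS, a `TreeSet` stands in for the sorted term dictionary, and every name (`SeekingEnumSketch`, `nextSeek`, `smallestSuffix`) is made up for illustration. It also assumes the DFA has no dead states.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SeekingEnumSketch {
    // Hypothetical DFA: state 0 is the start state; every state is assumed
    // to be able to reach an accept state (no dead states).
    private final List<TreeMap<Character, Integer>> trans = new ArrayList<>();
    private final Set<Integer> accept = new HashSet<>();
    int seeks = 0; // how many times we seek into the term dictionary

    int addState(boolean accepting) {
        trans.add(new TreeMap<>());
        int id = trans.size() - 1;
        if (accepting) accept.add(id);
        return id;
    }

    void addTransition(int from, char c, int to) { trans.get(from).put(c, to); }

    boolean accepts(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            Integer next = trans.get(state).get(s.charAt(i));
            if (next == null) return false;
            state = next;
        }
        return accept.contains(state);
    }

    // Follow the smallest transitions until an accept state or a loop is hit.
    private String smallestSuffix(int state) {
        StringBuilder sb = new StringBuilder();
        Set<Integer> visited = new HashSet<>();
        while (!accept.contains(state) && visited.add(state)) {
            Map.Entry<Character, Integer> first = trans.get(state).firstEntry();
            sb.append(first.getKey());
            state = first.getValue();
        }
        return sb.toString();
    }

    // Seek target strictly greater than a rejected term, or null if exhausted.
    String nextSeek(String term) {
        int[] path = new int[term.length() + 1]; // path[i] = state after i chars
        int pos = 0;
        while (pos < term.length()) { // step 1: keep the portion that is OK
            Integer next = trans.get(path[pos]).get(term.charAt(pos));
            if (next == null) break;
            path[++pos] = next;
        }
        if (pos == term.length()) return term + smallestSuffix(path[pos]);
        for (int i = pos; i >= 0; i--) { // step 2: next larger transition, backtracking
            Map.Entry<Character, Integer> higher = trans.get(path[i]).higherEntry(term.charAt(i));
            if (higher != null) {
                return term.substring(0, i) + higher.getKey() + smallestSuffix(higher.getValue());
            }
        }
        return null;
    }

    List<String> matches(TreeSet<String> terms) {
        List<String> hits = new ArrayList<>();
        String target = smallestSuffix(0);
        while (target != null) {
            seeks++;
            String term = terms.ceiling(target); // "seek" in the sorted dictionary
            if (term == null) break;
            if (accepts(term)) {
                hits.add(term);
                target = term + '\u0000'; // smallest string after an accepted term
            } else {
                target = nextSeek(term);
            }
        }
        return hits;
    }

    static SeekingEnumSketch exampleDfa() { // (a|b)cd*efg
        SeekingEnumSketch d = new SeekingEnumSketch();
        for (int i = 0; i < 5; i++) d.addState(false);
        d.addState(true);
        d.addTransition(0, 'a', 1); d.addTransition(0, 'b', 1);
        d.addTransition(1, 'c', 2); d.addTransition(2, 'd', 2);
        d.addTransition(2, 'e', 3); d.addTransition(3, 'f', 4);
        d.addTransition(4, 'g', 5);
        return d;
    }

    public static void main(String[] args) {
        SeekingEnumSketch dfa = exampleDfa();
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
            "aardvark", "acddefg", "acdz", "acefg", "bcdefg", "bcx", "cat", "zebra"));
        System.out.println(dfa.matches(terms) + " found with " + dfa.seeks + " seeks");
    }
}
```

The point of the sketch is only that a rejected term is never followed by a plain `next()`: the enumerator computes a string no smaller than any remaining match and jumps there, so terms such as `cat` and `zebra` are never examined.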

I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.


Migrated from LUCENE-1606 by Robert Muir (@rmuir), resolved Dec 09 2009

Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606_nodep.patch, LUCENE-1606.patch (versions: 15), LUCENE-1606-flex.patch (versions: 12)

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

patch

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Here is an updated patch with AutomatonWildCardQuery.

This implements standard Lucene Wildcard query with AutomatonFilter.

This accelerates quite a few wildcard situations, such as ??(a|b)?cd*ef. Sorry, it provides no help for a leading *, but it definitely helps for a leading ?.

All wildcard tests pass.
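
As a rough illustration of why wildcards fit the same machinery: `?` and `*` map directly onto regular-expression operators. The sketch below uses `java.util.regex` purely as a stand-in to show the mapping; the patch builds an automaton instead, and the class and method names here are hypothetical.

```java
import java.util.regex.Pattern;

class WildcardToRegexSketch {
    // Translate a Lucene-style wildcard (? = exactly one character,
    // * = any run of characters) into an equivalent regular expression.
    static Pattern toPattern(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') {
                sb.append('.');
            } else if (c == '*') {
                sb.append(".*");
            } else {
                sb.append(Pattern.quote(String.valueOf(c))); // everything else is literal
            }
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    static boolean matches(String wildcard, String term) {
        return toPattern(wildcard).matcher(term).matches(); // whole-term match
    }

    public static void main(String[] args) {
        System.out.println(matches("??b?cd*ef", "xxbzcdef"));
        System.out.println(matches("?cd", "cd"));
    }
}
```

A leading `?` still pins down the length and the following characters, which is what lets a DFA-driven enumeration skip ahead; a leading `*` accepts any prefix, so there is nothing to seek on.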

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

Very nice Robert. This looks like it would make a very nice addition to our regex support.

Found the benchmarks here quite interesting: http://tusker.org/regex/regex_benchmark.html (though it sounds like your "special" enumeration technique makes this regex imp even faster for our uses?)

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Oops, I did say in the javadocs that the score is constant / boost only, so when the Wildcard has no wildcards and rewrites to a TermQuery, wrap it with ConstantScoreQuery(QueryWrapperFilter) to ensure this.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mark, yeah, the enumeration helps a lot: it means far fewer comparisons, plus brics is FAST.

Inside the AutomatonFilter I describe how it could possibly be done better, but I was afraid I would mess it up. It's affected somewhat by the size of the alphabet, so if you were using it against lots of CJK text, it might be worth it to instead use the State/Transition objects in the package. Transitions are described by min and max character intervals, and you can access the intervals in sorted order...

its all so nice but I figure this is a start.

asfimport commented 15 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Can this do everything that RegexQuery currently does? (Ie we'd deprecate RegexQuery)?

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Mike, the thing it can't do is stuff that cannot be determinized. However, I think you only need an NFA for capturing-group-related things:

http://oreilly.com/catalog/regex/chapter/ch04.html

One thing is that the brics syntax is a bit different. i.e. ^ and $ are implied and I think some things need to be escaped. So I think it can do everything RegexQuery does, but maybe different syntax is required.
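
To make the "^ and $ are implied" point concrete, here is a small analogy using `java.util.regex` rather than brics itself: `Matcher.matches()` requires the whole input to match, which behaves like implied anchors, while `Matcher.find()` accepts a match anywhere. The class name is made up, and the assumption is only that the automaton is run against the entire term.

```java
import java.util.regex.Pattern;

class ImpliedAnchorsSketch {
    // Whole-string semantics: behaves as if the pattern were ^...$
    static boolean wholeMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).matches();
    }

    // Substring semantics: the pattern may match anywhere inside the term
    static boolean substringMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).find();
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("ftp", "ftp://example"));     // whole term must match
        System.out.println(substringMatch("ftp", "ftp://example")); // prefix is enough
        System.out.println(wholeMatch("ftp.*", "ftp://example"));   // explicit .* needed
    }
}
```

So a pattern written for substring semantics typically needs an explicit `.*` on either side to mean the same thing under whole-term matching.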

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I looked into the patch; it looks good. Maybe it would be good to make the new AutomatonRegExQuery a subclass of MultiTermQuery as well. As you also seek/exchange the TermEnum, the needed FilteredTermEnum may be a little bit complicated. But you can do it the same way I will soon commit for TrieRange (#2676). The latest changes from #2677 make it possible to write a FilteredTermEnum that hands over to differently positioned TermEnums, like you do. With MultiTermQuery you get everything for free: ConstantScore, Boolean rewrite and optionally the Filter (which is not needed here, I think). And you could also override difference() in FilteredTermEnum to rank the hits. A note: the FilteredTermEnum created by TrieRange is not guaranteed to be ordered correctly according to Term.compareTo(), but this is not really needed for MultiTermQuery.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, I agree with you, with one caveat: for this functionality to work the Enum must be ordered correctly according to Term.compareTo().

Otherwise it will not work correctly...

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

It will work, that was what I said. For MultiTermQuery, it need not be ordered; the ordering is irrelevant for it, since MultiTermQuery only enumerates the terms. TrieRange is an example of that: the terms are not guaranteed to be ordered correctly (they are at the moment because of the internal implementation of splitLongRange(), but I tested it with the inverse order and it still worked). If you want to use the enum for something else, it will fail. The filters inside MultiTermQuery and the BooleanQuery do not need to have the terms ordered.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, i'll look and see how you do it for TrieRange.

if it can make the code for this simpler that will be fantastic. maybe by then I will have also figured out some way to cleanly and non-recursively use min/max character intervals in the state machine to decrease the amount of seeks and optimize a little bit.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I committed TrieRange revision 765618. You can see the impl here: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/trie/TrieRangeTermEnum.java?view=markup

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, thanks. I'll think on this and on other improvements. I'm not really confident in my ability to make the code much cleaner at the end of the day, but I can make it more efficient and get some things for free, as you say. For now it is working much better than a linear scan, and the improvements won't change the order, but might help a bit.

Do you think I should try to correct this issue or create a separate issue?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Let's stay with this issue!

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

OK, I refactored this to use FilteredTermEnum/MultiTermQuery as Uwe suggested.

On my big index it's actually faster without setting the constant score rewrite (maybe creating the huge bitset is expensive?)

I also changed the term enumeration to be a bit smarter, so it will work well on a large alphabet like CJK now.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> on my big index it's actually faster without setting the constant score rewrite (maybe creating the huge bitset is expensive?)

That's surprising, because I have seen people state the opposite on a couple of occasions. Perhaps it has to do with how many terms are being enumerated?

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

It's \~700ms if I call .setConstantScoreRewrite(true); it's \~150ms otherwise...

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

How many terms are being enumerated for the test? My guess is that for queries that turn into very large BooleanQueries, it can be much faster to build the filter, but for a smaller BooleanQuery or TermQuery, filter construction dominates?

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

\~ 116,000,000 terms.

I've seen the same behavior with other lucene queries on this index, where I do not care about score and thought filter would be best, but queries still have the edge.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

my test queries are ones that match like 50-100 out of those 116,000,000... so maybe this helps paint the picture.

i can profile each one if you are curious?

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

well here it is just for the record:

In the query case (fast), time is dominated by AutomatonTermEnum.next(). This is what I expect.
In the filter case (slower), time is instead dominated by OpenBitSetIterator.next().

I've seen this with simpler (non-MultiTermQuery) queries before as well.

For this functionality I still like the constant score rewrite option because there is no risk of hitting the boolean clause limit.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> For this functionality I still like the constant score rewrite option because there is no risk of hitting the boolean clause limit.

I thought about that, too. Maybe there will be a possibility to do an auto-switch in MultiTermQuery. If a BooleanQuery.TooManyClauses exception is caught during the rewrite() method, it could fall back to returning the ConstantScore variant. The problem: the time spent iterating the terms until the exception is thrown is lost... Maybe we could store the iterated terms for reuse (if FilteredTermEnum or a wrapper like BufferedTermEnum had something like the well-known mark() option from BufferedInputStream).

This is just an idea, but has nothing to do with this query, it affects all MultiTermQueries.
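
A minimal sketch of that auto-switch idea, under loud assumptions: it is plain Java with no Lucene types, it counts terms instead of catching an exception, and `RewriteFallbackSketch`, `Rewrite`, and `choose` are hypothetical names. The only thing it tries to show is that buffering the terms seen so far means the enumeration work is not thrown away when the limit is hit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class RewriteFallbackSketch {
    enum Rewrite { BOOLEAN, CONSTANT_SCORE }

    // Pull terms from the enumeration into 'buffered'. As soon as the buffer
    // exceeds the clause limit, give up on the boolean rewrite; the buffered
    // terms plus whatever is left in 'termEnum' can still feed a filter.
    static Rewrite choose(Iterator<String> termEnum, int maxClauses, List<String> buffered) {
        while (termEnum.hasNext()) {
            buffered.add(termEnum.next());
            if (buffered.size() > maxClauses) {
                return Rewrite.CONSTANT_SCORE;
            }
        }
        return Rewrite.BOOLEAN;
    }

    public static void main(String[] args) {
        List<String> buffered = new ArrayList<>();
        Iterator<String> terms = Arrays.asList("acdefg", "bcdefg", "ccdefg").iterator();
        System.out.println(choose(terms, 2, buffered) + " after buffering " + buffered);
    }
}
```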

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe: yes I tried to think of some heuristics for this query to guess which would be the best method.

For example, if the language of the automaton is infinite (for example, built from a regular expression/wildcard with a * operator), it seems best to set constant score rewrite=true.

I didn't do any of this because I wasn't sure if this constant score rewrite option is something that should be entirely left to the user, or not.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

yes, I just verified and can easily and quickly detect if the FSM can accept more than BooleanQuery.getMaxClauseCount() Strings.

!Automaton.isFinite() || Automaton.getFiniteStrings(BooleanQuery.getMaxClauseCount()) == null

If you think it's OK, I could set constant score rewrite=true in this case.
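
For anyone without the BRICS jar at hand, the idea behind that check can be sketched in plain Java. This is not the BRICS implementation: the DFA is hand-built, the names are hypothetical, and it assumes there are no dead states, so any cycle reachable from the start state means the language is infinite.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

class FiniteLanguageSketch {
    private final List<TreeMap<Character, Integer>> trans = new ArrayList<>();
    private final Set<Integer> accept = new HashSet<>();

    int addState(boolean accepting) {
        trans.add(new TreeMap<>());
        int id = trans.size() - 1;
        if (accepting) accept.add(id);
        return id;
    }

    void addTransition(int from, char c, int to) { trans.get(from).put(c, to); }

    // Number of accepted strings, or -1 if the language is infinite
    // or contains more than 'limit' strings.
    int countUpTo(int limit) { return count(0, new HashSet<>(), limit); }

    private int count(int state, Set<Integer> onPath, int limit) {
        if (!onPath.add(state)) return -1; // state repeats on this path: a cycle
        int total = accept.contains(state) ? 1 : 0;
        for (int target : trans.get(state).values()) {
            int sub = count(target, onPath, limit);
            if (sub < 0) return -1;
            total += sub;
            if (total > limit) return -1;
        }
        onPath.remove(state);
        return total > limit ? -1 : total;
    }

    // Builds (a|b)cdefg, or (a|b)cd*efg when starOnD is true.
    static FiniteLanguageSketch example(boolean starOnD) {
        FiniteLanguageSketch dfa = new FiniteLanguageSketch();
        String tail = "cdefg";
        int start = dfa.addState(false);
        int cur = dfa.addState(false);
        dfa.addTransition(start, 'a', cur);
        dfa.addTransition(start, 'b', cur);
        for (int i = 0; i < tail.length(); i++) {
            char c = tail.charAt(i);
            if (starOnD && c == 'd') { dfa.addTransition(cur, c, cur); continue; }
            int next = dfa.addState(i == tail.length() - 1);
            dfa.addTransition(cur, c, next);
            cur = next;
        }
        return dfa;
    }

    public static void main(String[] args) {
        System.out.println(example(false).countUpTo(1024)); // finite language
        System.out.println(example(true).countUpTo(1024));  // language with a loop
    }
}
```

A negative result plays the role of `!isFinite() || getFiniteStrings(max) == null`: either way the boolean rewrite could exceed the clause limit.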

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

> I didn't do any of this because I wasn't sure if this constant score rewrite option is something that should be entirely left to the user, or not.

Yes, it should normally be left to the user. And the slower filter on large indexes with only sparsely filled bitsets is related to #2610.

E.g. I did some comparisons for TrieRangeQuery on a 5 million document index, integer field, 8 bit precision step (so about 400 terms per query), and the filter is about twice as fast. But the ranges were random and hit about 1/3 of all documents on average per query, so the bitset is not so sparse. TrieRangeQuery is a typical example of a MultiTermQuery that also works well with Boolean rewrite, because the upper term count is limited by the precision step (for ints and 8 bit the theoretical, but never reached, maximum is about 1700 terms; for lower precisionSteps even less).

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, ok based on your tests I tried some of my own... on my index when the query matches like less than 10-20% of the docs Query method is faster.

when it matches something like over 20%, the Filter method starts to win.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

When refactoring multitermquery I tried just computing the bit set iterator on the fly. It did not appear to work out, but I wonder if there are cases where it would be a better option.

> For example, if the language of the automaton is infinite (for example, built from a regular expression/wildcard with a * operator), it seems best to set constant score rewrite=true.

Okay, that starts to make more sense then. I think the reports that it was faster on some large indexes were based on wildcard queries (hard to remember 100%).

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

> If you think its ok, I could set constant score rewrite=true in this case.

I agree that it should just be left up to the user. It's probably not a good idea to change the scoring for what to a user could appear to be arbitrary queries.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

updated with smarter enumeration. I think this is mathematically the best you can get with a DFA.

For example, if the regexp is (a|b)cdefg it knows to position at acdefg, then bcdefg, etc.
If the regexp is (a|b)cd*efg it can only position at acd, etc.

nextString() is now cpu-friendly: it walks the state transition character intervals in sorted order instead of brute-forcing characters.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

this includes an alternative for another slow linear query, fuzzy query.

automatonfuzzyquery creates a DFA that accepts all strings within an edit distance of 1.

on my 100M term index this works pretty well:

  - fuzzy: 251,219 ms
  - automatonfuzzy: 172 ms

while it's true it's limited to an edit distance of one, on the other hand it supports transposition and is fast.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

found this interesting article applicable to this query: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652

"We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W."

asfimport commented 15 years ago

Eks Dev (migrated from JIRA)

Robert, in order for Lev. Automata to work, you need to have the complete dictionary as a DFA. Once you have the dictionary as a DFA (or any sort of trie), computing simple regexes or simple fixed or weighted Levenshtein distance becomes a snap. Levenshtein automata are particularly fast at it; a much simpler and only slightly slower method (one page of code) is from K. Oflazer: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.3862

As said, you cannot really walk the current term dictionary as an automaton/trie (or do you have an idea on how to do that?). I guess there are enough applications where storing the complete term dictionary in a RAM DFA is not a problem. Even making some smart (heavily cached) persistent trie/DFA should not be all that complex.

Or you intended just to iterate all terms, and compute distance faster "break LD Matrix computation as soon as you see you hit the boundary"? But this requires iteration over all terms?

I have done something similar, in memory, but unfortunately someone else paid me for this and is not willing to share...

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

eks:

the AutomatonTermEnumerator in this patch does walk the term dictionary according to the transitions present in the DFA. That's what this JIRA issue is all about to me: not iterating all the terms! So you do not need the complete dictionary as a DFA.

for example: a regexp query of (a|b)cdefg with this patch seeks to 'acdefg', then 'bcdefg', as opposed to the current regex support which exhaustively enumerates all terms.

slightly more complex example: a query of (a|b)cd*efg first seeks to 'acd' (because of the Kleene star operator). suppose it then encounters term 'acda', it will next seek to 'acdd', etc. if it encounters 'acdf', then next it seeks to 'bcd'.

this patch implements regex, wildcard, and fuzzy with n=1 in terms of this enumeration. what it doesn't do is fuzzy with arbitrary n!

I used the simplistic quadratic method to compute a DFA for fuzzy with n=1 for the AutomatonFuzzyQuery present in this patch; the paper has a more complicated but linear method to compute the DFA.
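
To give a feel for what the n=1 language looks like, here is one way to enumerate it as a set of patterns, where `.` stands for any single character. This is only a guess at the spirit of a simple quadratic construction (about 3n patterns of length about n), not the code in the patch; the class and method names are invented, and it assumes the query word itself contains no literal `.`.

```java
import java.util.Set;
import java.util.TreeSet;

class FuzzyOnePatternsSketch {
    static final char ANY = '.'; // placeholder for "any one character"

    // All patterns within one insertion, deletion, substitution,
    // or transposition of the word, plus the word itself.
    static Set<String> patterns(String w) {
        Set<String> out = new TreeSet<>();
        out.add(w);
        for (int i = 0; i <= w.length(); i++) {
            String head = w.substring(0, i);
            out.add(head + ANY + w.substring(i));             // insertion at i
            if (i < w.length()) {
                out.add(head + w.substring(i + 1));           // deletion of char i
                out.add(head + ANY + w.substring(i + 1));     // substitution at i
            }
            if (i + 1 < w.length()) {                         // swap chars i and i+1
                out.add(head + w.charAt(i + 1) + w.charAt(i) + w.substring(i + 2));
            }
        }
        return out;
    }

    static boolean matches(String pattern, String term) {
        if (pattern.length() != term.length()) return false;
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != ANY && p != term.charAt(i)) return false;
        }
        return true;
    }

    static boolean withinOne(String word, String term) {
        for (String p : patterns(word)) {
            if (matches(p, term)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(patterns("ana"));
        System.out.println(withinOne("ana", "una"));
    }
}
```

The union of these patterns is a finite set of short strings with wildcards, which is why it can be turned into a DFA and then driven through the same seek-based enumeration as regex and wildcard queries.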

asfimport commented 15 years ago

Eks Dev (migrated from JIRA)

hmmm, sounds like good idea, but I am still not convinced it would work for Fuzzy

take simple dictionary: one two three four

query Term is, e.g. "ana", right? and n=1, means your DFA would be: {.na, a.a, an., an, na, ana, .ana, ana., a.na, an.a, ana.} where dot represents any character in your alphabet.

For the first element in DFA (in expanded form) you need to visit all terms, no matter how you walk DFA... or am I missing something?

Where you could save time is actual calculation of LD Matrix for terms that do not pass automata

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

eks, well it does work well for fuzzy n=1 (I have tested against my huge index).

for your simple dictionary it will do 3 comparisons instead of 4. this is because your simple dictionary is sorted in the index as such: four one three two

when it encounters 'three' it will next ask for a TermEnum("una") which will return null.

give it a try on a big dictionary, you might be surprised :)

– Robert Muir rcmuir@gmail.com

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

eks in your example it does three comparisons instead of four (not much of a gain for this example, but a big gain on a real index)

this is because it doesnt need to compare 'two', after encountering 'three' it requests TermEnum("uana"), which returns null.

i hope you can see how this helps for a large index... (or i can try to construct a more realistic example)

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

eks in case this makes it a little better explanation for your example, assume a huge term dictionary where words start with a-zA-Z for simplicity.

for each character in that alphabet it will look for 'Xana' and 'Xna' in the worst case. that's 110 comparisons to check all the words that don't start with 'a'. (the enumeration through all the words that start with 'a' is a little more complex).

if you have say, 1M unique terms you can see how doing something like 100-200 comparisons is a lot better than 1M.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

removed use of multitermquery's getTerm()

equals/hashcode are defined based upon the field and the language accepted by the FSM, i.e. regex query of AB.*C equals() wildcard query of AB*C because they are the same.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

This is a cool issue, but it hasn't found an assignee yet. We may have to push it to 3.1.

Any interest Uwe?

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I'll take it; I think it is almost finished. The only open question at the moment is bundling the external library, which is BSD licensed, in contrib. Are there any problems with that?

If not, I can manage the inclusion into the regex contrib.

asfimport commented 15 years ago

Mark Miller (@markrmiller) (migrated from JIRA)

I don't think there is a problem with BSD. I know Grant has committed a BSD licensed stop word list in the past.

I've asked explicitly about it before, but got no response.

I'll try and dig a little, but Grant is the PMC head and he did it, so we wouldn't be in bad company...

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Robert: I applied the patch locally, one test was still using @Override, fixed that. I did only download automaton.jar not the source package.

Do you know if automaton.jar is compiled using -source 1.4 -target 1.4 (it was compiled using ant 1.7 and Java 1.6)? If not sure, I will try to build it again from source and use the correct compiler switches. The regex contrib module is Java 1.4 until now. If automaton only works with 1.5, we should wait until 3.0 to release it.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, you are correct, I just took a glance at the automaton source code and saw StringBuilder, so I think it is safe to say it only works with 1.5...

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Doesn't seem to work, I will check the sources:

compile-core:
    [javac] Compiling 12 source files to C:\Projects\lucene\trunk\build\contrib\regex\classes\java
    [javac] C:\Projects\lucene\trunk\contrib\regex\src\java\org\apache\lucene\search\regex\AutomatonFuzzyQuery.java:11: cannot access dk.brics.automaton.Automaton
    [javac] bad class file: C:\Projects\lucene\trunk\contrib\regex\lib\automaton.jar(dk/brics/automaton/Automaton.class)
    [javac] class file has wrong version 49.0, should be 48.0
    [javac] Please remove or make sure it appears in the correct subdirectory of the classpath.
    [javac] import dk.brics.automaton.Automaton;
    [javac]                           ^
    [javac] 1 error
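
The `49.0` versus `48.0` in that error is the class file version: major version 48 corresponds to Java 1.4 and 49 to Java 5. As a small aside, the version can be read straight from the first eight bytes of any `.class` file. The sketch below uses a synthetic header rather than the real `Automaton.class`, and the class name is made up.

```java
import java.nio.ByteBuffer;

class ClassVersionSketch {
    // A class file starts with the magic number 0xCAFEBABE, followed by a
    // 2-byte minor version and a 2-byte major version (big-endian).
    static int majorVersion(byte[] classFile) {
        ByteBuffer buf = ByteBuffer.wrap(classFile); // ByteBuffer is big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                 // skip the minor version
        return buf.getShort() & 0xFFFF; // major version as an unsigned value
    }

    public static void main(String[] args) {
        // Synthetic header standing in for a class compiled with -target 1.5
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 49};
        System.out.println("major version: " + majorVersion(header));
    }
}
```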

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

So I tend to move this to 3.0 or 3.1, because of missing support in regex contrib.

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, sorry about this.

I did just verify automaton.jar can be compiled for Java 5 (at least it does not have java 1.6 dependencies), so perhaps this can be integrated for a later release.

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I move this to 3.0 (and not 3.1), because it can be released together with 3.0 (contrib modules do not need to wait until 3.1).

Robert: you could supply a patch with StringBuilder toString() variants and all those @Override annotations uncommented. And it works correctly with 1.5 (I am working with 1.5 here locally - I hate 1.6...).

asfimport commented 15 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Uwe, ok.

Not to try to complicate things, but related to #2763 and Java 1.5, I could easily modify the Wildcard functionality here to work correctly with supplementary characters.

This could be an alternative to fixing the WildcardQuery ? operator in core.

asfimport commented 15 years ago

Otis Gospodnetic (@otisg) (migrated from JIRA)

Regarding the license - I think we already have BRICS in one of Nutch's plugins, so we should be OK with the BSD licensed jar in our repo.

./urlfilter-automaton/lib/automaton.jar

asfimport commented 15 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

Robert: Do you want to take this again? It's yours and contrib :-)