Closed asfimport closed 14 years ago
Robert Muir (@rmuir) (migrated from JIRA)
patch
Robert Muir (@rmuir) (migrated from JIRA)
Here is an updated patch with AutomatonWildCardQuery.
This implements standard Lucene Wildcard query with AutomatonFilter.
This accelerates quite a few wildcard situations, such as ??(a|b)?cd*ef. Sorry, it provides no help for a leading *, but it definitely helps for a leading ?.
All wildcard tests pass.
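The wildcard-to-automaton idea can be sketched by translating a wildcard pattern into a regular expression and handing it to a DFA library. This is a minimal illustration, not the patch's code: the helper name wildcardToRegex is invented, and java.util.regex stands in for the brics automaton package.

```java
// Sketch: translate a Lucene-style wildcard into a regular expression.
// '?' matches any single character, '*' any sequence; everything else
// is escaped so it matches literally. Hypothetical helper, not the patch.
public class WildcardToRegex {
    static String wildcardToRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '?') sb.append('.');
            else if (c == '*') sb.append(".*");
            else if ("\\.[]{}()+-^$|".indexOf(c) >= 0) sb.append('\\').append(c);
            else sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String re = wildcardToRegex("te?t*");
        System.out.println(re);                    // te.t.*
        System.out.println("testing".matches(re)); // true
        System.out.println("team".matches(re));    // false
    }
}
```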
Mark Miller (@markrmiller) (migrated from JIRA)
Very nice Robert. This looks like it would make a very nice addition to our regex support.
Found the benchmarks here quite interesting: http://tusker.org/regex/regex_benchmark.html (though it sounds like your "special" enumeration technique makes this regex imp even faster for our uses?)
Robert Muir (@rmuir) (migrated from JIRA)
Oops, I did say in the javadocs that the score is constant (boost only), so when the wildcard pattern contains no wildcard characters and rewrites to a TermQuery, I wrap it with ConstantScoreQuery(QueryWrapperFilter) to ensure this.
Robert Muir (@rmuir) (migrated from JIRA)
Mark, yeah, the enumeration helps a lot; it means far fewer comparisons, plus brics is FAST.
Inside the AutomatonFilter I describe how it could possibly be done better, but I was afraid I would mess it up. It's affected somewhat by the size of the alphabet, so if you were using it against lots of CJK text, it might be worth it to instead use the State/Transition objects in the package. Transitions are described by min and max character intervals, and you can access the intervals in sorted order...
It's all so nice, but I figure this is a start.
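The min/max character intervals mentioned above can be pictured as follows. The names here (nextAcceptedChar, the hard-coded interval set) are invented, not the brics State/Transition API; the sketch only shows why sorted intervals let an enumerator find the next acceptable character without testing every character of the alphabet.

```java
// Sketch: transitions as sorted, disjoint [min,max] character intervals.
public class IntervalStep {
    static final char[][] SORTED = { {'a', 'c'}, {'f', 'h'}, {'m', 'p'} };

    // Smallest character >= probe that lies in some interval, or -1 if none.
    // With sorted intervals this is one linear (or binary) pass, independent
    // of alphabet size.
    static int nextAcceptedChar(char probe, char[][] intervals) {
        for (char[] iv : intervals) {
            if (probe <= iv[1]) return Math.max(probe, iv[0]);
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println((char) nextAcceptedChar('b', SORTED)); // b (inside a-c)
        System.out.println((char) nextAcceptedChar('d', SORTED)); // f (start of f-h)
        System.out.println(nextAcceptedChar('q', SORTED));        // -1 (no interval left)
    }
}
```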
Michael McCandless (@mikemccand) (migrated from JIRA)
Can this do everything that RegexQuery currently does? (Ie we'd deprecate RegexQuery)?
Robert Muir (@rmuir) (migrated from JIRA)
Mike, the thing it can't do is anything that cannot be determinized. However, I think you only need an NFA for capturing-group-related things:
http://oreilly.com/catalog/regex/chapter/ch04.html
One thing is that the brics syntax is a bit different: ^ and $ are implied, and I think some things need to be escaped. So I think it can do everything RegexQuery does, but maybe with different syntax.
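The anchoring difference can be demonstrated with java.util.regex, where Matcher.matches() has the whole-string semantics that brics implies with ^ and $, while Matcher.find() has the "match anywhere" behavior of a typical regex engine.

```java
import java.util.regex.Pattern;

// Illustrates implied anchoring: brics regexps always match the whole
// string, like Java's matches(); find() matches any substring.
public class AnchoringDemo {
    public static void main(String[] args) {
        Pattern p = Pattern.compile("ab.*");
        System.out.println(p.matcher("xxaby").find());    // true: substring match
        System.out.println(p.matcher("xxaby").matches()); // false: whole string must match
        System.out.println(p.matcher("abyy").matches());  // true
    }
}
```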
Uwe Schindler (@uschindler) (migrated from JIRA)
I looked into the patch; it looks good. Maybe it would be good to make the new AutomatonRegExQuery a subclass of MultiTermQuery. As you also seek/exchange the TermEnum, the needed FilteredTermEnum may be a little complicated, but you could do it the same way I will commit soon for TrieRange (#2676). The latest changes from #2677 make it possible to write a FilteredTermEnum that hands over to differently positioned TermEnums, as you do. With MultiTermQuery you get everything for free: constant score, Boolean rewrite, and optionally the Filter (which is not needed here, I think). And you could also override difference() in FilteredTermEnum to rank the hits. A note: the FilteredTermEnum created by TrieRange is not guaranteed to be ordered correctly according to Term.compareTo(), but this is not really needed for MultiTermQuery.
Robert Muir (@rmuir) (migrated from JIRA)
Uwe, I agree with you, with one caveat: for this functionality to work, the enum must be ordered correctly according to Term.compareTo().
Otherwise it will not work correctly...
Uwe Schindler (@uschindler) (migrated from JIRA)
It will work; that was what I said. For MultiTermQuery it need not be ordered: the ordering is irrelevant to it, since MultiTermQuery only enumerates the terms. TrieRange is an example of that; the order of its terms is not guaranteed (it is correct at the moment because of the internal implementation of splitLongRange(), but I tested it with the inverse order and it still worked). If you want to use the enum for something else, it will fail. The filters inside MultiTermQuery and the BooleanQuery rewrite do not need the terms to be ordered.
Robert Muir (@rmuir) (migrated from JIRA)
Uwe, I'll look and see how you do it for TrieRange.
If it can make the code for this simpler, that will be fantastic. Maybe by then I will also have figured out some way to cleanly and non-recursively use min/max character intervals in the state machine to decrease the number of seeks and optimize a little bit.
Uwe Schindler (@uschindler) (migrated from JIRA)
I committed TrieRange revision 765618. You can see the impl here: http://svn.apache.org/viewvc/lucene/java/trunk/contrib/queries/src/java/org/apache/lucene/search/trie/TrieRangeTermEnum.java?view=markup
Robert Muir (@rmuir) (migrated from JIRA)
Uwe, thanks. I'll think on this and on other improvements. I'm not really confident in my ability to make the code much cleaner at the end of the day, but it should get more efficient and get some things for free, as you say. For now it is working much better than a linear scan, and the improvements won't change the order, but might help a bit.
Do you think I should continue under this issue or create a separate one?
Uwe Schindler (@uschindler) (migrated from JIRA)
Let's stay with this issue!
Robert Muir (@rmuir) (migrated from JIRA)
OK, I refactored this to use FilteredTermEnum/MultiTermQuery as Uwe suggested.
On my big index it's actually faster without setting the constant score rewrite (maybe creating the huge bitset is expensive?).
I also changed the term enumeration to be a bit smarter, so it will now work well on a large alphabet like CJK.
Mark Miller (@markrmiller) (migrated from JIRA)
on my big index its actually faster without setting the constant score rewrite (maybe creating the huge bitset is expensive?)
That's surprising, because I have seen people state the opposite on a couple of occasions. Perhaps it has to do with how many terms are being enumerated?
Robert Muir (@rmuir) (migrated from JIRA)
It's ~700ms if I setConstantScoreRewrite(true); it's ~150ms otherwise...
Mark Miller (@markrmiller) (migrated from JIRA)
How many terms are being enumerated for the test? My guess is that for queries that turn into very large BooleanQueries, it can be much faster to build the filter, but for a smaller BooleanQuery or TermQuery, filter construction dominates?
Robert Muir (@rmuir) (migrated from JIRA)
~116,000,000 terms.
I've seen the same behavior with other Lucene queries on this index, where I don't care about score and thought a filter would be best, but queries still have the edge.
Robert Muir (@rmuir) (migrated from JIRA)
My test queries are ones that match around 50-100 out of those 116,000,000 terms... so maybe this helps paint the picture.
I can profile each one if you are curious?
Robert Muir (@rmuir) (migrated from JIRA)
Well, here it is, just for the record:
In the query case (fast), time is dominated by AutomatonTermEnum.next(), which is what I expect. In the filter case (slower), time is instead dominated by OpenBitSetIterator.next().
I've seen this with simpler (non-MultiTermQuery) queries before as well.
For this functionality I still like the constant score rewrite option because there is no risk of hitting the boolean clause limit.
Uwe Schindler (@uschindler) (migrated from JIRA)
For this functionality I still like the constant score rewrite option because there is no risk of hitting the boolean clause limit.
I thought about that, too. Maybe there is a possibility to do an auto-switch in MultiTermQuery: if a TooManyClauses exception is caught during the rewrite() method, it could fall back to returning the constant-score variant. The problem: the time spent iterating the terms until the exception is thrown is lost... Maybe we could store the iterated terms for reuse (if FilteredTermEnum or a wrapper like a BufferedTermEnum had something like the mark() option known from BufferedInputStream).
This is just an idea, but has nothing to do with this query, it affects all MultiTermQueries.
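The auto-switch idea can be sketched with simplified stand-ins for the Lucene types (MAX_CLAUSES, booleanRewrite, and the string return values are all invented; in real Lucene the exception is BooleanQuery.TooManyClauses):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: try the BooleanQuery-style rewrite; if the clause limit is
// exceeded, fall back to the constant-score/filter path.
public class RewriteFallback {
    static final int MAX_CLAUSES = 3;  // stand-in for BooleanQuery.getMaxClauseCount()

    static class TooManyClauses extends RuntimeException {}

    // Boolean-style rewrite: one clause per matching term, bounded by the limit.
    static String booleanRewrite(List<String> matchingTerms) {
        List<String> clauses = new ArrayList<>();
        for (String t : matchingTerms) {
            if (clauses.size() >= MAX_CLAUSES) throw new TooManyClauses();
            clauses.add(t);
        }
        return "BooleanQuery" + clauses;
    }

    static String rewrite(List<String> matchingTerms) {
        try {
            return booleanRewrite(matchingTerms);
        } catch (TooManyClauses e) {
            // The terms enumerated before the exception are lost here;
            // a real implementation might buffer them for reuse, as noted above.
            return "ConstantScoreQuery(filter)";
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite(List.of("a", "b")));            // BooleanQuery[a, b]
        System.out.println(rewrite(List.of("a", "b", "c", "d")));  // ConstantScoreQuery(filter)
    }
}
```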
Robert Muir (@rmuir) (migrated from JIRA)
Uwe: yes, I tried to think of some heuristics for this query to guess which would be the best method.
For example, if the language of the automaton is infinite (for example, built from a regular expression/wildcard with a * operator), it seems best to set constant score rewrite=true.
I didn't do any of this because I wasn't sure if this constant score rewrite option is something that should be entirely left to the user, or not.
Robert Muir (@rmuir) (migrated from JIRA)
Yes, I just verified: I can easily and quickly detect if the FSM can accept more than BooleanQuery.getMaxClauseCount() strings:
!Automaton.isFinite() || Automaton.getFiniteStrings(BooleanQuery.getMaxClauseCount()) == null
If you think its ok, I could set constant score rewrite=true in this case.
Uwe Schindler (@uschindler) (migrated from JIRA)
I didn't do any of this because I wasn't sure if this constant score rewrite option is something that should be entirely left to the user, or not.
Yes, it should normally be left to the user. And the slower filter on large indexes with only sparsely filled bitsets is related to #2610.
E.g., I did some comparisons for TrieRangeQuery on a 5 million doc index (integer field, 8-bit precision step, so about 400 terms per query); the filter is about twice as fast. But the ranges were random and hit about 1/3 of all documents on average per query, so the bitset is not so sparse. TrieRangeQuery is a typical example of a MultiTermQuery that also works well with Boolean rewrite, because the upper term count is limited by the precision step (for ints and 8 bits the theoretical, never reached, maximum is about 1700 terms; for lower precisionSteps even less).
Robert Muir (@rmuir) (migrated from JIRA)
Uwe, OK, based on your tests I tried some of my own... On my index, when the query matches less than 10-20% of the docs, the Query method is faster.
When it matches something over 20%, the Filter method starts to win.
Mark Miller (@markrmiller) (migrated from JIRA)
When refactoring multitermquery I tried just computing the bit set iterator on the fly. It did not appear to work out, but I wonder if there are cases where it would be a better option.
For example, if the language of the automaton is infinite (for example, built from a regular expression/wildcard with a * operator), it seems best to set constant score rewrite=true.
Okay, that starts to make more sense then. I think the reports that it was faster on some large indexes were based on wildcard queries (hard to remember 100%).
Mark Miller (@markrmiller) (migrated from JIRA)
If you think its ok, I could set constant score rewrite=true in this case.
I agree that it should just be left up to the user. It's probably not a good idea to change the scoring for what could appear to the user to be arbitrary queries.
Robert Muir (@rmuir) (migrated from JIRA)
Updated with smarter enumeration. I think this is mathematically the best you can get with a DFA.
For example, if the regexp is (a|b)cdefg, it knows to position at acdefg, then bcdefg, etc. If the regexp is (a|b)cd*efg, it can only position at acd, etc.
nextString() is now CPU-friendly: it walks the state transition character intervals in sorted order instead of brute-forcing characters.
Robert Muir (@rmuir) (migrated from JIRA)
This includes an alternative for another slow linear query: FuzzyQuery.
AutomatonFuzzyQuery creates a DFA that accepts all strings within an edit distance of 1.
On my 100M-term index this works pretty well: fuzzy: 251,219 ms; automaton fuzzy: 172 ms.
While it's true that it's limited to an edit distance of one, on the other hand it supports transposition and is fast.
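What the distance-1 automaton accepts can be checked directly. This is a plain Damerau-Levenshtein distance <= 1 test (one insertion, deletion, substitution, or transposition), written from scratch for illustration; it is not the DFA construction used in the patch.

```java
// Accepts exactly the strings AutomatonFuzzyQuery's DFA would accept:
// those within Damerau-Levenshtein distance 1 of the target.
public class WithinOne {
    static boolean withinOne(String a, String b) {
        if (a.equals(b)) return true;
        int la = a.length(), lb = b.length();
        if (Math.abs(la - lb) > 1) return false;
        if (la == lb) {
            int i = 0;                       // first differing position
            while (a.charAt(i) == b.charAt(i)) i++;
            if (a.substring(i + 1).equals(b.substring(i + 1))) return true; // substitution
            return i + 1 < la
                && a.charAt(i) == b.charAt(i + 1)
                && a.charAt(i + 1) == b.charAt(i)
                && a.substring(i + 2).equals(b.substring(i + 2));           // transposition
        }
        // lengths differ by one: insertion/deletion; make s the shorter string
        String s = la < lb ? a : b, t = la < lb ? b : a;
        int i = 0;
        while (i < s.length() && s.charAt(i) == t.charAt(i)) i++;
        return s.substring(i).equals(t.substring(i + 1));
    }

    public static void main(String[] args) {
        System.out.println(withinOne("lucene", "lucine")); // true: substitution
        System.out.println(withinOne("lucene", "luceen")); // true: transposition
        System.out.println(withinOne("lucene", "lucen"));  // true: deletion
        System.out.println(withinOne("lucene", "lucnie")); // false: distance 2
    }
}
```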
Robert Muir (@rmuir) (migrated from JIRA)
I found this interesting article applicable to this query: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652
"We show how to compute, for any fixed bound n and any input word W, a deterministic Levenshtein-automaton of degree n for W in time linear in the length of W."
Eks Dev (migrated from JIRA)
Robert, in order for Levenshtein automata to work, you need to have the complete dictionary as a DFA. Once you have the dictionary as a DFA (or any sort of trie), computing simple regexes or a simple fixed or weighted Levenshtein distance becomes a snap. Levenshtein automata are particularly fast at it; a much simpler and only slightly slower method (one page of code) is K. Oflazer's: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.3862
As said, you cannot really walk the current term dictionary as an automaton/trie (or do you have an idea how to do that?). I guess there are enough applications where storing the complete term dictionary in a RAM DFA is not a problem. Even making some smart (heavily cached) persistent trie/DFA should not be all that complex.
Or did you intend just to iterate all terms and compute the distance faster ("break the LD matrix computation as soon as you see you hit the boundary")? But this requires iteration over all terms?
I have done something similar, in memory, but unfortunately someone else paid me for it and is not willing to share...
Robert Muir (@rmuir) (migrated from JIRA)
eks:
The AutomatonTermEnumerator in this patch does walk the term dictionary according to the transitions present in the DFA. That's what this JIRA issue is all about to me: not iterating all the terms! So you do not need the complete dictionary as a DFA.
For example: a regexp query of (a|b)cdefg with this patch seeks to 'acdefg', then 'bcdefg', as opposed to the current regex support, which exhaustively enumerates all terms.
A slightly more complex example: a query of (a|b)cd*efg first seeks to 'acd' (because of the Kleene star operator). Suppose it then encounters the term 'acda'; it will next seek to 'acdd', etc. If it encounters 'acdf', it next seeks to 'bcd'.
This patch implements regex, wildcard, and fuzzy with n=1 in terms of this enumeration. What it doesn't do is fuzzy with arbitrary n!
I used the simplistic quadratic method to compute the DFA for fuzzy with n=1 for the AutomatonFuzzyQuery present in this patch; the paper has a more complicated but linear method to compute the DFA.
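The seek-driven enumeration can be mimicked on a toy dictionary, with TreeSet.ceiling standing in for TermEnum repositioning. The seek targets below are the ones a DFA for (a|b)cdefg would compute; everything else (names, dictionary contents) is illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;

// Toy model of DFA-guided enumeration: each ceiling() call is one "seek"
// of the TermEnum; terms between seek targets are never compared.
public class SeekDemo {
    static List<String> enumerate(TreeSet<String> terms, String[] seekTargets, Pattern accept) {
        List<String> hits = new ArrayList<>();
        for (String target : seekTargets) {
            String term = terms.ceiling(target);   // one seek per target
            if (term != null && accept.matcher(term).matches()) hits.add(term);
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeSet<String> dict = new TreeSet<>(List.of(
            "aaa", "acdefg", "apple", "banana", "bcdefg", "zebra"));
        // Seek targets a DFA for (a|b)cdefg would compute:
        String[] seeks = { "acdefg", "bcdefg", "ccdefg" };
        System.out.println(enumerate(dict, seeks, Pattern.compile("(a|b)cdefg")));
        // 3 seeks instead of 6 term comparisons
    }
}
```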
Eks Dev (migrated from JIRA)
Hmmm, sounds like a good idea, but I am still not convinced it would work for fuzzy.
Take a simple dictionary: one two three four
The query term is, e.g., "ana", right? With n=1, your DFA would be: {.na, a.a, an., an, na, ana, .ana, ana., a.na, an.a, ana.}, where the dot represents any character in your alphabet.
For the first element of the DFA (in expanded form) you would need to visit all terms, no matter how you walk the DFA... or am I missing something?
Where you could save time is the actual calculation of the LD matrix for terms that do not pass the automaton.
Robert Muir (@rmuir) (migrated from JIRA)
eks, well, it does work well for fuzzy n=1 (I have tested against my huge index).
For your simple dictionary it will do 3 comparisons instead of 4. This is because your simple dictionary is sorted in the index as: four one three two
When it encounters 'three' it will next ask for a TermEnum("una"), which will return null.
Give it a try on a big dictionary, you might be surprised :)
Robert Muir (@rmuir) (migrated from JIRA)
eks, in your example it does three comparisons instead of four (not much of a gain for this example, but a big gain on a real index).
This is because it doesn't need to compare 'two': after encountering 'three' it requests TermEnum("uana"), which returns null.
I hope you can see how this helps for a large index... (or I can try to construct a more realistic example)
Robert Muir (@rmuir) (migrated from JIRA)
eks, in case this makes the explanation a little better for your example: assume a huge term dictionary where, for simplicity, words start with a-zA-Z.
For each character in that alphabet it will look for 'Xana' and 'Xna' in the worst case. That's 110 comparisons to check all the words that don't start with 'a'. (The enumeration through the words that start with 'a' is a little more complex.)
If you have, say, 1M unique terms, you can see how doing something like 100-200 comparisons is a lot better than 1M.
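The 'ana' example can be replayed with a TreeSet as the term dictionary: after 'three' is rejected, a seek to "una" (here, ceiling) finds nothing, so 'two' is never compared.

```java
import java.util.List;
import java.util.TreeSet;

// Dictionary {one, two, three, four} sorts as [four, one, three, two].
// Seeking past 'three' to "una" returns null because "two" < "una",
// which is why the fuzzy enumerator skips the final comparison.
public class AnaExample {
    public static void main(String[] args) {
        TreeSet<String> dict = new TreeSet<>(List.of("one", "two", "three", "four"));
        System.out.println(dict);                // [four, one, three, two]
        System.out.println(dict.ceiling("una")); // null
    }
}
```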
Robert Muir (@rmuir) (migrated from JIRA)
Removed the use of MultiTermQuery's getTerm().
equals()/hashCode() are defined based upon the field and the language accepted by the FSM, i.e., a regex query of AB.*C equals() a wildcard query of AB*C because they accept the same language.
Mark Miller (@markrmiller) (migrated from JIRA)
This is a cool issue, but it hasn't found an assignee yet. We may have to push it to 3.1.
Any interest, Uwe?
Uwe Schindler (@uschindler) (migrated from JIRA)
I'll take it; I think it is almost finished. The only problem at the moment is bundling the external library in contrib. It is BSD-licensed; are there any problems with that?
If not, I can manage the inclusion into the regex contrib.
Mark Miller (@markrmiller) (migrated from JIRA)
I don't think there is a problem with BSD. I know Grant has committed a BSD-licensed stop word list in the past.
I've asked explicitly about it before, but got no response.
I'll try and dig a little, but Grant is the PMC head and he did it, so we wouldn't be in bad company...
Uwe Schindler (@uschindler) (migrated from JIRA)
Robert: I applied the patch locally; one test was still using @Override, fixed that. I only downloaded automaton.jar, not the source package.
Do you know if automaton.jar is compiled using -source 1.4 -target 1.4? (It was compiled using ant 1.7 and Java 1.6.) If not sure, I will try to build it again from source with the correct compiler switches. The regex contrib module is Java 1.4 until now. If automaton only works with 1.5, we should wait until 3.0 to release it.
Robert Muir (@rmuir) (migrated from JIRA)
Uwe, you are correct. I just took a glance at the automaton source code and saw StringBuilder, so I think it is safe to say it requires at least Java 1.5...
Uwe Schindler (@uschindler) (migrated from JIRA)
Doesn't seem to work, I will check the sources:
compile-core:
[javac] Compiling 12 source files to C:\Projects\lucene\trunk\build\contrib\regex\classes\java
[javac] C:\Projects\lucene\trunk\contrib\regex\src\java\org\apache\lucene\search\regex\AutomatonFuzzyQuery.java:11: cannot access dk.brics.automaton.Automaton
[javac] bad class file: C:\Projects\lucene\trunk\contrib\regex\lib\automaton.jar(dk/brics/automaton/Automaton.class)
[javac] class file has wrong version 49.0, should be 48.0
[javac] Please remove or make sure it appears in the correct subdirectory of the classpath.
[javac] import dk.brics.automaton.Automaton;
[javac] ^
[javac] 1 error
Uwe Schindler (@uschindler) (migrated from JIRA)
So I tend to move this to 3.0 or 3.1, because of the missing Java 1.4 support in the regex contrib (class file version 49.0 is Java 5 bytecode; the 1.4-targeted contrib expects 48.0).
Robert Muir (@rmuir) (migrated from JIRA)
Uwe, sorry about this.
I did just verify that automaton.jar can be compiled for Java 5 (at least it has no Java 1.6 dependencies), so perhaps this can be integrated in a later release.
Uwe Schindler (@uschindler) (migrated from JIRA)
I move this to 3.0 (and not 3.1), because it can be released together with 3.0 (contrib modules do not need to wait until 3.1).
Robert: you could supply a patch with StringBuilder toString() variants and all those @Override annotations uncommented. And it works correctly with 1.5 (I am working with 1.5 here locally; I hate 1.6...).
Robert Muir (@rmuir) (migrated from JIRA)
Uwe, OK.
Not to complicate things, but related to #2763 and Java 1.5: I could easily modify the wildcard functionality here to work correctly with supplementary characters.
This could be an alternative to fixing the WildcardQuery ? operator in core.
Otis Gospodnetic (@otisg) (migrated from JIRA)
Regarding the license - I think we already have BRICS in one of Nutch's plugins, so we should be OK with the BSD licensed jar in our repo.
./urlfilter-automaton/lib/automaton.jar
Uwe Schindler (@uschindler) (migrated from JIRA)
Robert: Do you want to take this again? It's yours, and contrib :-)
Attached is a patch for an AutomatonQuery/Filter (name can change if its not suitable).
While the out-of-the-box contrib RegexQuery is nice, I have some very large indexes (100M+ unique tokens) where queries are quite slow (2 minutes, etc.). Additionally, all of the existing RegexQuery implementations in Lucene are really slow if there is no constant prefix. This implementation does not depend upon a constant prefix, and runs the same query in 640ms.
Some use cases I envision:
The Filter uses the BRICS package (http://www.brics.dk/automaton/) to convert regular expressions into a DFA. Then, the filter "enumerates" terms in a special way, by using the underlying state machine. Here is my short description from the comments:
The Query simply wraps the Filter with ConstantScoreQuery.
I did not include the automaton.jar inside the patch but it can be downloaded from http://www.brics.dk/automaton/ and is BSD-licensed.
Migrated from LUCENE-1606 by Robert Muir (@rmuir), resolved Dec 09 2009 Attachments: automaton.patch, automatonMultiQuery.patch, automatonmultiqueryfuzzy.patch, automatonMultiQuerySmart.patch, automatonWithWildCard.patch, automatonWithWildCard2.patch, BenchWildcard.java, LUCENE-1606_nodep.patch, LUCENE-1606.patch (versions: 15), LUCENE-1606-flex.patch (versions: 12) Linked issues:
3186
3187
3166