improve automaton performance by running on byte[] [LUCENE-2265]

asfimport commented 14 years ago

Currently, when enumerating terms, automaton must convert entire terms from flex's native utf-8 byte[] to char[] first, then step each char thru the state machine.

we can make this more efficient, by allowing the state machine to run on byte[], so it can return true/false faster.

Migrated from LUCENE-2265 by Robert Muir (@rmuir), resolved May 03 2010 Attachments: LUCENE-2265_pare.patch, LUCENE-2265_utf32.patch, LUCENE-2265.patch (versions: 11) Linked issues:

3500

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attached is a patch. at most it only improves performance around 10% for Latin-1 text. I haven't benchmarked non-Latin test yet.

Even if its better, I am not sure we want to do this (is the performance worth the complexity?), but all the tests pass.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I tested this with Hindi text as well, it gives no benefit there. So it only helps english, will leave the patch out here in case someone has a better idea :)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I discussed this situation with Mike McCandless and I think we might have something of a plan.

For reference, here is the problem:

In flex the terms are byte[] (typically UTF-8)
Automaton transitions work on UTF-16 intervals (char)
RunAutomaton is an array-compiled form that also works on UTF-16 (char[])
Because of this, we have a lot of unicode conversion overhead between byte[] and char[] hurting performance.

Here is the current idea:

Switch Automaton to work on UTF-32 intervals (int)
Create a method to convert a UTF-32 Automaton to an equivalent UTF-8 Automaton.
Create a UTF-8 RunAutomaton that works on byte[]
We could also create a UTF-32 RunAutomaton that works on codepoints, for use in analysis, etc.

This would have some nice benefits besides performance, for example a wildcard operator of "?" or regex operator of "." would match a real unicode codepoint, not a single code unit like it always has. So if somehow we can make this work, we would have better performance and better unicode support.

The trick is to do this UTF-32 DFA -> UTF-8 DFA conversion efficiently, especially keeping determinism, and not causing some nasty explosion

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

I probably missed something.

Why can't you create UTF-8 Automaton from the get go?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Why can't you create UTF-8 Automaton from the get go?

Because high-level, users want automaton transitions to represent real characters (eg regular expressions, wildcards, etc) and do not much care about bytes!

So the utf-16 Automaton/RunAutomaton pair makes sense for the library...

But utf-32 is still easy to work with high-level (we just represent codepoint intervals instead of codeunit), and utf-8 is faster for working with lucene.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

So? You aren't making some generic automaton library, are you? Get these user regexes/wildcards/etc, convert them to utf-8, build utf-8 automaton, run it against lucene data.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

I mean, high-level, users don't care about your automaton at all, much less transitions. They want their regexes and wildcards to work.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

So? You aren't making some generic automaton library, are you? Get these user regexes/wildcards/etc, convert them to utf-8, build utf-8 automaton, run it against lucene data.

This just pushes the complexity into the parsers. and yes, it makes sense to support high-level (char[]) operations with automaton too, such as analysis.

I encourage you to take a look at the existing code. In general a lot of parsers (see wildcard and regex) are implemented with primitive automata like 'makeAnyChar'. 'makeAnyByte' makes no sense.

So its generic in the sense that fuzzy, regex, wildcard, all of our users are defined on unicode characters. high level operations such as parsing, intersection, and union belong in utf16 or utf32 space, not with bytes.

bytes is an implementation detail, and we shouldnt operate on UTF-8 except behind the scenes.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I mean, high-level, users don't care about your automaton at all, much less transitions. They want their regexes and wildcards to work.

but you have it backwards. users want their regexes and wildcards to work. they want wildcard "?" or regex "." to match unicode characters, not bytes. no one cares about bytes.

asfimport commented 14 years ago

Earwin Burrfoot (migrated from JIRA)

Hmmmmm.

I'd say that your highlevel operations like intersection and union remain exactly the same regardless of the alphabet you're operating on. The primitive automata, like AnyChar will have to cease being so primitive and generate a pair of states instead of one, but that's essentially it - after primitive automatas are fixed to recognize utf-8 bytes, you don't even have to change parsing code that much.

The only true problem I see is that you lose the ability to operate on char[]. But then, I ask that again, do you write a generic library, or you borrowed automata code from one with a single aim of having fast lucene queries?

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Hmmmmm.

I'd say that your highlevel operations like intersection and union remain exactly the same regardless of the alphabet you're operating on. The primitive automata, like AnyChar will have to cease being so primitive and generate a pair of states instead of one, but that's essentially it - after primitive automatas are fixed to recognize utf-8 bytes, you don't even have to change parsing code that much.

The only true problem I see is that you lose the ability to operate on char[]. But then, I ask that again, do you write a generic library, or you borrowed automata code from one with a single aim of having fast lucene queries?

Well this is a borrowed library, but that doesnt really matter. The trick is that UTF-16 and UTF-32 are much more efficient for the kind of processing the high-level component needs: doing things like NFA->DFA conversion and minimization. Its much better to do some of these quadratic algorithms on high-level unicode instead of byte, and get a minimal DFA. At the same time the intervals represent real things, so its debuggable, etc.

So to me it makes perfect sense to change the transition's min/max from 'char' to 'int', which is trivial and won't require me to rewrite all the primitive automata. Things like NFA-DFA conversion will be actually faster, never slower for some text.

This gives us the opportunity to 'compile' to UTF-8 or UTF-32 RunAutomata (although for the latter we cannot use the classmap trick since the alphabet will be large). This way, it can be used effectively at both a high and low level, and the code is easy to maintain.

I can debug the code now with things like toString and toDot, I certainly cannot do this if the high-level code is changed to byte/UTF-8. It would be completely unmaintainable, and most likely slower overall due to doing quadratic things like determinism on exploded UTF-8 automata.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

as i looked at this, i noticed some unused functionality (numeric fractions and the like).

attached is a patch to remove it. I plan to commit soon.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Attached patch for first cut at APIs to convert a UTF32 automaton to UTF8.

There's a test case that verifies the conversion is working correctly. It seems to be working.

This patch is just the low level API, ie, converting one edge containing a UTF32 range. I still need to fix it to convert an entire UTF32 DFA... should be straightforward.

Also, I need to merge with Robert's int (UTF32) cutover and a UTF8RunAutomaton.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

attached is a totally scary patch to convert the entire automaton lib to utf-32... (i didnt mess with any search code and obviously it wont even compile with this patch)

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Patch w/ first cut at method to cutover whole UTF32 DFA -> UTF8 DFA (and... call determinize on it ;) ).

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

this is mike's patch + my patch + quick hack attempt... most but not all tests are passing :)

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

ok I think i made some serious progress here, but i did find a bug in the utf32 -> utf8 dfa convertor. The problem is it does not handle at least the case where the initial state is an accept state. I created a testcase for this (TestUTF32SpecialCase), and included the python code back, as i figure you will probably fix it there first.

I deleted the surrogate-seeking tests, like other nuances, if we switch to byte[] these won't behave the same, as these regexps are no longer defined.

remaining is to switch the slow fuzzy to use codepoint calculations (to be consistent with the fast one). by the way, its really silly we have to unicode-convert just to get length in chars for that score calculation... ugh!

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

The problem is it does not handle at least the case where the initial state is an accept state.

OK this is a simple fix in the UTF32ToUTF8.convert method – I didn't set isAccept on the initial state – new patch attached that fixes this.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Checkpointing progress from Robert & I on this issue...

The conversion is now done in Java.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Last patch was a bit stale – this one is current, and all tests pass.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

So here are the advantages of the current patch:

full unicode support (Regular Expression, Wildcard, Fuzzy). for example, wildcard ? means codepoint, not code unit.
support for matching all unicode forms easily (utf8, utf16, utf32).
easy to support both native utf8 terms sort order, but also utf8-in-utf16 like we have now. this is not feasible with the existing utf16 representation.
easy to safely do dfa operations on Automaton. this is because there are no surrogates anymore. for example we can safely reverse any automaton to take advantage of Solr's leading wildcard support (e.g. support "leading" regexps, too)
better compatibility with lucene, because automaton is in sync with the terms format (byte). This could lead to future optimizations like TermsEnum exposing the 'shared prefix' of a term with the previous enumerated term.

Unfortunately, there are currently a few disadvantages with the patch, but I think we can resolve these:

The linear fuzzy terms enum, from the old code, needs to be fixed and consistent and use utf32 calculations, too.
for huge dfas (eg fuzzy) there is some cost to the conversion, around 5ms one-time cost on my machine for very long strings. perhaps we can optimize some code here, its not blowing up though.

So in my opinion, the first thing should be resolved before committing, and the second is nice-to-have and shouldn't block the improvement.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch (from Robert & I) – I think this one is ready to commit!

The rest of FuzzyQuery (eg LinearFuzzyTermsEnum) is now cutover to code points (not UTF16 code units), and we've optimized various methods in the automaton package (especially det). Performance of automaton/fuzzy queries looks on par or a bit faster, compared to trunk.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

New patch (from Robert & I) – I think this one is ready to commit!

The rest of FuzzyQuery (eg LinearFuzzyTermsEnum) is now cutover to code points (not UTF16 code units), and we've optimized various methods in the automaton package (especially det). Performance of automaton/fuzzy queries looks on par or a bit faster, compared to trunk.

asfimport commented 14 years ago

Michael McCandless (@mikemccand) (migrated from JIRA)

Latest patch (from Robert!) – strengthens tests, fixes one but in how common suffix was created for AutomatonTermsEnum, cleanup.

asfimport commented 14 years ago

Robert Muir (@rmuir) (migrated from JIRA)

I think we are good to go here... we should look at getting this in soon, as it will then allow us to cutover to UTF-8 sort order.

asfimport commented 14 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

The Generics-Police and Java-5-Police fixed compilation errors/warnings in revision 940743.

apache / lucene

improve automaton performance by running on byte[] [LUCENE-2265] #3341

3500