apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.6k stars 1.01k forks source link

CompiledAutomaton performance for determining common suffix [LUCENE-8102] #9150

Open asfimport opened 6 years ago

asfimport commented 6 years ago

We're using the automaton package as part of Elasticsearch for doing regexp queries. Our business requires us to process rather complex regular expressions, for example (we have more complex examples, but this one illustrates the problem):

        (¦.)*(¦?[^¦]){1,10}ab(¦.)*(¦?[^¦]){1,10}c(¦.)*(¦?[^¦]){1,10}d

With a large enough value of maxDeterminizedStates, this works. The problem we're having is that the conversion of this regular expression to a CompiledAutomaton takes very long. Almost all of the time goes into determining the common suffix for the Automaton (which is "d" in this example) - calculated with a call to Operations.getCommonSuffixBytesRef.

This suffix is only used as an optimization. Skipping the calculation of this suffix allows us to process these kinds of queries.

Reaction from Mike McCandless on the mailing list: This is just an optimization; maybe we should expose an option to disable it? Or maybe we can find the common suffix on an NFA instead, to avoid determinization?


Migrated from LUCENE-8102 by Thomas Poppe

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

The optimization speeds up leading wildcard type-queries (such as wildcard "*foo" or regex ".*foo") which are far more common e.g. on the users list than the example presented on this issue, so I really think we should keep it.

The regex here is quite obscene, maybe its the thing that should be optimized? :)

asfimport commented 6 years ago

Robert Muir (@rmuir) (migrated from JIRA)

Also here are few more ideas:

Either way, I think its really important not to regress on common "stupid simple" queries like leading wildcards.