apache / lucene

Apache Lucene open-source search software
https://lucene.apache.org/
Apache License 2.0
2.57k stars 1.01k forks source link

Regex Query with Backreferences [LUCENE-7411] #8464

Open asfimport opened 8 years ago

asfimport commented 8 years ago

Hi there,

I am currently working on a Regex Engine that supports Backreferences while not losing determinism. It uses Memory Occurence Automata (MOAs) in the engine which are more powerful than normal DFA/NFAs. The engine does no backtracking and recognizes Regexes that cannot be evaluated deterministically as malformed. It has become more and more mature in the last few weeks and I also implemented a Lucene Query that uses these Patterns in the background. Now my question is: Is there any interest for this work to be merged (or adapted) into Lucene core?

EDIT:

The current state is only a mere proof of concept. The performance can probably be improved by a lot by adapting concepts of the Lucene Regexp Query. As Uwe Schindler correctly stated, the Query currently is quite "dumb" as in it doesn't predict what terms to match next.

https://github.com/s4ke/moar

Usage example for the Lucene Query:

https://github.com/s4ke/moar/blob/master/lucene/src/test/java/com/github/s4ke/moar/lucene/query/test/MoarQueryTest.java#L126

Cheers,

Martin


Migrated from LUCENE-7411 by Martin Braun, updated Aug 23 2016

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Hi Martin. I'm sure there'd be more interest in your question/ issue if you included some more information about the advantage of using these MOAs over regular (regular) expressions. A link to the paper that you're writing should be good start once it's published! :) The sample test case is indeed very simple so I don't see how different MOAs are from regular automata (is it a theoretical difference in the languages they accept; is it a practical difference in runtime behavior – much like with NFA/DFA traversal)?

asfimport commented 8 years ago

Martin Braun (migrated from JIRA)

MOAs are a different type of automaton. Their expressive power is greater than the regular theoretical model of NFAs/DFAs.

NFA/DFA Regex engines usually (as far as I know) either build a NFA and run it, build a NFA and construct the DFA, or simulate the NFA implicitly. Then there are the standard Java Regexes which don't really rely on any NFA/DFA logic and also do some special snoflakey stuff that Perl once did in special cases.

Now this is the dilemma: For some applications (like Lucene) you might want to allow Backreferences in your Regexes. But normal Perl Regexes are a beast to handle as matching with them is a NP hard problem (http://perl.plover.com/NPC/) and you probably don't want your index query to take extremely long and then yielding only some few results. The theoretical work my supervisor (it is him and a colleague of his that write the theoretical paper) does is looking into whether it is possible to keep some of the benefits of Backreferences by restricting the capabilities and still having more expressive power than regular NFA/DFA Regexes. Restriction means for example that Regexes like (?<x>a*)\1 are not allowed as we cannot tell whether we should match the next a to the group or not. But we allow stuff like (?<x>a*)|\1 because we know that we can safely add the next a to the matching group as well.

Also at the moment we don't differentiate between lazy or greedy group matching. If we can't tell what part of a regex we have to match the next character to, we consider this a non deterministic regex (The definition and handling of these cases in the library might change if people really need them).

This Query type is also not recommended as a drop in replacement for Lucenes Regexp Query, it is only meant for the people that really need backreference support in their regexes.

EDIT:

As of now it is also possible to create the automata by hand as the Regexes are not as powerful as the MOAs, but I don't know whether that has a real world application and is not only meant for people that want to mess with the automaton model.

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I just looked at the query code. Unfortunately, the TermsEnum there does not allow (like Lucene's original AutomatonTermsEnum) seeking through the terms based on the automaton structure. So basically if you execute the query it will match the terms fine, but it has to iterate all the terms in the index, which can be many: https://github.com/s4ke/moar/blob/master/lucene/src/main/java/com/github/s4ke/moar/lucene/query/MoarQuery.java#L65-L83

In contrast AutomatonTermsEnum (https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/index/AutomatonTermsEnum.java) allows to seek based on the Automaton structure (this allows "guessing" the next candidate term to match). E.g. if you have an automaton like "(http|ftp)://.*", the query will not iterate all terms that can never match at all (like those not starting with "h" or "f"), so it seeks over them. It can do this, because the terms index is also seen as an Automaton (a sorted list of terms).

asfimport commented 8 years ago

Martin Braun (migrated from JIRA)

Sorry, the query currently is really just a barebone call to the matching function. Please regard this just as a proof of concept for now :)

asfimport commented 8 years ago

Uwe Schindler (@uschindler) (migrated from JIRA)

I just wanted to ask, if this is possible to implement at all with MOAs!

asfimport commented 8 years ago

Martin Braun (migrated from JIRA)

I/We will have to look into it. But that's some great input :)

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

Thanks for the clarification, Martin. As for the theoretical part – I'd be really curious to see how MOAs are different to automata (in general); unless you can have some kind of recursive backreferences which would be indeed a strange, if theoretically interesting, concept!

To me personally it seems like backreferences are a very seldom used feature. Most people have very superficial (if any) knowledge of regular expressions; anything going beyond a simple alternative or glob-like match is an "advanced" scenario. Backreferences would, in my opinion, be of interest to perhaps .1% of that advanced audience... Not that it isn't interesting – I love automata theory – it's just my personal opinion that including such a feature wouldn't be of general use. I don't see any problem in you maintaining an add-on module to Lucene though (for those experts that need such a feature).

Speaking of automata. You mentioned Java's built-in regexes. They indeed don't transform the input expressions into (direct) automata, but they're really, really fast on the average case (and yes, exponentially slow in the worst case). This is by design. Most of the time the difference will be in favor of "fast on average" if you don't have adversarial scenarios (like users willing to DOS your service).

I just recently had a look at Russ Cox's re2 because we needed fast regex matching routine. There is a port of re2 to Java (simplified). The test was a compound expression consisting of a union of several hundred (or thousands) different (fairly trivial) regular expressions, matched against thousands of input sequences. A pattern detection engine, basically. Java's built-in regular expression engine was an order of magnitude (sic!) faster than anything else available, including the C version of re2... If you can showcase your engine to perform at the same level it'd be a splendid (research and practical) result.

asfimport commented 8 years ago

Martin Braun (migrated from JIRA)

If you know what you are doing and 100% know that the Regex won't blow up, I think the MOA engine can't compete (as of now? I mean, Java's Regexes are highly optimized).

asfimport commented 8 years ago

Martin Braun (migrated from JIRA)

About the "recurvisve" backreferences: If you mean something like this: "((?<y>\k<x>)(?<x>\k<y>a))+". Then yes, the engine (and the MOAs) supports these (I don't know of any real-world application of this though :D)

asfimport commented 8 years ago

Dawid Weiss (@dweiss) (migrated from JIRA)

I don't know of any real-world application of this though

Exactly...