Genivia / RE-flex

A high-performance C++ regex library and lexical analyzer generator with Unicode support. Extends Flex++ with Unicode support, indent/dedent anchors, lazy quantifiers, functions for lex and syntax error reporting and more. Seamlessly integrates with Bison and other parsers.
https://www.genivia.com/doc/reflex/html
BSD 3-Clause "New" or "Revised" License
520 stars 85 forks source link

Unexpected output from matcher #177

Closed namfromvietnam closed 1 year ago

namfromvietnam commented 1 year ago

Hello,

I'm writing to ask a question about the usage of reflex::Matcher to see if there's something I'm missing or mis-understood.

Please consider the following code:

#include <reflex/matcher.h>
#include <iostream>

int main()
{
    reflex::Pattern pattern("(?:hello|good morning).{1,9}?andrew (?:bernard|bernstein|bern)");
    reflex::Matcher matcher(pattern, "hello my andrew bernard");

    while(matcher.find() != 0)
        std::cout << "Found " << matcher.text() << std::endl;
}

From my testing using regex101.com, the matching result is expected to be the entire input text

hello my andrew bernard

However, upon compiling and running this piece of code against release 3.3.2, the result I got is slightly different

hello my andrew bern

Would you mind help pointing out if my code snippet above is missing some configuration? I'm trying to look through the documentation to find some clues, but haven't been successful so far.

Thank you

genivia-inc commented 1 year ago

This happens because .{1,9}? with a lazy quantifier ? may not always behave exactly like Perl matching, but rather it is efficient POSIX matching with shortest matches to satisfy lazy quantification. POSIX matching with DFAs is more efficient than Perl, but the two use different matching rules. The POSIX longest leftmost matching rule changes after the lazy quantifier to be satisfied early. Therefore (?:bernard|bernstein|bern) or (?:bern|bernstein|bernard) make no difference with POSIX matching and both cut after bern, but it does make a difference with Perl matching. Also .{1,10}? can change this behavior. Perl-like matching (and POSIX with lazy quantifiers) is very sensitive to the pattern and input. For Perl matching use pcre2matcher.h with PCRE2 or boostmatcher.h with Boost.

namfromvietnam commented 1 year ago

Oh I see, thank you for your reply. One more question if you don’t mind me asking, you mentioned that changing 9 to 10 in the range would change the behaviour, would you mind explain?

genivia-inc commented 1 year ago

Ah, no I meant something else I was thinking about when I wrote that. That will not change the behavior.

The behavior is predictable. When the matching pattern passes through a lazy quantifier, the longest match is no longer a priority for POSIX matching. Instead, a short match will do. However, if after the lazy quantifier a greedy operator is matched, then we are back at POSIX longest matching. For example, (?:hello|good morning).{1,10}?andrew (?:bern|bernstein|bernard)+ will match bernard, because we force greedy + matching.

The POSIX regex matching with DFAs is more efficient than Perl regex matching with NFAs (or similar), because DFA matching never backtracks when matching a pattern. With today's fast machines the performance may not be as noticeable as it used to be in the past, but there are pathological cases when Perl regex matching can blow up (note: just as DFA construction in pathological cases can blow up in DFA size, but matching is fast).

I'm thinking that (?:hello|good morning).{1,10}?andrew (?:bern|bernstein|bernard){1,1} with {1,1} at the end should actually override the lazy quantifier, but that is currently not the case, because {1,1} is optimized away as a no-op. I should look into this to still override laziness.

genivia-inc commented 1 year ago

I will update reflex later, so that {1,1} overrides laziness. Perhaps that was a minor oversight on my side. It would be nicer if it were possible for the POSIX matching with lazy quantifiers to recognize alternations like (?:bern|bernstein|bernard) as greedy, but I had tried that before to improve the algorithm, but that affected laziness negatively in ways that made laziness unpredictable.

namfromvietnam commented 1 year ago

Thank you for the detailed explanation!