Closed namfromvietnam closed 1 year ago
This happens because .{1,9}?
with a lazy quantifier ?
may not always behave exactly like Perl matching, but rather it is efficient POSIX matching with shortest matches to satisfy lazy quantification. POSIX matching with DFAs is more efficient than Perl, but the two use different matching rules. The POSIX longest leftmost matching rule changes after the lazy quantifier to be satisfied early. Therefore (?:bernard|bernstein|bern)
or (?:bern|bernstein|bernard)
make no difference with POSIX matching and both cut after bern
, but it does make a difference with Perl matching. Also .{1,10}?
can change this behavior. Perl-like matching (and POSIX with lazy quantifiers) is very sensitive to the pattern and input. For Perl matching use pcre2matcher.h
with PCRE2 or boostmatcher.h
with Boost.
Oh I see, thank you for your reply. One more question if you don’t mind me asking, you mentioned that changing 9 to 10 in the range would change the behaviour, would you mind explain?
Ah, no I meant something else I was thinking about when I wrote that. That will not change the behavior.
The behavior is predictable. When the matching pattern passes through a lazy quantifier, the longest match is no longer a priority for POSIX matching. Instead, a short match will do. However, if after the lazy quantifier a greedy operator is matched, then we are back at POSIX longest matching. For example, (?:hello|good morning).{1,10}?andrew (?:bern|bernstein|bernard)+
will match bernard
, because we force greedy +
matching.
The POSIX regex matching with DFAs is more efficient than Perl regex matching with NFAs (or similar), because DFA matching never backtracks when matching a pattern. With today's fast machines the performance may not be as noticeable as it used to be in the past, but there are pathological cases when Perl regex matching can blow up (note: just as DFA construction in pathological cases can blow up in DFA size, but matching is fast).
I'm thinking that (?:hello|good morning).{1,10}?andrew (?:bern|bernstein|bernard){1,1}
with {1,1}
at the end should actually override the lazy quantifier, but that is currently not the case, because {1,1}
is optimized away as a no-op. I should look into this to still override laziness.
I will update reflex later, so that {1,1}
overrides laziness. Perhaps that was a minor oversight on my side. It would be nicer if it were possible for the POSIX matching with lazy quantifiers to recognize alternations like (?:bern|bernstein|bernard)
as greedy, but I had tried that before to improve the algorithm, but that affected laziness negatively in ways that made laziness unpredictable.
Thank you for the detailed explanation!
Hello,
I'm writing to ask a question about the usage of
reflex::Matcher
to see if there's something I'm missing or mis-understood.Please consider the following code:
From my testing using regex101.com, the matching result is expected to be the entire input text
However, upon compiling and running this piece of code against release 3.3.2, the result I got is slightly different
Would you mind help pointing out if my code snippet above is missing some configuration? I'm trying to look through the documentation to find some clues, but haven't been successful so far.
Thank you