Inconsistency with COMMIT in positive lookaheads

PCRE2Project / pcre2

PCRE2 development is now based here.

Other

919 stars 191 forks source link

Inconsistency with COMMIT in positive lookaheads #352

Closed addisoncrump closed 11 months ago

addisoncrump commented 12 months ago

This is the most reduced I could get this testcase; apologies for it being a rather lengthy one.

  re> /a?(?=b(*COMMIT)c|)d/
data> bd
 0: d
data> bd\=no_jit
No match

It seems that a? is necessary and cannot be replaced with a*. c following (*COMMIT) is necessary. | in the lookahead is necessary. d is necessary.

PhilipHazel commented 11 months ago

Perl behaves the same as the interpreter, and gives no match, which I think is the right behaviour, but please tell me if I'm wrong.

zherczeg commented 11 months ago

Is something changed? Commits in asserts aborts only the assert, at least it was the story in the past. So matching 'b' fails (because the assert fails), but continuing the match from 'd' results with a match in jit.

zherczeg commented 11 months ago

Btw if I remove the a? both jit and interpreter (and perl) matches to 'd'. This is very confusing for me.

PhilipHazel commented 11 months ago

And me! I will take a closer look at the interpreter.

PhilipHazel commented 11 months ago

Aha. When a? is present, the pattern is compiled with "Starting code units: a b d", but without a? it has "First code unit = 'd'". This kind of behaviour is actually documented in pcre2pattern where the pattern /(*COMMIT)abc/ is discussed. This matches "xyzabc" both in the interpreter and Perl, but not in the interpreter if optimizations are turned off. Here's another oddity: if you run with -ac (auto callout) JIT and the interpreter agree for the original pattern.

zherczeg commented 11 months ago

'd' is included in both code unit sets, and 'd' should be a valid match regardless of 'a?'. But is seems to me the interpreter does not start a match from 'd'. Am I missing something?

PhilipHazel commented 11 months ago

The interpreter has all three letters a,b,d as possible starting points, so it starts at b. But why is JIT different with auto-callout?

zherczeg commented 11 months ago

I believe I understand this issue now. 1) Commit is not caught by positive asserts. I forgot about this. This is why I talked about it first, but nobody corrected me. 2) JIT is smart enough to recognize that the pattern must start with a or d, so it ignores the 'b' character. Callouts affects this analysis (negatively, I don't remember the rules).

So Philip was right. Start optimizations may confuse certain regex constructs, and this is acceptable. If proper execution is really needed, the no_start_optimize flag can be used:

/a?(?=b(*COMMIT)c|)d/no_start_optimize yields no match for both interpreter and jit.

So this issue will be a wontfix.

PhilipHazel commented 11 months ago

Actually, I have "fixed" it! Sort of. I realized that the logic for determining possible starting code units could be improved a little bit. It now ignores positive assertions if the following item sets a mandatory starting character, so this pattern now just sets 'a' and 'd' as starting code units, instead of 'a', 'b', and 'd'. This makes the interpreter behave the same as the JIT (in the absence of callouts). (Just realized that I can probably improve things more by taking callouts into account.)

PhilipHazel commented 11 months ago

Now also works with callouts. I'm going to close this issue as completed.