Closed addisoncrump closed 11 months ago
Perl behaves the same as the interpreter, and gives no match, which I think is the right behaviour, but please tell me if I'm wrong.
Is something changed? Commits in asserts aborts only the assert, at least it was the story in the past. So matching 'b'
fails (because the assert fails), but continuing the match from 'd'
results with a match in jit.
Btw if I remove the a?
both jit and interpreter (and perl) matches to 'd'. This is very confusing for me.
And me! I will take a closer look at the interpreter.
Aha. When a? is present, the pattern is compiled with "Starting code units: a b d", but without a? it has "First code unit = 'd'". This kind of behaviour is actually documented in pcre2pattern where the pattern /(*COMMIT)abc/ is discussed. This matches "xyzabc" both in the interpreter and Perl, but not in the interpreter if optimizations are turned off. Here's another oddity: if you run with -ac (auto callout) JIT and the interpreter agree for the original pattern.
'd' is included in both code unit sets, and 'd' should be a valid match regardless of 'a?'. But is seems to me the interpreter does not start a match from 'd'. Am I missing something?
The interpreter has all three letters a,b,d as possible starting points, so it starts at b. But why is JIT different with auto-callout?
I believe I understand this issue now.
1) Commit is not caught by positive asserts. I forgot about this. This is why I talked about it first, but nobody corrected me.
2) JIT is smart enough to recognize that the pattern must start with a
or d
, so it ignores the 'b' character. Callouts affects this analysis (negatively, I don't remember the rules).
So Philip was right. Start optimizations may confuse certain regex constructs, and this is acceptable. If proper execution is really needed, the no_start_optimize flag can be used:
/a?(?=b(*COMMIT)c|)d/no_start_optimize
yields no match for both interpreter and jit.
So this issue will be a wontfix.
Actually, I have "fixed" it! Sort of. I realized that the logic for determining possible starting code units could be improved a little bit. It now ignores positive assertions if the following item sets a mandatory starting character, so this pattern now just sets 'a' and 'd' as starting code units, instead of 'a', 'b', and 'd'. This makes the interpreter behave the same as the JIT (in the absence of callouts). (Just realized that I can probably improve things more by taking callouts into account.)
Now also works with callouts. I'm going to close this issue as completed.
This is the most reduced I could get this testcase; apologies for it being a rather lengthy one.
It seems that
a?
is necessary and cannot be replaced witha*
.c
following(*COMMIT)
is necessary.|
in the lookahead is necessary.d
is necessary.