PCRE2Project / pcre2

PCRE2 development is now based here.
Other
922 stars 194 forks source link

JIT/non-JIT match difference with newlines when anchored, firstline #323

Closed addisoncrump closed 1 year ago

addisoncrump commented 1 year ago

Found with #322.

The regex \n does not match the same strings when used with and without JIT when the anchored and firstline options are enabled.

sh-5.2$ xxd newline_crash 
00000000: 0000 0000 0001 0080 0a                   .........
sh-5.2$ ./pcre2_fuzzer newline_crash
Encountered failure while performing match errorcode comparison; context:
Pattern/sample string (hex encoded): 0a
Compile options 80100100 never_backslash_c,anchored,firstline
Match options 00002000
Non-JIT'd operation emitted an error: no match
JIT'd operation did not emit an error.
1 matches discovered by JIT'd regex:
Match 0 (hex encoded): 0a

Documentation suggests that the JIT implementation is correct here (emphasis mine):

PCRE2_FIRSTLINE: If this option is set, the start of an unanchored pattern match must be before or at the first newline in the subject string following the start of matching, though the matched text may continue over the newline.

PCRE2_ANCHORED: If this bit is set, the pattern is forced to be "anchored", that is, it is constrained to match only at the first matching point in the string that is being searched (the "subject string"). This effect can also be achieved by appropriate constructs in the pattern itself, which is the only way to do it in Perl.

Not sure where to add the testcase here. This behaviour only appears when both anchored and firstline are set.

zherczeg commented 1 year ago

Regardless of this difference, setting both bit seems pointless, I would consider it an error.

PhilipHazel commented 1 year ago

I agree. The documentation states explicitly that firstline applies to unanchored patterns, so either it should be diagnosed as an error, or it should be ignored for anchored patterns. I favour ignoring it because patterns can become auto-anchored even without having the anchored flag set.

addisoncrump commented 1 year ago

So, what should change? Should I enforce that anchored and firstline are an invalid combination in the fuzzer, or does the behaviour of the library need to change?

PhilipHazel commented 1 year ago

The interpreter needs to change. I have not yet looked at the code but I suspect it might just be an off-by-one bug. Then we can update the documentation to point out that firstline is pointless for an anchored pattern. But I won't get to this till tomorrow or possibly later today as I have to go out now.

PhilipHazel commented 1 year ago

Investigation of pcre2_match() has discovered a sort of off-by-one bug, but somewhat more complicated because it's in the start-of-match optimizations (if you turn off optimization there's no bug). I believe I now know how to fix this (also in dfa_match) but I'm going to sleep on it and see if I still think the same tomorrow. :-)

PhilipHazel commented 1 year ago

My instinct not to rush ahead proved right. What I was doing yesterday was nonsense. The correct fix is much simpler and I have now committed it (52cc4ff).