Closed addisoncrump closed 11 months ago
I know this is coming from a fuzzer, but any random combination of symbols are not a valid regular expression, and IMHO "undefined behaviour" is a valid response, as PCRE2 is not a "validating" compiler.
What does the expression even mean?, and while you might be right that this might be a case where endanchored
is not properly handled (at least by one of the matching engines), I am not sure if the added complexity actually help with real use cases.
There is no undefined behaviour here. This pattern is simply an infinite loop, which can be detected at compile time. Of course the compiler cannot detect all cases, but it detects many. This is useful for the user since the pattern is probably wrong.
any random combination of symbols are not a valid regular expression
Understood, which is why I am filtering and manually reducing these testcases to representative, smallest reproducing testcases. As for why inconsistent behaviour is undesirable: this fuzzing strategy allows us to automatically and systematically discover incorrectness in either representation based on the other. While many of these bugs appear to be wild atypical usages on their own, resolving these issues allows for the automatic discovery of truly problematic cases, even if they do not appear in the test suite (see: #336).
IMO, if the behaviour is undefined, it should not be allowed. There is no reason to offer it to end users if there is no intent to keep it consistent. If it was truly the case that endanchored should not be used with recursive patterns, then it should produce a compile error.
I agree with @zherczeg that this is an infinite loop. The interpreter diagnoses this, but JIT does not. Is this a bug in JIT?
Here's a non-infinite-loop version with the same behaviour:
re> /|(?0)./auto_callout,endanchored
data> abcd
--->abcd
+0 ^ |
+1 ^ (?0)
+0 ^ |
+5 ^ .
+6 ^^ End of pattern
+1 ^ (?0)
+0 ^ |
+5 ^ .
+6 ^^ End of pattern
+5 ^^ .
+6 ^ ^ End of pattern
+1 ^ (?0)
+0 ^ |
+5 ^ .
+6 ^^ End of pattern
+5 ^^ .
+6 ^ ^ End of pattern
+5 ^ ^ .
+6 ^ ^ End of pattern
+1 ^ (?0)
+0 ^ |
+5 ^ .
+6 ^^ End of pattern
+5 ^^ .
+6 ^ ^ End of pattern
+5 ^ ^ .
+6 ^ ^ End of pattern
+5 ^ ^ .
+6 ^ ^ End of pattern
0: abcd
data> abcd\=no_jit
--->abcd
+0 ^ |
+1 ^ (?0)
+0 ^ |
+1 ^ (?0)
Failed: error -52: nested recursion at the same subject position
It does seem that the JIT does exhibit some infinite loop behaviour on /|(?0)/endanchored
, though. It does terminate ("JIT stack limit reached").
This is something I will definitely not change. It will fail with stack exhaustion.
I note that the behaviour of the interpreter and the JIT also differs here; see especially:
data> abcd
--->abcd
+0 ^ |
+1 ^ (?0)
+0 ^ |
+5 ^ .
vs
data> abcd\=no_jit
--->abcd
+0 ^ |
+1 ^ (?0)
+0 ^ |
+1 ^ (?0)
Is this difference expected/permissible? It seems that any left-recursing pattern will pass in JIT, but fail in the interpreter. Does the JIT strip out (?0)
when recursing inwards, then only if the match fails, recurses inward again?
I think it goes back to what I was saying originally (albeit not in a clear enough way) and that I though we needed to clarify to align better with expectations.
PCRE2 detects MOST misuses and returns an error that could be helpful to a user to identify the problem they had with their expressions, but it doesn't validate that the given expression is proper, so then it might end up showing different behaviour between the interpreter and JIT and that might not be a bug (specially when the only difference is the error code returned).
FWIW, that previous comment might not apply to this specific case, where we might have a bug, and Perl returns an error from a "translated" expression, which ironically also shows an error if applied to PCRE2.
% perl -Mre -e 'print "match\n" if shift =~ /(?0)*?.$/' abcd
Infinite recursion in regex at -e line 1.
% pcre2test -jit
PCRE2 version 10.42 2022-12-11
re> /(?0)*?.$/
data> abcd
Failed: error -46: JIT stack limit reached
data> abcd\=no_jit
Failed: error -52: nested recursion at the same subject position
data>
re> /(?0)*.$/ungreedy
data> abcd
Failed: error -46: JIT stack limit reached
data> abcd\=no_jit
Failed: error -52: nested recursion at the same subject position
but generates different code when endanchored
is involved, although also "consistent"
re> /(?0)*?./BI,endanchored
------------------------------------------------------------------
Bra
Braminzero
SBra
Recurse
KetRmin
Any
Ket
End
------------------------------------------------------------------
Capture group count = 0
Options: endanchored
Subject length lower bound = 1
data> abcd
0: abcd
data> abcd\=no_jit
0: abcd
data>
re> /(?0)*?.$/BI
------------------------------------------------------------------
Bra
Braminzero
SBra
Recurse
KetRmin
Any
$
Ket
End
------------------------------------------------------------------
Capture group count = 0
Subject length lower bound = 1
data> abcd
Failed: error -46: JIT stack limit reached
data> abcd\=no_jit
Failed: error -52: nested recursion at the same subject position
If the example is correct, the internal recursion is not called for some reason. Probably another typo somewhere. However, I will never add nested recursion check for the jit, it will simply fail with stack exhaustion.
@carenas: See #331; this behaviour is expected to be different, because (?0)
doesn't behave the same way (and shouldn't) when putting a $
at the end vs using endanchored, because (?0)
includes $
in the recursed pattern, whereas endanchored is not part of the pattern. In this case, $
is strictly infinitely recursive/not matching.
Curiously, I do not see /(?0)*?./BI,endanchored
successfully executing without JIT on the git version, but it does on 10.42? The behaviour must have changed between 10.42 and now.
Aha, I just git bisect
ed the revisions: this behaviour difference was introduced in ce5b604ebf7bf8fdd46b3026e19e1ea4792ca3a7!
$ git checkout ce5b604ebf7bf8fdd46b3026e19e1ea4792ca3a7
Note: switching to 'ce5b604ebf7bf8fdd46b3026e19e1ea4792ca3a7'.
$ make && (echo "/(?0)*?./BI,endanchored"; echo "abcd") | ./pcre2test | grep Failed
make all-am
...
Failed: error -52: nested recursion at the same subject position
$ git checkout f5c4eb77f3ee473c85ce46f498a2485e1c73b180
Previous HEAD position was ce5b604 Fix PCRE2_ENDANCHORED behaviour in recursion in pcre2_match(). Fixes #331.
HEAD is now at f5c4eb7 jit: disable ASAN for arm64 SIMD helper (#306)
$ make && (echo "/(?0)*?./BI,endanchored"; echo "abcd") | ./pcre2test | grep Failed
make all-am
...
In the meantime, here's another example of a recursive pattern where the current behaviour is correct and reflects what I think is intended with endanchored:
re> /((?1)*?a)$/
data> aaaa
0: aaaa
1: aaaa
data> aaaa\=no_jit
0: aaaa
1: aaaa
Perl agrees:
$ perl -Mre -e 'print "match\n" if shift =~ /((?1)*?.)$/' abcd
match
I think the behaviour with endanchored is just slightly off is all. It is otherwise consistent.
I think the behaviour with endanchored is just slightly off
not sure if I got my description clearly enough before, but I would instead say "the behaviour with endanchored had a bug that was recently fixed in the interpreter, but that is still slightly off in JIT".
FWIW, I though your previous subject for this ticket was more accurate, specially if you consider there are actually 2 matching engines in PCRE2.
Hey @addisoncrump , you have successfully confused me with these flood of patterns.
re> /|(?0)./endanchored
data> abcd
--->abcd
This is perfectly valid. Thanks to the empty alternative, the recursion can return with a match, which allows continuing with the next token, which is a dot.
The interpreter is overreacting here. It believes that because you match from the same position as before, match is not possible. But jit is simple (I will not change that), and continues the matching, which eventually results with a successful match. The termination condition in the interpreter is useful in the majority of the cases, but in this particular case, it does not work.
Btw I don't think the interpreter will change, since without this check, it may consume all memory and stack of the system, which is not desirable, when an actual infinite recursion is executed. I think recursion limits can be specified, but most applications does not use (aware of) this.
With some more thought (I hate recursion :-) I've realized that this behaviour has changed with a "fix" I recently committed to make the interpreter and JIT match some other case. The fix is wrong. I will now try to figure out what is the right code to make both cases work.
Having removed the previous incorrect patch, the interpreter now matches "d", as it did before recent activity. I do not at the moment know how to fix this because I can no longer remember exactly how the recursion stuff works (hence the previous bad fix). I will keep trying to understand it.
you have successfully confused me with these flood of patterns.
Hah, sorry, I was experimenting with lots of different recursive patterns to find a root cause here. I think the |(?0).
(or, the perl-testable version: (|(?1).)$
) case demonstrates the issue most clearly: the JIT appears to try different depths of recursion, one at a time, by concretising the recursion as the left side of the alternation, then in the second case as the right side of the alternation.
Just to summarise what we know, here are all the patterns I have tested (PCRE2 on master, Perl v5.36.1) (click the dropdowns to view):
The results of testing between Perl and PCRE2 indicate that the interpreter is performing as expected w.r.t. infinite loop detection, even though these cases are not really infinite loops.
In essence, there are three options for remediation:
(?0)
.Thank you for the summary
I have now spent many hours trying to figure out what the interpreter is or should be doing with this pattern. The only benefit has been some additional optional debugging code, but that has not been enough to enlighten me. I am going to give up for the moment and work on other issues. However, my current belief is that both the interpreter and JIT may be "wrong", though perhaps it is I that is "wrong". Anyway, FWIW, here is my analysis of the behaviour of the endanchored pattern:
Pattern is: /|(?0)./endanchored Subject is: abcd
Start match: subject pointer is at "a".
Remember backtrack point 1 (subject is at "a")
Match first alternative
It matches an empty string
Go to end of pattern
Backtrack to 1 because endanchored test fails
Subject is still at "a"
Try the second alternative.
Recurse whole pattern.
Match first alternative
Remember backtrack point 2 (subject is at "a")
Match an empty string
Go to end of pattern
This time there is no endanchored check because we are in a recursion, not at the true end of the pattern.
Therefore the recursion completes.
Continue after (?0)
Match "." and this matches "a"
Subject is now at "b"
Reached end of pattern not in a recursion
Endanchored test fails
WHAT SHOULD HAPPEN (I think)
WHAT DOES HAPPEN
The interpreter backtracks to point 1, not point 2, but there are no more alternatives.
This causes a bumpalong to try again, starting at "b", which fails as before.
Eventually there is a match at "d".
The JIT seems to backtrack without resetting the subject to "a", and eventually gives a match for "abcd".
I don't think this is right because any backtrack following "." in the pattern should withdraw the character that it matched.
For me the nesting of this pattern looks like:
match with 0 level nesting: /|./ match with 1 level nesting: /|(?:|.)./ match with 2 level nesting: /|(?:|(?:|.).)./ match with 3 level nesting: /|(?:|(?:|(?:|.).).)./
The key thing is: when we don't achieve a match with the first (empty) alternative, we recursively run the pattern, and append a character after the recursive call (this is the second alternative). This way this pattern can match to any subject string (if we have enough stack). Another key thing is: the whole thing works because we have an empty alternative, which allows to select the right side of the alternative throughout the recursion.
So jit can match abcd after 4 level of nesting: the 4th level is simply matching to an empty string, which triggers the second alternative of the 3 previous nesting, and the second alternative of the original pattern.
Maybe this way the explanation is easier. The matching of the abcd looks like: first the empty alternative is matched, then the right alternative of the previous recursion is continued which appends a character, then the right alternative of the previous recursion is continued which appends a character, then the right ...
So we need 4 level of recursion, and we eventually reach this (since jit never stops recursions).
I have now refactored the handling of (?0) in pcre2_match() (see 86919c9) and the behaviour is as I would expect, as described above. It gives error -52 (infinite recursion). I'm afraid I still don't understand how the JIT manages to break the loop. Another data point: I constructed a Perl-compatible pattern that I believe should show the same behaviour. Both Perl and the interpreter diagnose a loop; the JIT does not.
...
$ ./perltest.sh zz5
Perl v5.38.0
/(?:|(?0).)(?(R)|\z)/
Error: Infinite recursion in regex at (eval 1) line 1,
$ ./pcre2test zz5 PCRE2 version 10.43-DEV 2023-04-14 (8-bit) /(?:|(?0).)(?(R)|\z)/ abcd Failed: error -52: nested recursion at the same subject position
$ ./pcre2test -jit zz5 PCRE2 version 10.43-DEV 2023-04-14 (8-bit) /(?:|(?0).)(?(R)|\z)/ abcd 0: abcd ...
I am thinking how can I explain it. Think it as a staircase. Each step represents a letter, and each step also represents one more recursion level. Form any stair, if you go down to the bottom, you can read a string which length is the same as the number of steps you go down.
How this pattern solve this problem: first you go one step up, and then go down to the bottom, and realize you cannot match 4 character only one character long this way. So you retry and now you go two steps up, go back, and realize two character is still not four. Then you go 3, 4, ... any character long.
(?:|
- this part allows you stop and go down
(?0).)
- this part is the stair, reads one character, and falls back to the previous stair
Since recursion now supports backtracking, if you reach the bottom, and there is no match, you can climb the stairs again (as an infinite loop).
Thank you for trying to explain it to me. I am afraid I am sufficiently old and stupid that I cannot get my head round it. I just can't see how to retain the matched character when backtracking. Oh, and incidentally, it isn't just the interpreter being aggressive: if I disable the check for repeated recursion at the same point it just recurses and recurses until it hits the match limit. Have you looked at my new example above? Is it the same explanation?
Yes, it is the same. The key point is that the first alternative does nothing, so it matches to everything, and breaks the recursion. The second part adds another recursion level and a character (the dot). Since we support backtracking into recursion, we can do a match for these patterns.
Another explanation idea: the pattern is a tricky variant of /a(?0)b/
which matches a{n}b{n} patterns. The 'a' is empty string and 'b' is "any" character here. This would normally be an infinite loop but adding a tricky '|' alternative breaks this.
Surely /a(?0)b/ never matches anything? It can match "a" then recurse, so it can match "a", then recurse.... but when it runs out of "a"s it must fail. And if "a" is an empty string, there's an infinite loop.
I am not sure I follow. Is it clear how this pattern works? It can be generalized: /(?:|(?0)abcd)/endanchored
matches any abcdabcd...abcd
string.
Basically /(?:|(?0)P)/endanchored
is the same as (P)*
but it requires 1+2+3+...+n = n*(n+1)/2
steps instead of n
steps, which is surely inefficient, but far from an infinite loop. We made an O(n*n)
algorithm for an O(n)
problem.
I am totally confused. We may just have to live with the difference. I just cannot see how /(?:|(?0).)(?(R)|\z)/ (note: without endanchored) is not an infinite loop, and Perl agrees with me and the interpreter. Similarly for /(?:|(?0)abcd)/endanchored. Is anybody else following this discussion? Any other thoughts? I will try to create a diagram of how the interpreter interprets the Perl-compatible pattern.
The steps of /(?:|(?0)abcd)/endanchored
should be:
1) Match empty branch -> if endanchored test pass, we have a match (empty string), otherwise backtrack 2) The second alternative starts a recursion (1st level) 3) The recursion matches to the empty alternative, so the 1st level recursion matches, and the matching of the original pattern continues 4) If abcd matches, and endanchored test pass, we have a match (abcd string), otherwise backtrack 5) The backtrack skips the "abcd", then backtracks to the 1st level recursion, where we matched the empty alternative. The backtrack of empty alternative is the second alternative, which creates another recursion (2nd level). 6) The 2nd level recursion matches the empty alternative, which matches (of course), and continue the match. 7) The continuation in this case is the continuation of the second alternative of the 1st level match 8) We match "abcd", and if it is successful, the matching of the original pattern continues 9) The original pattern matches "abcd" again, and if endanchored test pass, we have a match (abcdabcd string) 10) If not, we backtrack to the 1st level recursion 11) Then we backtrack to the 2nd level recursion 12) Since the empty alternative is exhausted, the second alternative is matched again, which creates another recursion (3rd level) 13) Then the empty alternative matches, and so on...
We try "abcdabcdabcd" this time, and it does not work, we try "abcdabcdabcdabcd" ....
Regex 101 also visualises the match (in pcre2 mode of course :P)
AHA! This is the difference. You say 'The backtrack skips the "abcd"' but that is precisely what does NOT happen in the interpreter and Perl. A backtrack really does go back to the exact state that existed before.
I mean the backtrack of 'abcd' is moving four characters back. In jit this is implemented as doing nothing, because the alternative knows the location where the match started. Is there anything else the interpreter does?
From the perspective of the "abcd" part of the pattern, when "abcdabcdabcdabcd" matches, the jit does this:
match "abcd" revert "abcd" - (we are at string start) match "abcd" - match "abcd" revert "abcd" - revert "abcd" - (we are at string start) match "abcd" - match "abcd" - match "abcd" revert "abcd" - revert "abcd" - revert "abcd" - (we are at string start) match "abcd" - match "abcd" - match "abcd" - match "abcd" - and we have a match
The interpreter also knows the data for the backtracking state, so when it goes back, it finds itself recursing again at the start and hence the infinite loop. Oh dear, I don't think we are getting anywhere. I will email you the description of the interpreter's behaviour, which I've got in a separate file, but I suspect that it will not help.
Based on your mail, the interpreter and jit does exactly the same thing. However, interpreter recognizes, that there is a recursion from the same starting point, which it considers an inifinite loop error. Jit does not care about this, and just does the recursion again. This particular pattern for this particular input will not be an infinite recursion, so it stops eventually with a match. The interpreter check is probably useful in other cases. You are right, maybe we need to live with this difference.
OOOOH! I've just tried suppressing the test in the interpreter that gives the error, and wow, it does do the match. When I tried this before I had patched the interpreter it just looped for ever. This is progress! I wonder if there is some other condition that could be used in the interpreter to let it proceed. I'll think about it, but have to do some other stuff now. Thanks for persisting in looking at this.
I have found a way of making the interpreter be more intelligent, but it requires increasing the size of the backup frames by one pointer (typically 8 bytes). The logic is to remember how much of the subject has been consulted so far; if at a repeat recursion this is different from last time, allow the recursion to continue, because something has changed. A more brute-force approach is just to put a limit on the depth of nesting such recursions, but what limit is sensible? 10? 100? 1000? I would rather increase the frame size. It is currently 128 bytes plus 16 for every capturing group (on a 64-bit system). Increasing it to 136 would only impact matches that use a great deal of heap. What do you think?
tl;dr; it is not worth adding extra logic which slows down matching and creates no value for users
I would leave everything as it is now. I remember we have discussed this a long time ago. Infinite loop patterns has no practical use for users. They will get an error for those patterns, and they know they need to change them. Even if jit has megabytes of stack, infinite patterns exhaust them very fast. So for implementing complex logic which gives no advantage to users is pointless. I know the interpreter works differently, and may exhaust system memory, which can cause issues, so it needs an extra layer of protection. But at those times even with this protection, infinite loop patterns were still possible, so in the end, the maximum recursion limit was the final layer of protection.
The code change is small and I do not believe it will have any effect on performance. For each recursion there is one added statement:
F->lup_recurse = mb->last_used_ptr;
And for each nested recursion, instead of
if (Feptr == P->eptr) return PCRE2_ERROR_RECURSELOOP;
we have
if (Feptr == P->eptr && mb->last_used_ptr == P->lup_recurse) return PCRE2_ERROR_RECURSELOOP;
I think I will make this change because it brings the interpreter into line with the JIT and is also logical because it is a better implementation of "diagnose a loop if there's a nested recursion at the same point in the subject and nothing else has changed".
I will test this change now. Thank you for your persistence on this!
So, I tested again; it seems the fix is not yet complete:
$ ./pcre2test -jit
PCRE2 version 10.43-DEV 2023-04-14 (8-bit)
re> /|(?0)a/endanchored,auto_callout
data> aaaa
--->aaaa
+0 ^ |
+1 ^ (?0)
+0 ^ |
+5 ^ a
+6 ^^ End of pattern
+1 ^ (?0)
+0 ^ |
+5 ^ a
+6 ^^ End of pattern
+5 ^^ a
+6 ^ ^ End of pattern
+1 ^ (?0)
+0 ^ |
+5 ^ a
+6 ^^ End of pattern
+5 ^^ a
+6 ^ ^ End of pattern
+5 ^ ^ a
+6 ^ ^ End of pattern
+1 ^ (?0)
+0 ^ |
+5 ^ a
+6 ^^ End of pattern
+5 ^^ a
+6 ^ ^ End of pattern
+5 ^ ^ a
+6 ^ ^ End of pattern
+5 ^ ^ a
+6 ^ ^ End of pattern
0: aaaa
data> aaaa\=no_jit
--->aaaa
+0 ^ |
+1 ^ (?0)
+0 ^ |
+5 ^ a
+6 ^^ End of pattern
+1 ^ (?0)
Failed: error -52: nested recursion at the same subject position
Er, no, I hadn't put in the final fix. I've just committed 8e83acc which should bring pcre2_match() into line with the JIT. I'm glad we've finally resolved this issue. It took my aged brain an age to realize that it was the different coding for (?0) recursion versus (?1) etc that was the main problem. This occurred because, when PCRE1 was first implemented and there was no recursion, I made it compile each pattern as a non-capturing group. If I had made it into capturing group 0 it might have made things easier later, but hindsight is a wonderful thing and it's too late now.
Oops, sorry, I'll test that now, then.
Discovered by #322.
Found another case where
(?0)
is improperly handled:I think this is related, if not a corner case of, #331.