Incorrect fix for (?0) with endanchored

addisoncrump commented 1 year ago

Discovered by #322.

Found another case where (?0) is improperly handled:

  re> /(?0)*./endanchored,ungreedy
data> abcd
 0: abcd
data> abcd\=no_jit
Failed: error -52: nested recursion at the same subject position

I think this is related, if not a corner case of, #331.

carenas commented 1 year ago

I know this is coming from a fuzzer, but any random combination of symbols are not a valid regular expression, and IMHO "undefined behaviour" is a valid response, as PCRE2 is not a "validating" compiler.

What does the expression even mean?, and while you might be right that this might be a case where endanchored is not properly handled (at least by one of the matching engines), I am not sure if the added complexity actually help with real use cases.

zherczeg commented 1 year ago

There is no undefined behaviour here. This pattern is simply an infinite loop, which can be detected at compile time. Of course the compiler cannot detect all cases, but it detects many. This is useful for the user since the pattern is probably wrong.

addisoncrump commented 1 year ago

any random combination of symbols are not a valid regular expression

Understood, which is why I am filtering and manually reducing these testcases to representative, smallest reproducing testcases. As for why inconsistent behaviour is undesirable: this fuzzing strategy allows us to automatically and systematically discover incorrectness in either representation based on the other. While many of these bugs appear to be wild atypical usages on their own, resolving these issues allows for the automatic discovery of truly problematic cases, even if they do not appear in the test suite (see: #336).

IMO, if the behaviour is undefined, it should not be allowed. There is no reason to offer it to end users if there is no intent to keep it consistent. If it was truly the case that endanchored should not be used with recursive patterns, then it should produce a compile error.

PhilipHazel commented 1 year ago

I agree with @zherczeg that this is an infinite loop. The interpreter diagnoses this, but JIT does not. Is this a bug in JIT?

addisoncrump commented 1 year ago

Here's a non-infinite-loop version with the same behaviour:

  re> /|(?0)./auto_callout,endanchored
data> abcd
--->abcd
 +0 ^        |
 +1 ^        (?0)
 +0 ^        |
 +5 ^        .
 +6 ^^       End of pattern
 +1 ^        (?0)
 +0 ^        |
 +5 ^        .
 +6 ^^       End of pattern
 +5 ^^       .
 +6 ^ ^      End of pattern
 +1 ^        (?0)
 +0 ^        |
 +5 ^        .
 +6 ^^       End of pattern
 +5 ^^       .
 +6 ^ ^      End of pattern
 +5 ^ ^      .
 +6 ^  ^     End of pattern
 +1 ^        (?0)
 +0 ^        |
 +5 ^        .
 +6 ^^       End of pattern
 +5 ^^       .
 +6 ^ ^      End of pattern
 +5 ^ ^      .
 +6 ^  ^     End of pattern
 +5 ^  ^     .
 +6 ^   ^    End of pattern
 0: abcd
data> abcd\=no_jit
--->abcd
 +0 ^        |
 +1 ^        (?0)
 +0 ^        |
 +1 ^        (?0)
Failed: error -52: nested recursion at the same subject position

addisoncrump commented 1 year ago

It does seem that the JIT does exhibit some infinite loop behaviour on /|(?0)/endanchored, though. It does terminate ("JIT stack limit reached").

zherczeg commented 1 year ago

This is something I will definitely not change. It will fail with stack exhaustion.

addisoncrump commented 1 year ago

I note that the behaviour of the interpreter and the JIT also differs here; see especially:

data> abcd
--->abcd
 +0 ^        |
 +1 ^        (?0)
 +0 ^        |
 +5 ^        .

vs

data> abcd\=no_jit
--->abcd
 +0 ^        |
 +1 ^        (?0)
 +0 ^        |
 +1 ^        (?0)

Is this difference expected/permissible? It seems that any left-recursing pattern will pass in JIT, but fail in the interpreter. Does the JIT strip out (?0) when recursing inwards, then only if the match fails, recurses inward again?

carenas commented 1 year ago

I think it goes back to what I was saying originally (albeit not in a clear enough way) and that I though we needed to clarify to align better with expectations.

PCRE2 detects MOST misuses and returns an error that could be helpful to a user to identify the problem they had with their expressions, but it doesn't validate that the given expression is proper, so then it might end up showing different behaviour between the interpreter and JIT and that might not be a bug (specially when the only difference is the error code returned).

FWIW, that previous comment might not apply to this specific case, where we might have a bug, and Perl returns an error from a "translated" expression, which ironically also shows an error if applied to PCRE2.

% perl -Mre -e 'print "match\n" if shift =~ /(?0)*?.$/' abcd
Infinite recursion in regex at -e line 1.
% pcre2test -jit 
PCRE2 version 10.42 2022-12-11
  re> /(?0)*?.$/
data> abcd
Failed: error -46: JIT stack limit reached
data> abcd\=no_jit
Failed: error -52: nested recursion at the same subject position
data>
 re> /(?0)*.$/ungreedy
data> abcd
Failed: error -46: JIT stack limit reached
data> abcd\=no_jit
Failed: error -52: nested recursion at the same subject position

but generates different code when endanchored is involved, although also "consistent"


  re> /(?0)*?./BI,endanchored
------------------------------------------------------------------
        Bra
        Braminzero
        SBra
        Recurse
        KetRmin
        Any
        Ket
        End
------------------------------------------------------------------
Capture group count = 0
Options: endanchored
Subject length lower bound = 1
data> abcd
 0: abcd
data> abcd\=no_jit
 0: abcd
data> 
  re> /(?0)*?.$/BI
------------------------------------------------------------------
        Bra
        Braminzero
        SBra
        Recurse
        KetRmin
        Any
        $
        Ket
        End
------------------------------------------------------------------
Capture group count = 0
Subject length lower bound = 1
data> abcd
Failed: error -46: JIT stack limit reached
data> abcd\=no_jit
Failed: error -52: nested recursion at the same subject position

zherczeg commented 1 year ago

If the example is correct, the internal recursion is not called for some reason. Probably another typo somewhere. However, I will never add nested recursion check for the jit, it will simply fail with stack exhaustion.

addisoncrump commented 1 year ago

@carenas: See #331; this behaviour is expected to be different, because (?0) doesn't behave the same way (and shouldn't) when putting a $ at the end vs using endanchored, because (?0) includes $ in the recursed pattern, whereas endanchored is not part of the pattern. In this case, $ is strictly infinitely recursive/not matching.

Curiously, I do not see /(?0)*?./BI,endanchored successfully executing without JIT on the git version, but it does on 10.42? The behaviour must have changed between 10.42 and now.

addisoncrump commented 1 year ago

Aha, I just git bisected the revisions: this behaviour difference was introduced in ce5b604ebf7bf8fdd46b3026e19e1ea4792ca3a7!

$ git checkout ce5b604ebf7bf8fdd46b3026e19e1ea4792ca3a7
Note: switching to 'ce5b604ebf7bf8fdd46b3026e19e1ea4792ca3a7'.
$ make && (echo "/(?0)*?./BI,endanchored"; echo "abcd") | ./pcre2test | grep Failed
make  all-am
...
Failed: error -52: nested recursion at the same subject position
$ git checkout f5c4eb77f3ee473c85ce46f498a2485e1c73b180
Previous HEAD position was ce5b604 Fix PCRE2_ENDANCHORED behaviour in recursion in pcre2_match(). Fixes #331.
HEAD is now at f5c4eb7 jit: disable ASAN for arm64 SIMD helper (#306)
$ make && (echo "/(?0)*?./BI,endanchored"; echo "abcd") | ./pcre2test | grep Failed
make  all-am
...

addisoncrump commented 1 year ago

In the meantime, here's another example of a recursive pattern where the current behaviour is correct and reflects what I think is intended with endanchored:

  re> /((?1)*?a)$/   
data> aaaa
 0: aaaa
 1: aaaa
data> aaaa\=no_jit
 0: aaaa
 1: aaaa

Perl agrees:

$ perl -Mre -e 'print "match\n" if shift =~ /((?1)*?.)$/' abcd
match

I think the behaviour with endanchored is just slightly off is all. It is otherwise consistent.

carenas commented 1 year ago

I think the behaviour with endanchored is just slightly off

not sure if I got my description clearly enough before, but I would instead say "the behaviour with endanchored had a bug that was recently fixed in the interpreter, but that is still slightly off in JIT".

FWIW, I though your previous subject for this ticket was more accurate, specially if you consider there are actually 2 matching engines in PCRE2.

zherczeg commented 1 year ago

Hey @addisoncrump , you have successfully confused me with these flood of patterns.

 re> /|(?0)./endanchored
data> abcd
--->abcd

This is perfectly valid. Thanks to the empty alternative, the recursion can return with a match, which allows continuing with the next token, which is a dot.

The interpreter is overreacting here. It believes that because you match from the same position as before, match is not possible. But jit is simple (I will not change that), and continues the matching, which eventually results with a successful match. The termination condition in the interpreter is useful in the majority of the cases, but in this particular case, it does not work.

zherczeg commented 1 year ago

Btw I don't think the interpreter will change, since without this check, it may consume all memory and stack of the system, which is not desirable, when an actual infinite recursion is executed. I think recursion limits can be specified, but most applications does not use (aware of) this.

PhilipHazel commented 1 year ago

With some more thought (I hate recursion :-) I've realized that this behaviour has changed with a "fix" I recently committed to make the interpreter and JIT match some other case. The fix is wrong. I will now try to figure out what is the right code to make both cases work.

PhilipHazel commented 1 year ago

Having removed the previous incorrect patch, the interpreter now matches "d", as it did before recent activity. I do not at the moment know how to fix this because I can no longer remember exactly how the recursion stuff works (hence the previous bad fix). I will keep trying to understand it.

addisoncrump commented 1 year ago

you have successfully confused me with these flood of patterns.

Hah, sorry, I was experimenting with lots of different recursive patterns to find a root cause here. I think the |(?0). (or, the perl-testable version: (|(?1).)$) case demonstrates the issue most clearly: the JIT appears to try different depths of recursion, one at a time, by concretising the recursion as the left side of the alternation, then in the second case as the right side of the alternation.

Just to summarise what we know, here are all the patterns I have tested (PCRE2 on master, Perl v5.36.1) (click the dropdowns to view):

Not Perl-compatible (`(?0)` with endanchored)

Consistent between JIT and interpreter

Case 1: Whole-pattern recursion, endanchored, non-greedy zero-to-many recursions after: ``` re> /.(?0)*?/endanchored data> abcd 0: abcd data> abcd\=no_jit 0: abcd ``` Case 2: Whole-pattern recursion, endanchored, non-greedy zero-to-many recursions before (this issue): ``` re> /(?0)*?./endanchored data> abcd 0: abcd data> abcd\=no_jit 0: abcd ```

Inconsistent between JIT and interpreter

Case 3: Whole-pattern recursion, endanchored, recursing after (#331): ``` re> /|.(?0)/endanchored data> abcd 0: abcd data> abcd\=no_jit 0: d ``` Case 4: Whole-pattern recursion, endanchored, recursing before: ``` re> /|(?0)./endanchored data> abcd 0: abcd data> abcd\=no_jit 0: d ``` Case 5: Whole-pattern recursion, endanchored, non-greedy zero-or-one recursions after: ``` re> /.(?0)??/endanchored data> abcd 0: abcd data> abcd\=no_jit 0: cd ``` Case 6: Whole-pattern recursion, endanchored, non-greedy zero-or-one recursions before: ``` re> /(?0)??./endanchored data> abcd 0: abcd data> abcd\=no_jit 0: cd ```

Perl-compatible (`(?1)` with trailing `$`)

Consistent between everything

Case 7: Capture group recursion, explicit end anchor, recursion after: ``` re> /(|.(?1))$/ data> abcd 0: abcd 1: abcd data> abcd\=no_jit 0: abcd 1: abcd $ perl -Mre -e 'print "match\n" if shift =~ /(|.(?1))$/' abcd match ``` Case 8: Capture group recursion, explicit end anchor, non-greedy zero-to-many recursions after: ``` re> /(.(?1)*?)$/ data> abcd 0: abcd 1: abcd data> abcd\=no_jit 0: abcd 1: abcd $ perl -Mre -e 'print "match\n" if shift =~ /(.(?1)*?)$/' abcd match ``` Case 9: Capture group recursion, explicit end anchor, non-greedy zero-to-many recursions before: ``` re> /((?1)*?.)$/ data> abcd 0: abcd 1: abcd data> abcd\=no_jit 0: abcd 1: abcd $ perl -Mre -e 'print "match\n" if shift =~ /((?1)*?.)$/' abcd match ``` Case 10: Capture group recursion, explicit end anchor, non-greedy zero-or-one recursions after: ``` re> /(.(?1)??)$/ data> abcd 0: abcd 1: abcd data> abcd\=no_jit 0: abcd 1: abcd $ perl -Mre -e 'print "match\n" if shift =~ /(.(?1)??)$/' abcd match ```

Consistent between interpreter and Perl, but not JIT

Case 11: Capture group recursion, explicit end anchor, recursion before: ``` re> /(|(?1).)$/ data> abcd 0: abcd 1: abcd data> abcd\=no_jit Failed: error -52: nested recursion at the same subject position $ perl -Mre -e 'print "match\n" if shift =~ /(|(?1).)$/' abcd Infinite recursion in regex at -e line 1. ``` Case 12: Capture group recursion, explicit end anchor, non-greedy zero-or-one recursions before: ``` re> /((?1)??.)$/ data> abcd 0: abcd 1: abcd data> abcd\=no_jit Failed: error -52: nested recursion at the same subject position $ perl -Mre -e 'print "match\n" if shift =~ /((?1)??.)$/' abcd Infinite recursion in regex at -e line 1. ```

The results of testing between Perl and PCRE2 indicate that the interpreter is performing as expected w.r.t. infinite loop detection, even though these cases are not really infinite loops.

In essence, there are three options for remediation:

If one-to-one behaviour with Perl is desired, then cases 1, 2, 3, and 5 should match (see cases 8, 9, 7, and 10, respectively) and cases 4 and 6 should report infinite recursion (see cases 11 and 12, respectively). Note especially the difference between case 9 and 12, where Perl matches on case 9 but fails in case 12. No changes need to be made to recursion not including (?0).
If one-to-one behaviour with JIT is desired, then all cases should be matched including cases 11 and 12, which may be too significant of a change to be justified.
If recursion is to be treated as potential UB w.r.t. detection of infinite recursion and total one-to-one behaviour between interpreter and any other implementation is not desired/necessary, then the fix for #331 may be applied with no further changes. However, given case 9 (see corresponding case 2), it seems that there is still some difference between the behaviour of recursion with endanchored vs capture group recursion and explicit end anchor, which I would expect to be equivalent.

zherczeg commented 1 year ago

Thank you for the summary

PhilipHazel commented 1 year ago

I have now spent many hours trying to figure out what the interpreter is or should be doing with this pattern. The only benefit has been some additional optional debugging code, but that has not been enough to enlighten me. I am going to give up for the moment and work on other issues. However, my current belief is that both the interpreter and JIT may be "wrong", though perhaps it is I that is "wrong". Anyway, FWIW, here is my analysis of the behaviour of the endanchored pattern:

Pattern is: /|(?0)./endanchored Subject is: abcd

Start match: subject pointer is at "a".
Remember backtrack point 1 (subject is at "a")
Match first alternative
It matches an empty string
Go to end of pattern
Backtrack to 1 because endanchored test fails
Subject is still at "a"
Try the second alternative.
Recurse whole pattern.
Match first alternative
Remember backtrack point 2 (subject is at "a")
Match an empty string
Go to end of pattern
This time there is no endanchored check because we are in a recursion, not at the true end of the pattern.
Therefore the recursion completes.
Continue after (?0)
Match "." and this matches "a"
Subject is now at "b"
Reached end of pattern not in a recursion
Endanchored test fails

WHAT SHOULD HAPPEN (I think)

What I think should now happen is a backtrack to point 2, which resets the subject pointer to "a". This is now a nested recursion at the same point in the subject, which the interpreter will diagnose with error -52.

WHAT DOES HAPPEN

The interpreter backtracks to point 1, not point 2, but there are no more alternatives.
This causes a bumpalong to try again, starting at "b", which fails as before.
Eventually there is a match at "d".
The JIT seems to backtrack without resetting the subject to "a", and eventually gives a match for "abcd".
I don't think this is right because any backtrack following "." in the pattern should withdraw the character that it matched.

zherczeg commented 1 year ago

For me the nesting of this pattern looks like:

match with 0 level nesting: /|./ match with 1 level nesting: /|(?:|.)./ match with 2 level nesting: /|(?:|(?:|.).)./ match with 3 level nesting: /|(?:|(?:|(?:|.).).)./

The key thing is: when we don't achieve a match with the first (empty) alternative, we recursively run the pattern, and append a character after the recursive call (this is the second alternative). This way this pattern can match to any subject string (if we have enough stack). Another key thing is: the whole thing works because we have an empty alternative, which allows to select the right side of the alternative throughout the recursion.

So jit can match abcd after 4 level of nesting: the 4th level is simply matching to an empty string, which triggers the second alternative of the 3 previous nesting, and the second alternative of the original pattern.

zherczeg commented 1 year ago

Maybe this way the explanation is easier. The matching of the abcd looks like: first the empty alternative is matched, then the right alternative of the previous recursion is continued which appends a character, then the right alternative of the previous recursion is continued which appends a character, then the right ...

So we need 4 level of recursion, and we eventually reach this (since jit never stops recursions).

PhilipHazel commented 12 months ago

I have now refactored the handling of (?0) in pcre2_match() (see 86919c9) and the behaviour is as I would expect, as described above. It gives error -52 (infinite recursion). I'm afraid I still don't understand how the JIT manages to break the loop. Another data point: I constructed a Perl-compatible pattern that I believe should show the same behaviour. Both Perl and the interpreter diagnose a loop; the JIT does not.

... $ ./perltest.sh zz5 Perl v5.38.0 /(?:|(?0).)(?(R)|\z)/ Error: Infinite recursion in regex at (eval 1) line 1, line 1.

$ ./pcre2test zz5 PCRE2 version 10.43-DEV 2023-04-14 (8-bit) /(?:|(?0).)(?(R)|\z)/ abcd Failed: error -52: nested recursion at the same subject position

$ ./pcre2test -jit zz5 PCRE2 version 10.43-DEV 2023-04-14 (8-bit) /(?:|(?0).)(?(R)|\z)/ abcd 0: abcd ...

zherczeg commented 12 months ago

I am thinking how can I explain it. Think it as a staircase. Each step represents a letter, and each step also represents one more recursion level. Form any stair, if you go down to the bottom, you can read a string which length is the same as the number of steps you go down.

How this pattern solve this problem: first you go one step up, and then go down to the bottom, and realize you cannot match 4 character only one character long this way. So you retry and now you go two steps up, go back, and realize two character is still not four. Then you go 3, 4, ... any character long.

(?:| - this part allows you stop and go down (?0).) - this part is the stair, reads one character, and falls back to the previous stair

Since recursion now supports backtracking, if you reach the bottom, and there is no match, you can climb the stairs again (as an infinite loop).

PhilipHazel commented 12 months ago

Thank you for trying to explain it to me. I am afraid I am sufficiently old and stupid that I cannot get my head round it. I just can't see how to retain the matched character when backtracking. Oh, and incidentally, it isn't just the interpreter being aggressive: if I disable the check for repeated recursion at the same point it just recurses and recurses until it hits the match limit. Have you looked at my new example above? Is it the same explanation?

zherczeg commented 12 months ago

Yes, it is the same. The key point is that the first alternative does nothing, so it matches to everything, and breaks the recursion. The second part adds another recursion level and a character (the dot). Since we support backtracking into recursion, we can do a match for these patterns.

zherczeg commented 12 months ago

Another explanation idea: the pattern is a tricky variant of /a(?0)b/ which matches a{n}b{n} patterns. The 'a' is empty string and 'b' is "any" character here. This would normally be an infinite loop but adding a tricky '|' alternative breaks this.

PhilipHazel commented 12 months ago

Surely /a(?0)b/ never matches anything? It can match "a" then recurse, so it can match "a", then recurse.... but when it runs out of "a"s it must fail. And if "a" is an empty string, there's an infinite loop.

zherczeg commented 12 months ago

I am not sure I follow. Is it clear how this pattern works? It can be generalized: /(?:|(?0)abcd)/endanchored matches any abcdabcd...abcd string.

zherczeg commented 12 months ago

Basically /(?:|(?0)P)/endanchored is the same as (P)* but it requires 1+2+3+...+n = n*(n+1)/2 steps instead of n steps, which is surely inefficient, but far from an infinite loop. We made an O(n*n) algorithm for an O(n) problem.

PhilipHazel commented 12 months ago

I am totally confused. We may just have to live with the difference. I just cannot see how /(?:|(?0).)(?(R)|\z)/ (note: without endanchored) is not an infinite loop, and Perl agrees with me and the interpreter. Similarly for /(?:|(?0)abcd)/endanchored. Is anybody else following this discussion? Any other thoughts? I will try to create a diagram of how the interpreter interprets the Perl-compatible pattern.

zherczeg commented 12 months ago

The steps of /(?:|(?0)abcd)/endanchored should be:

1) Match empty branch -> if endanchored test pass, we have a match (empty string), otherwise backtrack 2) The second alternative starts a recursion (1st level) 3) The recursion matches to the empty alternative, so the 1st level recursion matches, and the matching of the original pattern continues 4) If abcd matches, and endanchored test pass, we have a match (abcd string), otherwise backtrack 5) The backtrack skips the "abcd", then backtracks to the 1st level recursion, where we matched the empty alternative. The backtrack of empty alternative is the second alternative, which creates another recursion (2nd level). 6) The 2nd level recursion matches the empty alternative, which matches (of course), and continue the match. 7) The continuation in this case is the continuation of the second alternative of the 1st level match 8) We match "abcd", and if it is successful, the matching of the original pattern continues 9) The original pattern matches "abcd" again, and if endanchored test pass, we have a match (abcdabcd string) 10) If not, we backtrack to the 1st level recursion 11) Then we backtrack to the 2nd level recursion 12) Since the empty alternative is exhausted, the second alternative is matched again, which creates another recursion (3rd level) 13) Then the empty alternative matches, and so on...

We try "abcdabcdabcd" this time, and it does not work, we try "abcdabcdabcdabcd" ....

zherczeg commented 12 months ago

Regex 101 also visualises the match (in pcre2 mode of course :P)

https://regex101.com/r/kEvqXg/1

PhilipHazel commented 12 months ago

AHA! This is the difference. You say 'The backtrack skips the "abcd"' but that is precisely what does NOT happen in the interpreter and Perl. A backtrack really does go back to the exact state that existed before.

zherczeg commented 12 months ago

I mean the backtrack of 'abcd' is moving four characters back. In jit this is implemented as doing nothing, because the alternative knows the location where the match started. Is there anything else the interpreter does?

zherczeg commented 12 months ago

From the perspective of the "abcd" part of the pattern, when "abcdabcdabcdabcd" matches, the jit does this:

match "abcd" revert "abcd" - (we are at string start) match "abcd" - match "abcd" revert "abcd" - revert "abcd" - (we are at string start) match "abcd" - match "abcd" - match "abcd" revert "abcd" - revert "abcd" - revert "abcd" - (we are at string start) match "abcd" - match "abcd" - match "abcd" - match "abcd" - and we have a match

PhilipHazel commented 12 months ago

The interpreter also knows the data for the backtracking state, so when it goes back, it finds itself recursing again at the start and hence the infinite loop. Oh dear, I don't think we are getting anywhere. I will email you the description of the interpreter's behaviour, which I've got in a separate file, but I suspect that it will not help.

zherczeg commented 12 months ago

Based on your mail, the interpreter and jit does exactly the same thing. However, interpreter recognizes, that there is a recursion from the same starting point, which it considers an inifinite loop error. Jit does not care about this, and just does the recursion again. This particular pattern for this particular input will not be an infinite recursion, so it stops eventually with a match. The interpreter check is probably useful in other cases. You are right, maybe we need to live with this difference.

PhilipHazel commented 12 months ago

OOOOH! I've just tried suppressing the test in the interpreter that gives the error, and wow, it does do the match. When I tried this before I had patched the interpreter it just looped for ever. This is progress! I wonder if there is some other condition that could be used in the interpreter to let it proceed. I'll think about it, but have to do some other stuff now. Thanks for persisting in looking at this.

PhilipHazel commented 12 months ago

I have found a way of making the interpreter be more intelligent, but it requires increasing the size of the backup frames by one pointer (typically 8 bytes). The logic is to remember how much of the subject has been consulted so far; if at a repeat recursion this is different from last time, allow the recursion to continue, because something has changed. A more brute-force approach is just to put a limit on the depth of nesting such recursions, but what limit is sensible? 10? 100? 1000? I would rather increase the frame size. It is currently 128 bytes plus 16 for every capturing group (on a 64-bit system). Increasing it to 136 would only impact matches that use a great deal of heap. What do you think?

zherczeg commented 12 months ago

tl;dr; it is not worth adding extra logic which slows down matching and creates no value for users

I would leave everything as it is now. I remember we have discussed this a long time ago. Infinite loop patterns has no practical use for users. They will get an error for those patterns, and they know they need to change them. Even if jit has megabytes of stack, infinite patterns exhaust them very fast. So for implementing complex logic which gives no advantage to users is pointless. I know the interpreter works differently, and may exhaust system memory, which can cause issues, so it needs an extra layer of protection. But at those times even with this protection, infinite loop patterns were still possible, so in the end, the maximum recursion limit was the final layer of protection.

PhilipHazel commented 12 months ago

The code change is small and I do not believe it will have any effect on performance. For each recursion there is one added statement:

F->lup_recurse = mb->last_used_ptr;

And for each nested recursion, instead of

if (Feptr == P->eptr) return PCRE2_ERROR_RECURSELOOP;

we have

if (Feptr == P->eptr && mb->last_used_ptr == P->lup_recurse) return PCRE2_ERROR_RECURSELOOP;

I think I will make this change because it brings the interpreter into line with the JIT and is also logical because it is a better implementation of "diagnose a loop if there's a nested recursion at the same point in the subject and nothing else has changed".

addisoncrump commented 12 months ago

I will test this change now. Thank you for your persistence on this!

addisoncrump commented 12 months ago

So, I tested again; it seems the fix is not yet complete:

$ ./pcre2test -jit
PCRE2 version 10.43-DEV 2023-04-14 (8-bit)
  re> /|(?0)a/endanchored,auto_callout
data> aaaa
--->aaaa
 +0 ^        |
 +1 ^        (?0)
 +0 ^        |
 +5 ^        a
 +6 ^^       End of pattern
 +1 ^        (?0)
 +0 ^        |
 +5 ^        a
 +6 ^^       End of pattern
 +5 ^^       a
 +6 ^ ^      End of pattern
 +1 ^        (?0)
 +0 ^        |
 +5 ^        a
 +6 ^^       End of pattern
 +5 ^^       a
 +6 ^ ^      End of pattern
 +5 ^ ^      a
 +6 ^  ^     End of pattern
 +1 ^        (?0)
 +0 ^        |
 +5 ^        a
 +6 ^^       End of pattern
 +5 ^^       a
 +6 ^ ^      End of pattern
 +5 ^ ^      a
 +6 ^  ^     End of pattern
 +5 ^  ^     a
 +6 ^   ^    End of pattern
 0: aaaa
data> aaaa\=no_jit
--->aaaa
 +0 ^        |
 +1 ^        (?0)
 +0 ^        |
 +5 ^        a
 +6 ^^       End of pattern
 +1 ^        (?0)
Failed: error -52: nested recursion at the same subject position

PhilipHazel commented 12 months ago

Er, no, I hadn't put in the final fix. I've just committed 8e83acc which should bring pcre2_match() into line with the JIT. I'm glad we've finally resolved this issue. It took my aged brain an age to realize that it was the different coding for (?0) recursion versus (?1) etc that was the main problem. This occurred because, when PCRE1 was first implemented and there was no recursion, I made it compile each pattern as a non-capturing group. If I had made it into capturing group 0 it might have made things easier later, but hindsight is a wonderful thing and it's too late now.

addisoncrump commented 12 months ago

Oops, sorry, I'll test that now, then.

PCRE2Project / pcre2

Incorrect fix for (?0) with endanchored #334