PCRE2Project / pcre2

PCRE2 development is now based here.
Other
922 stars 194 forks source link

JIT/non-JIT match difference with anchored regexes using (*SKIP) #326

Closed addisoncrump closed 11 months ago

addisoncrump commented 1 year ago

Discovered by #322.

The pattern A(*SKIP)A| (number of As on either side may be replaced by any non-zero amount of literals) inconsistently matches itself (and I think any string) in combination with the anchored flag being set:

sh-5.2$ xxd skip_crash 
00000000: 0000 0000 0000 0080 4128 2a53 4b49 5029  ........A(*SKIP)
00000010: 417c                                     A|
sh-5.2$ ./pcre2_fuzzer skip_crash 
Encountered failure while performing match errorcode comparison; context:
Pattern/sample string (hex encoded): 41282a534b495029417c
Compile options 80100000 never_backslash_c,anchored
Match options 00002000
Non-JIT'd operation emitted an error: no match (-1)
JIT'd operation did not emit an error.
1 matches discovered by JIT'd regex:
Match 0 (hex encoded): 

I suspect that in this case, the JIT code emitted for SKIP does not consider anchoring, but I haven't investigated further.

PhilipHazel commented 1 year ago

I think this might be a JIT bug. There is definitely something wrong, as shown by this test:

$ ./pcre2test -jit zz PCRE2 version 10.43-DEV 2023-04-14 (8-bit) /A(SKIP)A|/anchored A(SKIP)A|\=aftertext 0: 0+ (*SKIP)A|

The "aftertext" option causes pcre2test to output the text that follows the match and the output shows that it is matching an empty string following the first A, which must be wrong because the pattern is anchored.

addisoncrump commented 1 year ago

Potentially related case that the fuzzer spit out just now:

  re> /(*ACCEPT)*j/endanchored
data> j\=notempty
No match
data> j\=no_jit,notempty
 0: j
zherczeg commented 1 year ago

For me (*ACCEPT)* looks like a parse error. Maybe some perl mess.

  re> /(*ACCEPT)*j/B
------------------------------------------------------------------
        Bra
        Brazero
        SBra
        *ACCEPT
        KetRmax
        j
        Ket
        End
------------------------------------------------------------------

It looks like it is /(?:(ACCEPT))j/, and that should be valid.

zherczeg commented 1 year ago

Anyway, j cannot be matched since (*ACCEPT) stops matching the pattern, and the result is always an empty string here. With notempty the pattern never matches.

PhilipHazel commented 1 year ago

I think (ACCEPT) should provoke an error "qualifier does not follow a repeatable item" just like (SKIP) does, though oddly Perl does not give an error for these.

zherczeg commented 1 year ago

I agree. So there are two issues here: one is parser related, the other is execution (since /(?:(ACCEPT))j/ is valid, and generates the same code as /(*ACCEPT)*j/)

addisoncrump commented 1 year ago

Another case with THEN for the repertoire:

$ ./pcre2test -jit
PCRE2 version 10.43-DEV 2023-04-14 (8-bit)
  re> /(?<!(*THEN)a|)/
data> b
 0: 
data> b\=no_jit
No match
addisoncrump commented 1 year ago

Finally with PRUNE:


$ ~/git/pcre2/pcre2test -jit
PCRE2 version 10.43-DEV 2023-04-14 (8-bit)
  re> /.*(*PRUNE)a|/  
data> a
 0: 
data> a\=no_jit
No match
PhilipHazel commented 1 year ago

On checking the history, I found that quantifying (ACCEPT) is explicitly allowed. The ChangeLog entry for 10.34 says this: "Allow (ACCEPT) to be quantified, because an ungreedy quantifier with a zero minimum is potentially useful." Also, it seems that recent changes have brought JIT and the interpreter into agreement for the (*ACCEPT) example:

/(ACCEPT)j/endanchored j\=notempty No match

/(ACCEPT)j/endanchored,jitverify j\=notempty No match (JIT)

The (PRUNE) example is also OK. (THEN) is still an issue that seems to be an interpreter bug; I'm looking into it.

PhilipHazel commented 1 year ago

I think the interpreter is correct here:

/(?<!(*THEN)a|)/ b No match

/(?<!(*THEN)a|)/jitverify b 0: (JIT)

(*THEN) at the start of a branch is effectively a no-op. The lookbehind should match an empty string, but it's a negative lookbehind so it should therefore fail. And I can confirm that Perl agrees with the interpreter. So I think this is a JIT issue.

zherczeg commented 1 year ago

Fixed in e76eae6

Another victim of vreverse. I knew it will be painful.

addisoncrump commented 1 year ago

Fix is incomplete:

$ ./pcre2test -jit
PCRE2 version 10.43-DEV 2023-04-14 (8-bit)
  re> /A(*SKIP)\n|/anchored
data> AB\n
 0: 
data> AB\n\=no_jit
No match
zherczeg commented 1 year ago

I just fixed the last test, this is unrelated. Btw why do you combine skip with anchored? They do the opposite things.

addisoncrump commented 1 year ago

I am automatically looking for these cases with a fuzzer. Handling these edge cases now means that any future changes which cause JIT/interpreter to be inconsistent will automatically be flagged by OSS-Fuzz, making it effectively an automated correctness testing tool.

PhilipHazel commented 1 year ago

I has occurred to me that there is a problem with this. From time to time there are changes that affect both the interpreter and the JIT, and they are not usually done simultaneously. So there is a period during which the two are out of step. If, for example, I change the interpreter (as happened recently with some UTF changes), I mark the relevant tests \=no_jit so that the tests all pass. Then when Zoltan makes the JIT changes, that flag can be removed. Your new automatic tests are likely to throw up unnecessary reports in this scenario. Do we care? I don't know.

On Sun, 26 Nov 2023 at 06:23, Addison Crump @.***> wrote:

I am automatically looking for these cases with a fuzzer. Handling these edge cases now means that any future changes which cause JIT/interpreter to be inconsistent will automatically be flagged by OSS-Fuzz, making it effectively an automated correctness testing tool.

— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/issues/326#issuecomment-1826642834, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUAHDUCTGGKFZGU5EJIDYGLN7XAVCNFSM6AAAAAA7FKI4OWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRWGY2DEOBTGQ . You are receiving this because you commented.Message ID: @.***>

zherczeg commented 1 year ago

My question is what to do with (*skip) and anchored. As for me these two does not work together, probably a user error somewhere, so I would prefer a syntax (compile) error here. If we want to support this, what should we do? Is it the same as a (*commit) in this case?

PhilipHazel commented 1 year ago

Surely (SKIP) is essentially the same as (FAIL) because there is only one match attempt. That is the effect of how the interpreter treats it. (*SKIP) passes back "start the next match attempt here", but because the pattern is anchored, there is no next match, and the result is "no match". Perl seems to do the same. Giving a compile error isn't straightforward because of auto-anchoring, use of ^ and \A etc., possibly in different alternatives.

addisoncrump commented 1 year ago

From time to time there are changes that affect both the interpreter and the JIT, and they are not usually done simultaneously.

There are two methods by which we can approach handling this case.

  1. If the feature is merely detected as unsupported by the JIT and the relevant error code is emitted, we can simply catch this in the fuzzer and skip the comparison. As long as we can detect or otherwise be made aware of what the JIT/interpreter can/cannot do, we can just filter these results out.
  2. Implement the relevant changes inside a pull request. OSS-Fuzz will not fuzz branches other than master, so you can stage your changes, test them, and then merge them.
zherczeg commented 1 year ago

I don't think this can be a (*FAIL) since it does not retry matches after a skip. It is more like a (COMMIT) to me. But I want to make sure before I change things. Is it possible auto anchoring with `(SKIP)`?

PhilipHazel commented 1 year ago

Good point. Now that I look I see that there is explicit code to NOT auto-anchor if the pattern contains SKIP or PRUNE. The interpreter treats SKIP like this:

/^..A(*SKIP)B|C/
    12ADC
--->12ADC
 +0 ^         ^
 +1 ^         .
 +2 ^^        .
 +3 ^ ^       A
 +4 ^  ^      (*SKIP)
+11 ^  ^      B
 +0    ^      ^
+13    ^      C
 +0     ^     ^
+13     ^     C
+14     ^^    End of pattern
 0: C

In other words, having matched ..A at the start, when it fails to match B, it does not go on to look for C at the start, but does a bumpalong, but because of the SKIP the next iteration starts after A instead of at 2. Of course, this is not an anchored pattern.

PhilipHazel commented 1 year ago

In reply to @addisoncrump: If a new opcode (for example) is added to the interpreter, the JIT won't recognize it and so won't compile. Matching will automatically fall back to the interpreter if called via pcre2_match() or give an error for pcre2_jit_match(). It's when there's a change to an existing opcode behaviour that there might be an issue. I am a newbie at Git and likely to screw up if I try to learn too many new tricks but I suppose such changes could be done in a new branch that is pushed to the repo. But then there may be clashes with other changes if there's a delay. Sorry, just trying not to have to change too many habits of a lifetime.....

PhilipHazel commented 1 year ago

The interpreter and JIT now seem to be in agreement in the patterns listed above. Can we now close this issue?

addisoncrump commented 1 year ago

There appear to be remaining variants:

  re> /a+?(*THEN)/endanchored
data> a
 0: a
data> aaaa
No match
data> aaaa\=no_jit
 0: a
PhilipHazel commented 1 year ago

@zherczeg, once again I think the interpreter is right here. It behaves the same for the Perl-compatible pattern /a+?(*THEN)\z/ and so does Perl.

addisoncrump commented 11 months ago

Closing this in favour of specific issues.