Closed addisoncrump closed 11 months ago
I think this might be a JIT bug. There is definitely something wrong, as shown by this test:
$ ./pcre2test -jit zz PCRE2 version 10.43-DEV 2023-04-14 (8-bit) /A(SKIP)A|/anchored A(SKIP)A|\=aftertext 0: 0+ (*SKIP)A|
The "aftertext" option causes pcre2test to output the text that follows the match and the output shows that it is matching an empty string following the first A, which must be wrong because the pattern is anchored.
Potentially related case that the fuzzer spit out just now:
re> /(*ACCEPT)*j/endanchored
data> j\=notempty
No match
data> j\=no_jit,notempty
0: j
For me (*ACCEPT)*
looks like a parse error. Maybe some perl mess.
re> /(*ACCEPT)*j/B
------------------------------------------------------------------
Bra
Brazero
SBra
*ACCEPT
KetRmax
j
Ket
End
------------------------------------------------------------------
It looks like it is /(?:(ACCEPT))j/, and that should be valid.
Anyway, j
cannot be matched since (*ACCEPT)
stops matching the pattern, and the result is always an empty string here. With notempty
the pattern never matches.
I think (ACCEPT) should provoke an error "qualifier does not follow a repeatable item" just like (SKIP) does, though oddly Perl does not give an error for these.
I agree. So there are two issues here: one is parser related, the other is execution (since /(?:(ACCEPT))j/ is valid, and generates the same code as /(*ACCEPT)*j/
)
Another case with THEN
for the repertoire:
$ ./pcre2test -jit
PCRE2 version 10.43-DEV 2023-04-14 (8-bit)
re> /(?<!(*THEN)a|)/
data> b
0:
data> b\=no_jit
No match
Finally with PRUNE
:
$ ~/git/pcre2/pcre2test -jit
PCRE2 version 10.43-DEV 2023-04-14 (8-bit)
re> /.*(*PRUNE)a|/
data> a
0:
data> a\=no_jit
No match
On checking the history, I found that quantifying (ACCEPT) is explicitly allowed. The ChangeLog entry for 10.34 says this: "Allow (ACCEPT) to be quantified, because an ungreedy quantifier with a zero minimum is potentially useful." Also, it seems that recent changes have brought JIT and the interpreter into agreement for the (*ACCEPT) example:
/(ACCEPT)j/endanchored j\=notempty No match
/(ACCEPT)j/endanchored,jitverify j\=notempty No match (JIT)
The (PRUNE) example is also OK. (THEN) is still an issue that seems to be an interpreter bug; I'm looking into it.
I think the interpreter is correct here:
/(?<!(*THEN)a|)/ b No match
/(?<!(*THEN)a|)/jitverify b 0: (JIT)
(*THEN) at the start of a branch is effectively a no-op. The lookbehind should match an empty string, but it's a negative lookbehind so it should therefore fail. And I can confirm that Perl agrees with the interpreter. So I think this is a JIT issue.
Fixed in e76eae6
Another victim of vreverse. I knew it will be painful.
Fix is incomplete:
$ ./pcre2test -jit
PCRE2 version 10.43-DEV 2023-04-14 (8-bit)
re> /A(*SKIP)\n|/anchored
data> AB\n
0:
data> AB\n\=no_jit
No match
I just fixed the last test, this is unrelated. Btw why do you combine skip with anchored? They do the opposite things.
I am automatically looking for these cases with a fuzzer. Handling these edge cases now means that any future changes which cause JIT/interpreter to be inconsistent will automatically be flagged by OSS-Fuzz, making it effectively an automated correctness testing tool.
I has occurred to me that there is a problem with this. From time to time there are changes that affect both the interpreter and the JIT, and they are not usually done simultaneously. So there is a period during which the two are out of step. If, for example, I change the interpreter (as happened recently with some UTF changes), I mark the relevant tests \=no_jit so that the tests all pass. Then when Zoltan makes the JIT changes, that flag can be removed. Your new automatic tests are likely to throw up unnecessary reports in this scenario. Do we care? I don't know.
On Sun, 26 Nov 2023 at 06:23, Addison Crump @.***> wrote:
I am automatically looking for these cases with a fuzzer. Handling these edge cases now means that any future changes which cause JIT/interpreter to be inconsistent will automatically be flagged by OSS-Fuzz, making it effectively an automated correctness testing tool.
— Reply to this email directly, view it on GitHub https://github.com/PCRE2Project/pcre2/issues/326#issuecomment-1826642834, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG4QUAHDUCTGGKFZGU5EJIDYGLN7XAVCNFSM6AAAAAA7FKI4OWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQMRWGY2DEOBTGQ . You are receiving this because you commented.Message ID: @.***>
My question is what to do with (*skip)
and anchored. As for me these two does not work together, probably a user error somewhere, so I would prefer a syntax (compile) error here. If we want to support this, what should we do? Is it the same as a (*commit) in this case?
Surely (SKIP) is essentially the same as (FAIL) because there is only one match attempt. That is the effect of how the interpreter treats it. (*SKIP) passes back "start the next match attempt here", but because the pattern is anchored, there is no next match, and the result is "no match". Perl seems to do the same. Giving a compile error isn't straightforward because of auto-anchoring, use of ^ and \A etc., possibly in different alternatives.
From time to time there are changes that affect both the interpreter and the JIT, and they are not usually done simultaneously.
There are two methods by which we can approach handling this case.
master
, so you can stage your changes, test them, and then merge them.I don't think this can be a (*FAIL)
since it does not retry matches after a skip. It is more like a (COMMIT) to me. But I want to make sure before I change things. Is it possible auto anchoring with `(SKIP)`?
Good point. Now that I look I see that there is explicit code to NOT auto-anchor if the pattern contains SKIP or PRUNE. The interpreter treats SKIP like this:
/^..A(*SKIP)B|C/
12ADC
--->12ADC
+0 ^ ^
+1 ^ .
+2 ^^ .
+3 ^ ^ A
+4 ^ ^ (*SKIP)
+11 ^ ^ B
+0 ^ ^
+13 ^ C
+0 ^ ^
+13 ^ C
+14 ^^ End of pattern
0: C
In other words, having matched ..A at the start, when it fails to match B, it does not go on to look for C at the start, but does a bumpalong, but because of the SKIP the next iteration starts after A instead of at 2. Of course, this is not an anchored pattern.
In reply to @addisoncrump: If a new opcode (for example) is added to the interpreter, the JIT won't recognize it and so won't compile. Matching will automatically fall back to the interpreter if called via pcre2_match() or give an error for pcre2_jit_match(). It's when there's a change to an existing opcode behaviour that there might be an issue. I am a newbie at Git and likely to screw up if I try to learn too many new tricks but I suppose such changes could be done in a new branch that is pushed to the repo. But then there may be clashes with other changes if there's a delay. Sorry, just trying not to have to change too many habits of a lifetime.....
The interpreter and JIT now seem to be in agreement in the patterns listed above. Can we now close this issue?
There appear to be remaining variants:
re> /a+?(*THEN)/endanchored
data> a
0: a
data> aaaa
No match
data> aaaa\=no_jit
0: a
@zherczeg, once again I think the interpreter is right here. It behaves the same for the Perl-compatible pattern /a+?(*THEN)\z/ and so does Perl.
Closing this in favour of specific issues.
Discovered by #322.
The pattern
A(*SKIP)A|
(number ofA
s on either side may be replaced by any non-zero amount of literals) inconsistently matches itself (and I think any string) in combination with the anchored flag being set:I suspect that in this case, the JIT code emitted for SKIP does not consider anchoring, but I haven't investigated further.