PCRE2Project / pcre2

PCRE2 development is now based here.
Other
922 stars 194 forks source link

JIT/non-JIT match difference with `*+` repeated groups #324

Closed addisoncrump closed 1 year ago

addisoncrump commented 1 year ago

Discovered with #322.

The regex (A)*+, with A as any literal, does not have consistent matches between JIT and non-JIT when used with endanchored.

sh-5.2$ xxd endanchored_crash 
00000000: 0000 0000 0000 0020 2841 292a 2b         ....... (A)*+
sh-5.2$ ./pcre2_fuzzer endanchored_crash
Encountered failure while performing match errorcode comparison; context:
Pattern/sample string (hex encoded): 2841292a2b
Compile options 20100000 never_backslash_c,endanchored
Match options 00002000
Non-JIT'd operation did not emit an error.
1 matches discovered by non-JIT'd regex:
Match 0 (hex encoded): 

JIT'd operation did not emit an error.
2 matches discovered by JIT'd regex:
Match 0 (hex encoded): 
Match 1 (hex encoded): 41

Disabling endanchored causes the difference to disappear. I'm not sure what the correct behaviour should be.

addisoncrump commented 1 year ago

The issue can also be observed with (B)*+A and without endanchored:

sh-5.2$ xxd no_endanchored_crash 
00000000: 0000 0000 0000 0000 2842 292a 2b41       ........(B)*+A
sh-5.2$ ./pcre2_fuzzer no_endanchored_crash
Encountered failure while performing match errorcode comparison; context:
Pattern/sample string (hex encoded): 2842292a2b41
Compile options 00100000 never_backslash_c
Match options 00002000
Non-JIT'd operation did not emit an error.
1 matches discovered by non-JIT'd regex:
Match 0 (hex encoded): 41

JIT'd operation did not emit an error.
2 matches discovered by JIT'd regex:
Match 0 (hex encoded): 41
Match 1 (hex encoded): 42
zherczeg commented 1 year ago

These are hard to understand. Is the sample sting is the same as pattern? What are the dots do before the pattern? Would it be possible to write the offsets for the match? Could it print the flags for the match options?

addisoncrump commented 1 year ago
addisoncrump commented 1 year ago

The 0x20 is part of the flag data.

$ ./pcre2test -jit
PCRE2 version 10.43-DEV 2023-04-14 (8-bit)
  re> /(A)*+/never_backslash_c,endanchored
data> (A)*+
 0: 
 1: A
data> (A)*+\=no_jit
 0: 
addisoncrump commented 1 year ago

In the future, I'll submit reproduction cases with pcre2test instead for clarity :slightly_smiling_face: Apologies for the confusion.

zherczeg commented 1 year ago

It also appears this way: /(A)*+$/ I suspect reverting the iteration does not work. I will check later.

carenas commented 1 year ago

In the future, I'll submit reproduction cases with pcre2test instead for clarity

thanks; it would be also a good idea to remove "never_backslash_c" from it, as that is also likely unrelated and only needed by the fuzzer to avoid crashing in unlucky input AFAIK.

in this case, using the PCRE2_ENDANCHORED flag at match time (which might be the correct way to use this) also seem to "fix" JIT as shown by:

% ./pcre2test -jit
PCRE2 version 10.43-DEV 2023-04-14 (8-bit)
  re> /(A)*+/endanchored
data> (A)*+\=endanchored
 0: 
data> (A)*+\=endanchored,no_jit
 0: 
data> 
zherczeg commented 1 year ago

This is a bug in jit. I forgot a braposzero check. I will make a patch for it.

carenas commented 1 year ago

This is a bug in jit

it seems that it is broken all the way back up to the no longer maintained 8.x series as shown by:

% pcretest -s++         
PCRE version 8.45 2021-06-15

  re> /(B)*+A/
data> (B)*+A
 0: A (JIT)
 1: B (JIT)
zherczeg commented 1 year ago

Fixed in b43e745298ff80368c1e5c674d5f74c76bc4c13e

addisoncrump commented 1 year ago

Not seeing any unintended side effects from the fix... Closing this :slightly_smiling_face: