Inconsistent behaviour of character classes + ucp in 16- and 32-bit mode

PCRE2Project / pcre2

PCRE2 development is now based here.

Inconsistent behaviour of character classes + ucp in 16- and 32-bit mode #360

Closed (addisoncrump closed this issue 11 months ago)

addisoncrump commented 11 months ago

Seems that adding the dictionary was good for #322.

$ ./pcre2test -jit -16
PCRE2 version 10.43-DEV 2023-04-14 (16-bit)
  re> /[^[:print:]\x{f6f6}]/ucp
data> \x{f6f6}
 0: \x{f6f6}
data> \x{f6f6}\=no_jit
No match

The JIT seems to behave incorrectly here: \x{f6f6} should not be matched. The behaviour disappears when the ucp flag is not set.

addisoncrump commented 11 months ago

Seemingly related:

  re> /[[:xdigit:]\x{6500}]a/ucp
data> \x{6500}a
No match
data> \x{6500}a\=no_jit
 0: \x{6500}a

addisoncrump commented 11 months ago

Note that similar issues seem to appear in 32-bit mode as well.

carenas commented 11 months ago

@addisoncrump: could you explain more clearly what the issue is from your point of view?

Also worth mentioning: it is at least not a regression.

PhilipHazel commented 11 months ago

This seems to be a straightforward JIT bug. The interpreter gives the same answer as Perl.

carenas commented 11 months ago

> This seems to be a straightforward JIT bug

I guess the part I am missing is what is "this".

It seems that JIT returns bad results in both the 16-bit and 32-bit libraries but not in the 8-bit one, and that it ONLY fails when there is an implicit "inverse union" (I am not even sure about this last part, as the description in the issue doesn't make sense to me and examples that don't have "^" in the class definition are also provided).

FWIW, PCRE2 is missing the whole implementation of Unicode class logical operations as suggested by TR#18 and that might also "fix" this if implemented IMHO.

PhilipHazel commented 11 months ago

Sorry, I thought it was obvious. /[^[:print:]\x{f6f6}]/ucp should match a character that is not printing and not 0xf6f6. Clearly this should not match 0xf6f6, but in JIT 16/32-bit modes, it does. It also does so in 8-bit mode with UTF set. Similarly, /[[:xdigit:]\x{6500}]a/ucp should match a hex digit or 0x6500, followed by "a", but JIT doesn't.

As far as class operations go, see #13.
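
For readers following along, the expected semantics described above can be sketched as plain set membership. This is only a rough Python model, not PCRE2 code: the function names are made up for illustration, `is_print_ucp` assumes that [:print:] under ucp covers the L, M, N, P, S, Cf and Zs categories minus a few format characters, and `is_xdigit` assumes ASCII hex digits only (the exact definitions may differ between PCRE2 releases).

```python
import string
import unicodedata

# Hypothetical helper: approximate model of [:print:] with ucp set.
# Assumption: categories L, M, N, P, S, Cf, Zs, minus a few format characters.
_NOT_PRINT = {0x061C, 0x180E, 0x2066, 0x2067, 0x2068, 0x2069}

def is_print_ucp(cp: int) -> bool:
    if cp in _NOT_PRINT:
        return False
    cat = unicodedata.category(chr(cp))
    return cat[0] in "LMNPS" or cat in ("Cf", "Zs")

# Hypothetical helper: [:xdigit:] modelled as ASCII hex digits only.
def is_xdigit(cp: int) -> bool:
    return chr(cp) in string.hexdigits

def negated_class_matches(cp: int) -> bool:
    """Model of [^[:print:]\\x{f6f6}]: not printing AND not U+F6F6."""
    return not (is_print_ucp(cp) or cp == 0xF6F6)

def union_class_matches(cp: int) -> bool:
    """Model of [[:xdigit:]\\x{6500}]: a hex digit OR U+6500."""
    return is_xdigit(cp) or cp == 0x6500

if __name__ == "__main__":
    # U+F6F6 is a private-use code point, so it is not printing on its own;
    # it is the explicit \x{f6f6} in the negated class that must exclude it.
    print("U+F6F6 printing:", is_print_ucp(0xF6F6))
    print("negated class matches U+F6F6:", negated_class_matches(0xF6F6))
    print("union class matches U+6500:", union_class_matches(0x6500))
```

In this model the negated class rejects U+F6F6 and the second class accepts U+6500, which is what the interpreter (`\=no_jit`) reports in the transcripts above.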

addisoncrump commented 11 months ago

Apologies for the late response. The holidays are a busy time, ironically!

I am using terminology based on how I learned regex (in formal automata class, not in PCRE2!) and so there's a disconnect there. "Inverse union" is a negatively-matching character set (i.e. "if it's in this set it should not match") and the "union" here is just the set operation that pulls together \x{f6f6} and the character class [:print:]. As @PhilipHazel suggests, the issue here is that the JIT gives a verifiably incorrect result: it matches an input character that is included in the character set that should not be matched.

carenas commented 11 months ago

> The holidays are a busy time, ironically!

Could you change the subject of this ticket to indicate it affects all libraries but the 8-bit one?

Could you also note in the description (as Philip mentioned) that it affects JIT only, that it has nothing to do with the use of "^" (as shown by the example with xdigit), and that it is not a regression, at least relative to 10.42?

addisoncrump commented 11 months ago

Very well :slightly_smiling_face:

zherczeg commented 11 months ago

Commit 542cb11242cfc9be9b6218965751bfbb13a8b6a2 should fix this.

addisoncrump commented 11 months ago

Seems correct -- I will reopen if there are new corner cases discovered.