Closed addisoncrump closed 11 months ago
Seemingly related:
re> /[[:xdigit:]\x{6500}]a/ucp
data> \x{6500}a
No match
data> \x{6500}a\=no_jit
0: \x{6500}a
Note that similar issues seem to appear in 32-bit mode as well.
@addisoncrump: could you better explain what the issue is from your point of view?
It is also worth mentioning that it is at least not a regression.
This seems to be a straightforward JIT bug. The interpreter gives the same answer as Perl.
> This seems to be a straightforward JIT bug
I guess the part I am missing is what is "this".
It seems that JIT returns bad results in both the 16-bit and 32-bit libraries but not in the 8-bit one, and that it ONLY fails when there is an implicit "inverse union" (I am not even sure about this last part, as the description in the issue doesn't make sense and examples that don't have "^" in the class definition are also provided).
FWIW, PCRE2 is missing the whole implementation of Unicode class logical operations as suggested by TR#18 and that might also "fix" this if implemented IMHO.
Sorry, I thought it was obvious. /[^[:print:]\x{f6f6}]/ucp should match a character that is not printing and not 0xf6f6. Clearly this should not match 0xf6f6, but in JIT 16/32 bit modes, it does. Also in 8-bit mode with UTF set. Similarly, /[[:xdigit:]\x{6500}]a/ should match a hex digit or 0x6500, followed by "a", but JIT doesn't.
As far as class operations go, see #13.
Apologies for the late response. The holidays are a busy time, ironically!
I am using terminology based on how I learned regex (in formal automata class, not in PCRE2!) and so there's a disconnect there. "Inverse union" is a negatively-matching character set (i.e. "if it's in this set it should not match") and the "union" here is just the set operation that pulls together \x{f6f6} and the character class [:print:]. As @PhilipHazel suggests, the issue here is that the JIT provides a verifiably incorrect response as the JIT matches the input character which is included in the character set that should not be matched.
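To make the expected semantics concrete, here is a minimal Python sketch of the set logic behind the two patterns discussed in this thread. It is only a model, not PCRE2 code: `is_print_approx` and `is_xdigit_ascii` are hypothetical stand-ins that approximate `[:print:]` and `[:xdigit:]` (PCRE2's real tables under ucp differ in detail), and the function names are made up for illustration. The point is that an explicitly listed code point is part of the union regardless of what the POSIX class says about it.

```python
import string
import unicodedata

EXTRA_NEGATED = 0xF6F6  # the \x{f6f6} in /[^[:print:]\x{f6f6}]/ucp
EXTRA_XDIGIT = 0x6500   # the \x{6500} in /[[:xdigit:]\x{6500}]a/ucp


def is_print_approx(ch: str) -> bool:
    """Rough stand-in for [:print:] (assumption, not PCRE2's exact table)."""
    category = unicodedata.category(ch)
    # Letters, marks, numbers, punctuation, symbols and space separators
    # count as printing; controls, private use, etc. do not.
    return category[0] in "LMNPS" or category == "Zs"


def is_xdigit_ascii(ch: str) -> bool:
    """Simplified [:xdigit:]: ASCII hexadecimal digits only (assumption)."""
    return ch in string.hexdigits


def negated_class_matches(ch: str) -> bool:
    """Model of [^[:print:]\\x{f6f6}]: the complement of the union."""
    in_union = is_print_approx(ch) or ord(ch) == EXTRA_NEGATED
    return not in_union


def xdigit_pattern_matches(subject: str) -> bool:
    """Model of the unanchored pattern [[:xdigit:]\\x{6500}]a."""
    return any(
        (is_xdigit_ascii(first) or ord(first) == EXTRA_XDIGIT) and second == "a"
        for first, second in zip(subject, subject[1:])
    )
```

In this model U+F6F6 is a private-use character, so it is not printing, yet it is still excluded by the negated class because it is listed explicitly. That is the behaviour the interpreter shows and the JIT result contradicts.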
> The holidays are a busy time, ironically!
Could you change the subject of this ticket to indicate that it affects all libraries but the 8-bit one?
Could you also note in the description (as Philip mentioned) that it affects JIT only, that it has nothing to do with the use of "^" (as shown by the example with xdigit), and that it is not a regression, at least for 10.42?
Very well :slightly_smiling_face:
542cb11242cfc9be9b6218965751bfbb13a8b6a2 should fix this
Seems correct -- I will reopen if there are new corner cases discovered.
Seems that adding the dictionary was good for #322.
JIT seems to perform incorrectly here, \x{f6f6} should not be matched. Behaviour disappears when ucp flag is not set.