Closed zherczeg closed 2 months ago
@PhilipHazel if you agree with my suggestion in https://github.com/PCRE2Project/pcre2/issues/497, this patch is ready.
I think the following assertion is not correct:
Obviously a character class and its negated form cannot match to the same character
In PCRE2 it can, and the reasons are historical and described in #186.
In summary, our "/u" Perl equivalent requires both the utf and ucp modifiers to be set.
I am not sure I understand that part; it talks about configuring the modifier. Normally, if [C] matches something, [^C] must not match it, except for invalid UTF characters, which never match anything (like NaN among numbers).
The point is that without PCRE2_UCP (as you pointed out) all characters above 255 (in the 8-bit library) are not defined, so any [^C] would match them if PCRE2_UTF is enabled. As you pointed out, Perl has no non-UCP mode, but we do, and we even have UCP mode without UTF (e.g. in the 16-bit library).
I agree with you that they "shouldn't" match and that this is arguably a bug, but it is the currently expected behaviour when ONLY one of those options is set.
The "ambiguity" is resolved at compile time by the redefinition of \D that PCRE2_UCP drives, as shown by:
PCRE2 version 10.44 2024-06-07 (8-bit)
re> /[^\D\P{Nd}]/B,utf,ucp
------------------------------------------------------------------
Bra
[^\P{Nd}\P{Nd}]
Ket
End
------------------------------------------------------------------
data> \x{1d7cf}
0: \x{1d7cf}
data>
re> /[\D\P{Nd}]/B,utf,ucp
------------------------------------------------------------------
Bra
[\P{Nd}\P{Nd}]
Ket
End
------------------------------------------------------------------
data> \x{1d7cf}
No match
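The two sessions above can be modelled in a few lines of Python. This is only a sketch of the class semantics, not PCRE2's implementation: it assumes \D is rewritten to \P{Nd} under ucp (as the B listings show), uses Python's unicodedata for the Nd property, and the function names are made up for illustration.

```python
import unicodedata

def not_nd(cp: int) -> bool:
    # Model of \P{Nd}: true for any code point that is NOT a decimal digit.
    return unicodedata.category(chr(cp)) != "Nd"

def in_class(cp: int) -> bool:
    # [\D\P{Nd}] under utf,ucp, i.e. [\P{Nd}\P{Nd}] after \D is redefined.
    return not_nd(cp) or not_nd(cp)

def in_negated_class(cp: int) -> bool:
    # [^\D\P{Nd}] under utf,ucp, i.e. [^\P{Nd}\P{Nd}].
    return not in_class(cp)

cp = 0x1D7CF  # MATHEMATICAL BOLD DIGIT ONE, general category Nd
print(in_class(cp), in_negated_class(cp))  # False True
```

With both modifiers set, the class and its negation never agree on a code point, which is consistent with the "No match" and the match shown above.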
Indeed, I think this might have just introduced a regression:
PCRE2 version 10.44 2024-06-07 (8-bit)
re> /[^\D\P{Nd}]/B,utf,ascii_bsd
------------------------------------------------------------------
Bra
[^\x00-/:-\xff\P{Nd}]
Ket
End
------------------------------------------------------------------
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
re> /[^\D\P{Nd}]/B,utf,ascii_bsd
------------------------------------------------------------------
Bra
[^\x00-/:-\xff\P{Nd}\x{100}-\x{10ffff}]
Ket
End
------------------------------------------------------------------
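To make the difference between the two listings concrete, here is a rough Python model that reads each printed class literally and evaluates it for one code point. It is a sketch under assumptions: the names class_10_44 and class_10_45_dev are hypothetical, \P{Nd} is approximated with Python's unicodedata, and nothing here calls PCRE2.

```python
import unicodedata

def not_nd(cp: int) -> bool:
    # Model of \P{Nd}.
    return unicodedata.category(chr(cp)) != "Nd"

def in_ranges(cp: int, ranges) -> bool:
    return any(lo <= cp <= hi for lo, hi in ranges)

LOW = [(0x00, 0x2F), (0x3A, 0xFF)]   # \x00-/ and :-\xff, present in both listings
HIGH = [(0x100, 0x10FFFF)]           # \x{100}-\x{10ffff}, only in the 10.45-DEV listing

def class_10_44(cp: int) -> bool:
    # [^\x00-/:-\xff\P{Nd}]
    return not (in_ranges(cp, LOW) or not_nd(cp))

def class_10_45_dev(cp: int) -> bool:
    # [^\x00-/:-\xff\P{Nd}\x{100}-\x{10ffff}]
    return not (in_ranges(cp, LOW + HIGH) or not_nd(cp))

cp = 0x1D7CF  # an Nd character above 255
print(class_10_44(cp), class_10_45_dev(cp))  # True False
```

Read this way, the two forms agree on ASCII digits and differ only for Nd characters above 255, which is the behaviour change under discussion.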
At least this one works, but the B output is confusing (not an issue introduced with this patch, though):
re> /[^\d]/B,utf,ascii_bsd
------------------------------------------------------------------
Bra
[\x00-/:-\xff] (neg)
Ket
End
------------------------------------------------------------------
data> 1
No match
That regression is fixed, and this is what I am talking about. \D matches anything that is not [0-9], which includes all characters > 255.
re> /[\d]/B,utf
------------------------------------------------------------------
Bra
[0-9]
Ket
End
------------------------------------------------------------------
re> /[\D]/B,utf
------------------------------------------------------------------
Bra
[\x00-/:-\xff] (neg)
Ket
End
------------------------------------------------------------------
The [\x00-/:-\xff] (neg) is the same as [\x00-/:-\xff\x{100}-\x{10ffff}]. This is negated above with ^.
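The equivalence can be checked with plain range arithmetic. The following is a minimal Python sketch, assuming a class is just a list of inclusive code point ranges over 0..0x10FFFF; complement and merge are illustrative helpers, not PCRE2 functions.

```python
MAX_CP = 0x10FFFF

def complement(ranges, max_cp=MAX_CP):
    # Return sorted ranges covering every code point NOT in `ranges`.
    out, nxt = [], 0
    for lo, hi in sorted(ranges):
        if lo > nxt:
            out.append((nxt, lo - 1))
        nxt = max(nxt, hi + 1)
    if nxt <= max_cp:
        out.append((nxt, max_cp))
    return out

def merge(ranges):
    # Join overlapping or adjacent ranges into a canonical sorted list.
    out = []
    for lo, hi in sorted(ranges):
        if out and lo <= out[-1][1] + 1:
            out[-1] = (out[-1][0], max(out[-1][1], hi))
        else:
            out.append((lo, hi))
    return out

digits = [(0x30, 0x39)]                      # [0-9]
not_digits = complement(digits)              # \D over the full code point space
explicit = merge([(0x00, 0x2F), (0x3A, 0xFF), (0x100, 0x10FFFF)])

print(not_digits == explicit)                # True
print(complement(not_digits) == digits)      # True
```

So the bitmap part plus the implicit "everything above 255" is one contiguous complement of [0-9], and negating it again gives back [0-9].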
This patch uses the computed ranges to generate byte code rather than using add_to_class. It is a considerable simplification of the code.