PCRE2Project / pcre2

PCRE2 development is now based here.

Simplify range data construction. #496

Closed by zherczeg 2 months ago

zherczeg commented 2 months ago

This patch uses the computed ranges to generate byte code rather than using add_to_class. It is a considerable simplification of the code.

zherczeg commented 2 months ago

@PhilipHazel if you agree with my suggestion in https://github.com/PCRE2Project/pcre2/issues/497, this patch is ready.

carenas commented 2 months ago

I think the following assertion is not correct:

Obviously a character class and its negated form cannot match to the same character

In PCRE2 it can, and the reasons are historical and described in #186.

In summary, our equivalent of Perl's "/u" requires both the utf and ucp modifiers to be set.

zherczeg commented 2 months ago

I am not sure I understand that part; it talks about configuring the modifier. Normally, if [C] matches a character, [^C] must not match it. The only exception is invalid UTF characters, which never match anything, like NaN among numbers.

carenas commented 2 months ago

The point is that without PCRE2_UCP (as you pointed out) all characters above 255 (in the 8-bit library) are not defined, so any [^C] would match them if PCRE2_UTF is enabled. As you pointed out, Perl has no non-UCP mode, but we do, and we even have UCP mode without UTF (e.g. in the 16-bit library).

I agree with you that they "shouldn't" match and that this is arguably a bug, but it is the currently expected behaviour when ONLY one of those options is set.

The "ambiguity" is resolved at compile time by the redefinition of \D that PCRE2_UCP drives, as shown by:

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /[^\D\P{Nd}]/B,utf,ucp
------------------------------------------------------------------
        Bra
        [^\P{Nd}\P{Nd}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}
data> 
  re> /[\D\P{Nd}]/B,utf,ucp
------------------------------------------------------------------
        Bra
        [\P{Nd}\P{Nd}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
No match
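
The two results above can be reproduced with a small set-membership model. This is only a sketch of the semantics discussed in this thread, not PCRE2's implementation: the function names are hypothetical, Python's unicodedata module stands in for PCRE2's Unicode property tables, and the non-UCP branch assumes \D means "anything outside ASCII 0-9".

```python
import unicodedata


def is_nd(cp):
    # True when the code point has Unicode general category Nd.
    return unicodedata.category(chr(cp)) == "Nd"


def not_digit(cp, ucp):
    # Model of \D. Assumption: with UCP it is redefined as \P{Nd};
    # without UCP it is everything outside ASCII 0-9.
    if ucp:
        return not is_nd(cp)
    return not (0x30 <= cp <= 0x39)


def not_nd(cp):
    # Model of \P{Nd}.
    return not is_nd(cp)


def pos_class(cp, ucp):
    # Model of [\D\P{Nd}]: union of the two items.
    return not_digit(cp, ucp) or not_nd(cp)


def neg_class(cp, ucp):
    # Model of [^\D\P{Nd}]: complement of the union.
    return not pos_class(cp, ucp)


if __name__ == "__main__":
    cp = 0x1D7CF  # MATHEMATICAL BOLD DIGIT ONE, category Nd
    print(neg_class(cp, ucp=True))   # True, like the first transcript
    print(pos_class(cp, ucp=True))   # False, like the second transcript
    print(neg_class(cp, ucp=False))  # False: without UCP, \D covers it
```

In this model a class and its negation can never both match, which is the expectation zherczeg describes; it does not capture the historical behaviour referred to in #186.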

carenas commented 2 months ago

Indeed, I think this might have just introduced a regression:

PCRE2 version 10.44 2024-06-07 (8-bit)
  re> /[^\D\P{Nd}]/B,utf,ascii_bsd
------------------------------------------------------------------
        Bra
        [^\x00-/:-\xff\P{Nd}]
        Ket
        End
------------------------------------------------------------------
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
  re> /[^\D\P{Nd}]/B,utf,ascii_bsd
------------------------------------------------------------------
        Bra
        [^\x00-/:-\xff\P{Nd}\x{100}-\x{10ffff}]
        Ket
        End
------------------------------------------------------------------

At least this one works, but the B output is confusing (not an issue introduced with this patch though):

 re> /[^\d]/B,utf,ascii_bsd
------------------------------------------------------------------
        Bra
        [\x00-/:-\xff] (neg)
        Ket
        End
------------------------------------------------------------------
data> 1
No match

zherczeg commented 2 months ago

It fixed that regression, and this is what I am talking about. \D matches anything that is not [0-9], which includes all characters above 255.

  re> /[\d]/B,utf
------------------------------------------------------------------
        Bra
        [0-9]
        Ket
        End
------------------------------------------------------------------
  re> /[\D]/B,utf
------------------------------------------------------------------
        Bra
        [\x00-/:-\xff] (neg)
        Ket
        End
------------------------------------------------------------------

The [\x00-/:-\xff] (neg) form is the same as [\x00-/:-\xff\x{100}-\x{10ffff}]. In the class above, this is then negated by the leading ^.
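
The equivalence claimed here can be checked with a short sketch. It assumes, as stated above, that the "(neg)" marker means "the listed ranges plus every code point above 255". The helper names are hypothetical and this is a plain range model, not PCRE2 code.

```python
MAX_CP = 0x10FFFF

# The two low ranges printed by pcre2test: \x00-/ and :-\xff
LOW = [(0x00, 0x2F), (0x3A, 0xFF)]

# The explicit form: same low ranges plus \x{100}-\x{10ffff}
EXPLICIT = LOW + [(0x100, MAX_CP)]


def in_ranges(cp, ranges):
    # True when cp falls inside any inclusive (lo, hi) range.
    return any(lo <= cp <= hi for lo, hi in ranges)


def matches_neg_form(cp):
    # Model of "[\x00-/:-\xff] (neg)": low ranges, or anything above 255.
    return in_ranges(cp, LOW) or cp > 0xFF


def matches_explicit(cp):
    # Model of [\x00-/:-\xff\x{100}-\x{10ffff}].
    return in_ranges(cp, EXPLICIT)


def forms_agree():
    # Compare both forms over every code point.
    return all(matches_neg_form(cp) == matches_explicit(cp)
               for cp in range(MAX_CP + 1))


def survivors_of_outer_negation():
    # Code points left once the whole set is negated with a leading ^.
    return [cp for cp in range(MAX_CP + 1) if not matches_explicit(cp)]


if __name__ == "__main__":
    print(forms_agree())
    print([chr(cp) for cp in survivors_of_outer_negation()])
```

Under these assumptions both forms accept the same code points, and negating the set leaves only the ASCII digits, which is why the data line 1 gives "No match" for the un-negated set.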