Pattern /[^\D\P{Nd}]/utf matches to \x{1d7cf}

zherczeg commented 1 week ago

This test is in testinput5:

  re> /[^\D\P{Nd}]/B,utf
------------------------------------------------------------------
        Bra
        [^\x00-/:-\xff\P{Nd}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}

Currently this pattern matches \x{1d7cf}.

Since \D is in ASCII mode (UCP is not enabled), /[\D]/utf matches anything that is not 0-9. That should include \x{1d7cf}, and it does:

  re> /[\D\P{Nd}]/B,utf
------------------------------------------------------------------
        Bra
        [\x00-/:-\xff\P{Nd}\x{100}-\x{10ffff}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}

Note: /[\P{Nd}]/utf does not match \x{1d7cf}, because U+1D7CF (MATHEMATICAL BOLD DIGIT ONE) has the Nd property.

Summary: both /[^\D\P{Nd}]/utf and /[\D\P{Nd}]/utf match \x{1d7cf}. Obviously a character class and its negated form cannot both match the same character, and I think the first one is incorrect. My newest code changes the first pattern to no-match, and I wanted to discuss it.

NWilson commented 1 week ago

This is a very interesting case!

There's something related here - which is the behaviour of \P{...} with /i modifier.

They specifically changed the behaviour of [^\P{...}] in ECMAScript. If you run the regex using /[^\P{...}]/iu vs using /[^\P{...}]/iv in JavaScript you get different results. (Note the change from "/u Unicode" to "/v Next-gen Unicode".)

They discussed this change here: https://github.com/tc39/proposal-regexp-v-flag/issues/30

(Note - there's a lot of confusion on that thread, and most of the chatter is irrelevant.)

How character classes work

- Take the characters and ranges, and union them to form a "set" of characters
- Things like \D mean "the complement/inverted-set" for whatever characters are in the \d set. This is applied immediately, at compile-time.
- Similarly for \p{...} and \P{...}.
- Then, ^ is applied last to complement/invert the set of matched characters.
- The input/match character is then tested to see if it's contained in the set.
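
As a rough illustration of those steps (not PCRE2's actual implementation), here is a small Python model over a tiny, hand-picked universe of characters. The names UNIVERSE, build_class and complement are made up for this sketch, and \d is assumed to mean ASCII 0-9 only, i.e. UCP is off.

```python
import unicodedata

# Tiny illustrative universe; a real engine works over all code points.
UNIVERSE = {"0", "9", "a", "/", "\u0663", "\U0001d7cf"}

# \d without UCP: ASCII digits only (assumption matching the thread).
ASCII_DIGITS = {c for c in UNIVERSE if "0" <= c <= "9"}
# \p{Nd}: Unicode decimal digits, looked up via unicodedata.
ND = {c for c in UNIVERSE if unicodedata.category(c) == "Nd"}


def complement(s):
    # Invert a set relative to the universe.
    return UNIVERSE - s


def build_class(items, negated=False):
    # Union the item sets first, then apply ^ last.
    members = set().union(*items)
    return complement(members) if negated else members


# \D and \P{Nd} are complemented immediately, as individual items.
not_d = complement(ASCII_DIGITS)
not_nd = complement(ND)

positive = build_class([not_d, not_nd])                # [\D\P{Nd}]
negative = build_class([not_d, not_nd], negated=True)  # [^\D\P{Nd}]

print("\U0001d7cf" in positive, "\U0001d7cf" in negative)
```

In this model the negated class reduces, by De Morgan, to the intersection of the \d set and the \p{Nd} set, so without UCP only ASCII digits can be members and \x{1d7cf} cannot match it.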

With the /i modifier:

Each character that appears in the set is "case-folded".
- So [A] means → add case_fold('A') to the set of matched characters
- And [A-B] means → take the set of characters {A...B} and form the set {case_fold(c) for c ∈ {A...B}}, and add that
- Obviously the input (match) string is read character-by-character and the input characters are case-folded too before matching against the set
- The input string is case-handled character-by-character because the flag can be set using (?i:[...]) to apply to portions of the regex only.
The ambiguity that the ECMAScript guys got hung up on is around \P{...}
- The old behaviour was: form the set for \p{...}. Then invert it to form the \P{...} set. Then finally form the set {case_fold(c) for c ∈ \P{...}}
- The new behaviour in ECMAScript is to change the order. Form the \p{...} set, then case-fold it, then invert it, so that /\P{...}/i means complement({case_fold(c) for c ∈ \p{...}})

Things like \d and \D behave the same as \p{...} or \P{...}. They are all just shorthands for a set of characters.
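
To make the two orderings concrete, here is a minimal Python sketch for \P{Lu} under /i. It is only a model of the description above: the three-character universe, the helper names, and the use of str.casefold() as a stand-in for the engine's case folding are all assumptions for illustration.

```python
import unicodedata

UNIVERSE = {"A", "a", "1"}  # tiny illustrative universe


def fold(c):
    # Stand-in for the engine's case folding; adequate for these ASCII samples.
    return c.casefold()


def fold_set(s):
    return {fold(c) for c in s}


# \p{Lu}: uppercase letters in the universe.
P_LU = {c for c in UNIVERSE if unicodedata.category(c) == "Lu"}


def matches(char_set, ch):
    # Under /i the input character is folded before the lookup.
    return fold(ch) in char_set


# Old order: complement first, then case-fold the result.
old_not_lu = fold_set(UNIVERSE - P_LU)
# New order: case-fold \p{Lu} first, then complement.
new_not_lu = UNIVERSE - fold_set(P_LU)

for ch in ("A", "a", "1"):
    print(ch, matches(old_not_lu, ch), matches(new_not_lu, ch))
```

With the old order, "A" matches /\P{Lu}/i even though it also matches /\p{Lu}/i. With the new order the two are proper complements and only "1" matches.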

zherczeg commented 1 week ago

Which \p{} is affected by case folding? It seemed to me (and I assumed) that a class always contains the other cases of all of its characters. Most properties are scripts or control characters. I suspect some extended script is affected.

NWilson commented 1 week ago

Well \P{Ll} and \P{Lu} do, maybe others. It drastically affects their meaning under /i, whether you do the case-fold before or after the inversion implied by \P.

zherczeg commented 1 week ago

True. In PCRE2, /i does not affect properties. This way we don't need to generate that many databases.

carenas commented 1 week ago

> In PCRE2, /i does not affect properties

Note that is no longer the case since #432

PhilipHazel commented 1 week ago

Perl does this:

Perl v5.40.0

/[\D\P{Nd}]/utf \x{1d7cf} No match

/[^\D\P{Nd}]/utf \x{1d7cf} 0: \x{1d7cf}

Which I think is right - PCRE2 is currently wrong. As the character is greater than 255 and UCP is not set, the bit map set up by \D is not relevant and only \P should count. However, if the first pattern is just [\P{Nd}] there is no match. So there is indeed a bug in PCRE2 and I think it is /[\D\P{Nd}]/utf that is incorrect.

PhilipHazel commented 1 week ago

I'm not sure in #497 which one you think is wrong and have fixed...

zherczeg commented 1 week ago

\D matches anything that is not 0-9, which includes all characters above 255. Perl uses Unicode for \D, which is why /[\D\P{Nd}]/utf does not match there. [\d] matches \x{1d7cf} in Perl. Perl has no non-UCP mode.
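
The difference between the two engines can be sketched with a small Python model. This is a simplification, not either engine's code: the ucp parameter is a hypothetical switch, and it assumes that a Unicode-aware \d is equivalent to \p{Nd}.

```python
import unicodedata


def is_nd(ch):
    # \p{Nd}: Unicode decimal digit.
    return unicodedata.category(ch) == "Nd"


def is_d(ch, ucp):
    # \d: Unicode decimal digit when ucp is set, otherwise ASCII 0-9 only.
    return is_nd(ch) if ucp else "0" <= ch <= "9"


def in_class(ch, ucp, negated):
    # Model of [\D\P{Nd}], or [^\D\P{Nd}] when negated is True.
    member = (not is_d(ch, ucp)) or (not is_nd(ch))
    return member != negated


BOLD_ONE = "\U0001d7cf"  # MATHEMATICAL BOLD DIGIT ONE

for ucp in (False, True):
    print(
        "ucp" if ucp else "no ucp",
        in_class(BOLD_ONE, ucp, negated=False),
        in_class(BOLD_ONE, ucp, negated=True),
    )
```

With ucp off, the positive class matches \x{1d7cf} and the negated one does not, which is the behaviour proposed for PCRE2. With ucp on, the results flip and agree with the Perl output quoted above.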

PhilipHazel commented 1 week ago

Ah yes, of course. I was forgetting that. OK, let's merge your patch.
