Open zherczeg opened 1 week ago
This is a very interesting case!
There's something related here: the behaviour of \P{...} with the /i modifier.
They specifically changed the behaviour of [^\P{...}] in ECMAScript. If you run the regex as /[^\P{...}]/iu versus /[^\P{...}]/iv in JavaScript, you get different results. (Note the change from "/u Unicode" to "/v Next-gen Unicode".)
They discussed this change here: https://github.com/tc39/proposal-regexp-v-flag/issues/30
(Note: that thread contains a lot of confusion, and most of the chatter is irrelevant.)
\D means "the complement/inverted set" of whatever characters are in the \d set. This is applied immediately, at compile time.
The same holds for \p{...} and \P{...}.
^ is applied last to complement/invert the set of matched characters.
With the /i modifier:
[A] means → add case_fold('A') to the set of matched characters
[A-B] means → take the set of characters {A...B}, form the set {case_fold(c) for c ∈ {A...B}}, and add that
Use (?i:[...]) to apply it to portions of the regex only.
\P{...} means → first form the \p{...} set. Then invert it to form the \P{...} set. Then finally form the set {case_fold(c) for c ∈ \P{...}}
The alternative is to first form the \p{..} set, then case-fold it, then invert it, so that /\P{...}/i means complement({case_fold(c) for c ∈ \p{...}})
Things like \d and \D behave the same as \p{...} or \P{...}. They are all just shorthands for a set of characters.
Which \p{} is affected by case folding? It seemed to me (and I assumed) that a class always contains the other cases of all of its characters. Most properties are scripts or control characters. I suspect some extended script is affected.
Well, \P{Ll} and \P{Lu} are, and maybe others. Whether you do the case-fold before or after the inversion implied by \P drastically affects their meaning under /i.
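To make the ordering difference concrete, here is a minimal Python sketch over a toy universe of ASCII letters and digits. It is only an illustration, not how PCRE2, Perl, or any JavaScript engine implements this: UNIVERSE, case_close, and complement are hypothetical helpers, and case_close (add every character that folds to the same thing as a member) is an assumed reading of "case-fold the set".

```python
import unicodedata

# Toy universe (an assumption to keep the example small): 0-9, A-Z, a-z.
UNIVERSE = {chr(c) for c in range(0x30, 0x7B) if chr(c).isalnum()}

def case_close(chars):
    """Return every character in UNIVERSE that case-folds to the same
    string as some member of chars (hypothetical helper)."""
    folded = {c.casefold() for c in chars}
    return {u for u in UNIVERSE if u.casefold() in folded}

def complement(chars):
    """Invert a set relative to the toy universe."""
    return UNIVERSE - chars

# \p{Ll} restricted to the toy universe: a-z.
p_Ll = {c for c in UNIVERSE if unicodedata.category(c) == "Ll"}

# Order 1: invert first (giving \P{Ll}), then case-fold.
invert_then_fold = case_close(complement(p_Ll))

# Order 2: case-fold \p{Ll} first, then invert.
fold_then_invert = complement(case_close(p_Ll))

print(len(invert_then_fold), "".join(sorted(fold_then_invert)))
```

In this toy model, order 1 ends up matching every character in the universe (the uppercase letters in \P{Ll} pull the lowercase ones back in), while order 2 matches only the digits.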
True. In PCRE2, /i does not affect properties. This way we don't need to generate that many databases.
"In PCRE2, /i does not affect properties"
Note that this is no longer the case since #432.
Perl does this:
Perl v5.40.0
/[\D\P{Nd}]/utf
    \x{1d7cf}
No match
/[^\D\P{Nd}]/utf
    \x{1d7cf}
 0: \x{1d7cf}
Which I think is right - PCRE2 is currently wrong. As the character is greater than 255 and UCP is not set, the bit map set up by \D is not relevant and only \P should count. However, if the first pattern is just [\P{Nd}] there is no match. So there is indeed a bug in PCRE2 and I think it is /[\D\P{Nd}]/utf that is incorrect.
I'm not sure in #497 which one you think is wrong and have fixed...
\D matches anything that is not 0-9, and that includes all characters > 255. Perl uses Unicode for \D, which is why /[\D\P{Nd}]/utf does not match. [\d] matches \x{1d7cf} in Perl. Perl has no non-UCP mode.
Ah yes, of course. I was forgetting that. OK, let's merge your patch.
This test is in testinput5:
Currently this pattern matches \x{1d7cf}.
Since \D is in ASCII mode (UCP is not enabled), /[\D]/utf matches anything that is not 0-9. That should include \x{1d7cf}. This looks true:
Note: /[\P{Nd}]/utf does not match \x{1d7cf}.
Summary: both /[^\D\P{Nd}]/utf and /[\D\P{Nd}]/utf match \x{1d7cf}. Obviously a character class and its negated form cannot match the same character, and I think the first one is incorrect. My newest code changes this pattern to no-match, and I wanted to discuss it.
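One rough way to check this reasoning is to model each class item as a predicate in Python. This is a sketch under stated assumptions, not PCRE2 or Perl code: ascii_not_digit, unicode_not_digit, and not_Nd are hypothetical helper names, and Python's unicodedata module stands in for the real property tables.

```python
import unicodedata

def ascii_not_digit(ch):
    """\\D without UCP: anything that is not ASCII 0-9."""
    return not ("0" <= ch <= "9")

def not_Nd(ch):
    """\\P{Nd}: anything whose general category is not Nd."""
    return unicodedata.category(ch) != "Nd"

def unicode_not_digit(ch):
    """\\D with Unicode semantics, modelled here as equivalent to \\P{Nd}."""
    return not_Nd(ch)

ch = "\U0001D7CF"  # the \x{1d7cf} test character, general category Nd

# Non-UCP model: [\D\P{Nd}] is the union of the two items.
non_ucp_class = ascii_not_digit(ch) or not_Nd(ch)
non_ucp_negated = not non_ucp_class

# Unicode model of \D, as described for Perl in this thread.
unicode_class = unicode_not_digit(ch) or not_Nd(ch)
unicode_negated = not unicode_class

print(non_ucp_class, non_ucp_negated, unicode_class, unicode_negated)
```

Under the non-UCP model the positive class matches and the negated class does not, which agrees with changing /[^\D\P{Nd}]/utf to no-match. Under the Unicode model the results flip, which agrees with the Perl output quoted earlier in the thread.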