PCRE2Project / pcre2

PCRE2 development is now based here.

Pattern /[^\D\P{Nd}]/utf matches to \x{1d7cf} #497

Open zherczeg opened 1 week ago

zherczeg commented 1 week ago

This test is in testinput5:

  re> /[^\D\P{Nd}]/B,utf
------------------------------------------------------------------
        Bra
        [^\x00-/:-\xff\P{Nd}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}

Currently this pattern matches \x{1d7cf}.

Since \D is in ASCII mode (UCP is not enabled), /[\D]/utf matches anything that is not 0-9, which should include \x{1d7cf}. That appears to hold:

  re> /[\D\P{Nd}]/B,utf
------------------------------------------------------------------
        Bra
        [\x00-/:-\xff\P{Nd}\x{100}-\x{10ffff}]
        Ket
        End
------------------------------------------------------------------
data> \x{1d7cf}
 0: \x{1d7cf}

Note: /[\P{Nd}]/utf does not match \x{1d7cf}, since that character (MATHEMATICAL BOLD DIGIT ONE) has the Nd property.

Summary: both /[^\D\P{Nd}]/utf and /[\D\P{Nd}]/utf match \x{1d7cf}. Obviously a character class and its negated form cannot both match the same character, and I think the first one is incorrect. My newest code changes this pattern to no-match, and I wanted to discuss it.
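To make the set reasoning above concrete, here is a minimal Python sketch. It is not PCRE2 code: the helper names (`ascii_not_digit`, `not_nd`, `class_matches`) are made up for illustration, and it assumes a class is simply the union of its items, with `^` inverting the whole union.

```python
import unicodedata

CP = 0x1D7CF  # MATHEMATICAL BOLD DIGIT ONE, general category Nd


def ascii_not_digit(cp):
    # \D without UCP: anything outside ASCII 0-9 (hypothetical helper).
    return not (0x30 <= cp <= 0x39)


def not_nd(cp):
    # \P{Nd}: anything whose general category is not Nd.
    return unicodedata.category(chr(cp)) != "Nd"


def class_matches(cp, items, negated=False):
    # A class matches if any item matches; [^...] inverts the whole union.
    hit = any(item(cp) for item in items)
    return hit != negated


items = [ascii_not_digit, not_nd]
print(class_matches(CP, items))                # models [\D\P{Nd}]
print(class_matches(CP, items, negated=True))  # models [^\D\P{Nd}]
print(class_matches(CP, [not_nd]))             # models [\P{Nd}]
```

Under these assumptions the three lines print True, False and False: the ASCII-mode \D item alone is enough to put \x{1d7cf} in the positive class, so the negated class cannot contain it.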

NWilson commented 1 week ago

This is a very interesting case!

There's something related here: the behaviour of \P{...} with the /i modifier.

They specifically changed the behaviour of [^\P{...}] in ECMAScript. If you run the regex as /[^\P{...}]/iu versus /[^\P{...}]/iv in JavaScript, you get different results. (Note the change from "/u Unicode" to "/v Next-gen Unicode".)

They discussed this change here: https://github.com/tc39/proposal-regexp-v-flag/issues/30

(Note: there's a lot of confusion and irrelevant chatter on that thread.)

How character classes work

With the /i modifier:

Things like \d and \D behave the same as \p{...} or \P{...}. They are all just shorthands for a set of characters.

zherczeg commented 1 week ago

Which \p{} is affected by case folding? It seemed to me (and I assumed) that classes always contain the other cases of all of their characters. Most properties are scripts or control characters. I suspect some extended script is affected.

NWilson commented 1 week ago

Well, \P{Ll} and \P{Lu} are, maybe others. Whether you do the case fold before or after the inversion implied by \P drastically affects their meaning under /i.
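A small Python sketch of why the order matters for \P{Lu}. This is only an illustration: `case_variants` is a rough stand-in for case folding (single-character upper/lower forms only), and neither function is claimed to be what any particular engine or ECMAScript flag implements.

```python
import unicodedata


def case_variants(ch):
    # Rough stand-in for case folding: the character plus its
    # single-character upper and lower forms (an assumption, not real folding).
    return {v for v in (ch, ch.lower(), ch.upper()) if len(v) == 1}


def is_lu(ch):
    return unicodedata.category(ch) == "Lu"


def fold_then_invert(ch):
    # Close \p{Lu} under case first, then take the complement.
    return not any(is_lu(v) for v in case_variants(ch))


def invert_then_fold(ch):
    # Take the complement of \p{Lu} first, then close it under case.
    return any(not is_lu(v) for v in case_variants(ch))


for ch in "Aa1":
    print(ch, fold_then_invert(ch), invert_then_fold(ch))
```

With this model, "A" and "a" are rejected when folding happens before the inversion and accepted when it happens after, while "1" is accepted either way.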

zherczeg commented 1 week ago

True. In PCRE2, /i does not affect properties. This way we don't need to generate that many databases.

carenas commented 1 week ago

"In PCRE2, /i does not affect properties"

Note that this is no longer the case since #432.

PhilipHazel commented 1 week ago

Perl does this:

Perl v5.40.0

/[\D\P{Nd}]/utf \x{1d7cf} No match

/[^\D\P{Nd}]/utf \x{1d7cf} 0: \x{1d7cf}

Which I think is right - PCRE2 is currently wrong. As the character is greater than 255 and UCP is not set, the bit map set up by \D is not relevant and only \P should count. However, if the first pattern is just [\P{Nd}] there is no match. So there is indeed a bug in PCRE2 and I think it is /[\D\P{Nd}]/utf that is incorrect.

PhilipHazel commented 1 week ago

I'm not sure in #497 which one you think is wrong and have fixed...

zherczeg commented 1 week ago

\D matches anything that is not 0-9, and that includes all characters > 255. Perl uses Unicode for \D, which is why /[\D\P{Nd}]/utf does not match there. [\d] matches \x{1d7cf} in Perl. Perl has no non-UCP mode.
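The same contrast can be seen, by analogy only, in Python's `re` module, which is neither PCRE2 nor Perl: its `re.ASCII` flag restricts \d to 0-9, roughly like PCRE2 without UCP, while the default for str patterns treats \d as Unicode-aware, roughly like Perl here.

```python
import re

ch = "\U0001d7cf"  # MATHEMATICAL BOLD DIGIT ONE

# ASCII-only \d: the character is not 0-9, so \D matches it.
print(bool(re.fullmatch(r"\D", ch, re.ASCII)))

# Unicode-aware \d: the character is a decimal digit, so \D fails...
print(bool(re.fullmatch(r"\D", ch)))

# ...and \d matches it.
print(bool(re.fullmatch(r"\d", ch)))
```

This prints True, False, True, mirroring the difference between the two interpretations of \D discussed above.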

PhilipHazel commented 1 week ago

Ah yes, of course. I was forgetting that. OK, let's merge your patch.