PCRE2Project / pcre2

PCRE2 development is now based here.

The `\X` matcher doesn't catch all symbols #361

Closed alexandre-daubois closed 11 months ago

alexandre-daubois commented 11 months ago

Hi! 👋

We're facing an issue in the Symfony repository in the CI: https://ci.appveyor.com/project/fabpot/symfony/builds/48712798

The problem comes from the Grapheme Cluster polyfill, which is used when the php-intl extension is not available. This polyfill uses the \X matcher of PCRE to get the length of a Unicode string. However, it seems that it doesn't work with symbols. Indeed, for the following sequence:

☢☎❄

The \X matcher returns a length of 1, where 3 is expected. Here's the reproducer: https://3v4l.org/C0UuO. As you can see, the $matches array only returns 1 result containing all three symbols.

PhilipHazel commented 11 months ago

Your reproducer seems to be testing PCRE1 versions - the 8.xx series. PCRE1 is obsolete and no longer maintained. PCRE2 is now nearly 9 years old and is the version maintained in this repository. Current PCRE2 matches your three characters as one cluster. This is the output from pcre2test:

    PCRE2 version 10.42 2022-12-11
    /\X/utf
    ☢☎❄
     0: \x{2622}\x{260e}\x{2744}
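For readers who want to check the escapes in that output, here is a minimal Python sketch (not part of the original thread) that prints the code points of the three symbols in the same `\x{...}` notation. The variable names are only illustrative.

```python
# The three symbols from the report, written as escapes to avoid encoding issues:
# U+2622 (radioactive sign), U+260E (black telephone), U+2744 (snowflake).
symbols = "\u2622\u260e\u2744"

# Format each code point as \x{hex}, the notation seen in the pcre2test output.
escaped = "".join(f"\\x{{{ord(ch):x}}}" for ch in symbols)

print(len(symbols))  # number of code points, not grapheme clusters
print(escaped)
```

The string holds three code points, so a count of 1 from \X means all three were grouped into a single cluster.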

However, I see that it differs from Perl v5.38.1 which matches only the first character (as does PCRE1 so maybe you are using PCRE2 after all). In PCRE2, the documentation in pcre2pattern lists the cluster breaking rules. This seems to be the relevant one:

  6. Do not break within emoji modifier sequences or emoji zwj sequences. That is, do not break between characters with the Extended_Pictographic property. Extend and ZWJ characters are allowed between the characters.

All three of your characters have the extended pictographic property in Unicode 15.0.0, which is what current PCRE2 supports. The rules came from Unicode documentation https://unicode.org/reports/tr29/. PCRE1 does not have rule 6, and indeed, before 8.31 used even simpler rules. In TR29 there is this sentence: "Each emoji sequence is a single grapheme cluster. See definition ED-17 in Unicode Technical Standard #51, "Unicode Emoji" [UAX51]." So it seems to me that PCRE2 is correctly following the rules.
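To illustrate why the rule as quoted above yields one cluster for this input, here is a rough Python sketch. It is a toy, not PCRE2's implementation and not the full TR29 algorithm: it applies only the quoted rule, treats every other boundary as a break, and relies on small hand-written property sets (`EXTENDED_PICTOGRAPHIC` and `EXTEND` are hypothetical subsets chosen for this example).

```python
# Assumption: hand-picked subsets for illustration only. A real implementation
# would take these properties from the Unicode data files.
EXTENDED_PICTOGRAPHIC = {0x2622, 0x260E, 0x2744}
EXTEND = {0xFE0F}  # VARIATION SELECTOR-16
ZWJ = 0x200D       # ZERO WIDTH JOINER


def count_clusters(text: str) -> int:
    """Count clusters using only the quoted rule; all other boundaries break."""
    clusters = 0
    in_pictographic_run = False
    for ch in text:
        cp = ord(ch)
        if cp in EXTEND or cp == ZWJ:
            # Extend and ZWJ attach to the current cluster and do not end a run.
            if clusters == 0:
                clusters = 1  # the text starts with an Extend or ZWJ character
            continue
        if cp in EXTENDED_PICTOGRAPHIC:
            # No break between Extended_Pictographic characters.
            if not in_pictographic_run:
                clusters += 1
            in_pictographic_run = True
        else:
            clusters += 1
            in_pictographic_run = False
    return clusters


print(count_clusters("\u2622\u260e\u2744"))  # the three symbols from the report
print(count_clusters("abc"))
```

Under these assumptions the three symbols form one run of Extended_Pictographic characters and are counted once, while plain letters are counted individually.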

alexandre-daubois commented 11 months ago

Indeed, the problem seems to come from php-intl. Will report the bug, thanks for your explanations 👍

https://onlinephp.io/c/30ad8