Closed alexandre-daubois closed 11 months ago
Your reproducer seems to be testing PCRE1 versions - the 8.xx series. PCRE1 is obsolete and no longer maintained. PCRE2 is now nearly 9 years old and is the version maintained in this repository. Current PCRE2 matches your three characters as one cluster. This is the output from pcre2test:
PCRE2 version 10.42 2022-12-11
/\X/utf
☢☎❄
 0: \x{2622}\x{260e}\x{2744}
However, I see that it differs from Perl v5.38.1 which matches only the first character (as does PCRE1 so maybe you are using PCRE2 after all). In PCRE2, the documentation in pcre2pattern lists the cluster breaking rules. This seems to be the relevant one:
All three of your characters have the extended pictographic property in Unicode 15.0.0, which is what current PCRE2 supports. The rules came from the Unicode documentation at https://unicode.org/reports/tr29/. PCRE1 does not have rule 6 and, before 8.31, used even simpler rules. In TR29 there is this sentence: "Each emoji sequence is a single grapheme cluster. See definition ED-17 in Unicode Technical Standard #51, "Unicode Emoji" [UAX51]." So it seems to me that PCRE2 is correctly following the rules.
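For anyone following along, here is a toy Python sketch of the two results discussed in this thread: one cluster (what pcre2test 10.42 prints) versus three clusters (what the reporter expects). It is not the implementation used by PCRE2, Perl, or php-intl. The function names, the two rules, and the tiny hard-coded `PICTOGRAPHIC` set are assumptions made up for illustration, and the model ignores every other cluster-breaking rule.

```python
# Toy model only: NOT the actual PCRE2, Perl, or ICU implementation.
ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Assumption: only the three characters from this thread are treated as
# Extended_Pictographic; a real implementation reads the Unicode data tables.
PICTOGRAPHIC = {"\u2622", "\u260e", "\u2744"}

SAMPLE = "\u2622\u260e\u2744"  # the three symbols from the report


def code_points(text):
    """Format each character the way pcre2test prints it, e.g. \\x{2622}."""
    return [f"\\x{{{ord(ch):04x}}}" for ch in text]


def clusters_no_break_between_pictographs(text):
    """Hypothetical rule A: never break between adjacent pictographic characters."""
    clusters = []
    for ch in text:
        if clusters and ch in PICTOGRAPHIC and clusters[-1][-1] in PICTOGRAPHIC:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def clusters_joined_by_zwj_only(text):
    """Hypothetical rule B: pictographic characters merge only across a ZWJ."""
    clusters = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        if prev and ch == ZWJ:
            clusters[-1] += ch  # a joiner always attaches to the previous cluster
        elif (ch in PICTOGRAPHIC and len(prev) >= 2
              and prev[-1] == ZWJ and prev[-2] in PICTOGRAPHIC):
            clusters[-1] += ch  # pictograph + ZWJ + pictograph stays together
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    print("".join(code_points(SAMPLE)))                        # \x{2622}\x{260e}\x{2744}
    print(len(clusters_no_break_between_pictographs(SAMPLE)))  # 1
    print(len(clusters_joined_by_zwj_only(SAMPLE)))            # 3
```

Under rule A the sample collapses into a single cluster, matching the pcre2test output above. Under rule B it splits into three, which is the count the reporter expected.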
Indeed the problem seems to come from php-intl. Will report the bug, thanks for your explanations :+1:
Hi! 👋
We're facing an issue in the Symfony repository in the CI: https://ci.appveyor.com/project/fabpot/symfony/builds/48712798
The problem comes from the Grapheme Cluster polyfill used when the php-intl extension is not available. This polyfill uses the \X matcher of PCRE to get the length of a Unicode string. However, it seems that it doesn't work with symbols. Indeed, for the following sequence:
☢☎❄
The \X matcher returns a length of 1, where 3 is expected. Here's the reproducer: https://3v4l.org/C0UuO. As you can see, the $matches array only returns 1 result containing all three symbols.