PCRE2Project / pcre2

PCRE2 development is now based here.

Grapheme cluster (`\X`) selector capturing multiple characters #410

Closed Ayesh closed 3 months ago

Ayesh commented 3 months ago

Using PCRE2 10.43, the `\X` selector seems to match more than one grapheme cluster, as if it does not break before the start of a new grapheme cluster.

Regex: `\X`
Input: 🏳️‍🌈🏴‍☠️ (U+1F3F3 U+FE0F U+200D U+1F308, followed by U+1F3F4 U+200D U+2620 U+FE0F)

When run, \X matches both flag graphemes: Regex101 preview.
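For reference, here is a minimal Python sketch of the segmentation I would expect for this input. It is only an illustration, not PCRE2's implementation: it applies just two rules from Unicode Standard Annex #29 (GB9, no break before Extend or ZWJ, and GB11, no break between an Extended_Pictographic + Extend* + ZWJ sequence and a following Extended_Pictographic), and it hard-codes the character properties for the code points in this report. The names `clusters`, `EXTEND`, and `EXT_PICT` are made up for the example.

```python
ZWJ = 0x200D
# Assumption: properties hard-coded for this input only. A real
# implementation reads them from the Unicode Character Database.
EXTEND = {0xFE0F}
EXT_PICT = {0x1F3F3, 0x1F308, 0x1F3F4, 0x2620}


def clusters(text):
    """Split text into grapheme clusters using only GB9 and GB11."""
    out = []
    current = ""
    pict_run = False        # current ends with ExtPict Extend*
    after_pict_zwj = False  # current ends with ExtPict Extend* ZWJ
    for ch in text:
        cp = ord(ch)
        if cp in EXTEND or cp == ZWJ:
            join = True                 # GB9
        elif cp in EXT_PICT and after_pict_zwj:
            join = True                 # GB11
        else:
            join = False                # otherwise break
        if current and not join:
            out.append(current)
            current = ""
        current += ch
        # Track whether the left-hand side of GB11 is still satisfied.
        if cp in EXT_PICT:
            pict_run, after_pict_zwj = True, False
        elif cp in EXTEND:
            after_pict_zwj = False      # pict_run is carried through Extend*
        elif cp == ZWJ:
            pict_run, after_pict_zwj = False, pict_run
        else:
            pict_run = after_pict_zwj = False
    if current:
        out.append(current)
    return out


text = "\U0001F3F3\uFE0F\u200D\U0001F308\U0001F3F4\u200D\u2620\uFE0F"
parts = clusters(text)
for part in parts:
    print(" ".join(f"U+{ord(c):04X}" for c in part))
```

Under these two rules the input splits into two clusters of four code points each, with the break falling between U+1F308 and U+1F3F4, so I would expect `\X` to match each flag separately.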

Could you kindly shed some light on whether I'm missing something?

Thank you.

PhilipHazel commented 3 months ago

This is a bug, caused by my misreading or misunderstanding one of the rules in Unicode Standard Annex #29, way back when I implemented `\X`. I'm a bit surprised it's taken so long for it to hit anybody. Furthermore, the documentation correctly describes what the code does, but it's not what it's supposed to do! (Somewhere I even noted a difference from Perl, but never investigated.) I hope to have this fixed in HEAD in the next day or two. This is a very timely issue because the 10.44 release will be forthcoming once this fix is done. Thanks for the report.

Ayesh commented 3 months ago

Thank you. I tested with a build that includes commit 067c2f1f5851335d4b6feff8b5c5a566d6f9e669, and it works correctly!