Raku / old-issue-tracker

Tickets from RT
https://github.com/Raku/old-issue-tracker/issues
Emoji sequences with ZERO WIDTH JOINER counted as separate chars when they probably shouldn't #4946

Closed: p6rt closed this issue 7 years ago

p6rt commented 8 years ago

Migrated from rt.perl.org#127048 (status was 'resolved')

Searchable as RT127048$

p6rt commented 8 years ago

From @AlexDaniel

This is a continuation of https://rt.perl.org/Public/Bug/Display.html?id=127047

From http://unicode.org/reports/tr51/#Emoji_ZWJ_Sequences:

“The U+200D ZERO WIDTH JOINER (ZWJ) can be used between the elements of a sequence of characters to indicate that a single glyph should be presented if available.”

“So to the user, these would behave like single emoji characters, even though internally they are sequences.”

It sounds like we shouldn't cut these sequences in half when doing .substr (which in turn means that these should be treated as one grapheme).

There is a chart of possible combinations at http://www.unicode.org/emoji/charts/emoji-zwj-sequences.html, but I think that any sequence containing U+200D ZERO WIDTH JOINER should probably result in one grapheme. As crazy as it sounds…
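
To make the proposal concrete, here is a minimal Python sketch of the "anything joined by U+200D stays together" rule. It is only an illustration: the helper name `zwj_clusters` is made up for this example, it ignores every other grapheme rule (combining marks, variation selectors, and so on), and it is not how Rakudo or MoarVM segment strings.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER


def zwj_clusters(text):
    """Group code points so that anything joined by a ZWJ stays in one cluster.

    Hypothetical helper for illustration only: it handles ZWJ and nothing
    else, so it is not a full grapheme segmentation algorithm.
    """
    clusters = []
    join_next = False  # True when the previous code point was a ZWJ
    for ch in text:
        if clusters and (ch == ZWJ or join_next):
            # A ZWJ, or the code point right after one, extends the
            # current cluster instead of starting a new one.
            clusters[-1] += ch
        else:
            clusters.append(ch)
        join_next = (ch == ZWJ)
    return clusters


# MAN, ZWJ, WOMAN, ZWJ, GIRL: one of the family sequences from the chart
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
text = "a" + family + "b"

print(len(text))                # counts code points
print(len(zwj_clusters(text)))  # counts ZWJ-joined clusters
print(text[:2] == "a" + family) # False: slicing by code point cuts the sequence
```

Counting by code point sees seven items in `text`, while grouping on ZWJ sees three, and a code-point slice such as `text[:2]` ends in the middle of the family sequence, which is the `.substr` problem described above.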

p6rt commented 8 years ago

From @AlexDaniel

It should also be noted that ZERO WIDTH JOINER is used for other purposes too: https://books.google.ee/books?id=wn5sXG8bEAcC&lpg=PA287&ots=J1bym1VbXE&dq=unicode%20%22ZERO%20WIDTH%20JOINER&pg=PA287#v=onepage&q=unicode%20%22ZERO%20WIDTH%20JOINER&f=false

But I'm not sure if it should affect the character count in such cases.

p6rt commented 7 years ago

From @samcv

This was resolved about a month ago by this commit: https://github.com/MoarVM/MoarVM/commit/fa5158a3

p6rt commented 7 years ago

@samcv - Status changed from 'new' to 'resolved'