crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0

Escaping non-printables in grapheme clusters with `String#inspect` #11630

Open straight-shoota opened 2 years ago

straight-shoota commented 2 years ago

This is a follow-up on #11406 which introduced escaping for all non-printable characters in String#inspect.

While that change is an improvement, it has a negative effect on grapheme clusters comprising non-printable characters. Consider a string of code points U+1F468 (Man) U+200D (Zero Width Joiner (ZWJ)) U+1F469 (Woman). They form a grapheme cluster that renders two persons as a single grapheme: 👨‍👩 (for reference, without the ZWJ the two emojis render as separate graphemes: 👨👩).

String#inspect escapes all non-printable characters since #11452. Zero Width Joiner is a non-printable character, so it gets escaped. For the above string, it means the grapheme cluster gets broken. The zero width joiner no longer glues the surrounding characters together:

# Crystal master (since #11452)
"👨‍👩" # => "👨\u200D👩"
# Crystal 1.2.2
"👨‍👩" # => "👨‍👩"

Both formats are technically correct. They describe the same string - a literal character is equivalent to its escape sequence. They are just two different representations which also result in different rendering.
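To illustrate the equivalence, here is a minimal Python sketch. Python is used only because its `repr` also escapes the ZWJ; the variable names are made up for this example. It builds the string from its three code points and checks that the escaped representation parses back to the identical string.

```python
import ast

# Build the string from its code points: Man, Zero Width Joiner, Woman.
s = "\U0001F468" + "\u200D" + "\U0001F469"

code_points = [f"U+{ord(ch):04X}" for ch in s]
print(code_points)  # ['U+1F468', 'U+200D', 'U+1F469']

# repr() escapes the non-printable ZWJ but keeps the two emojis literal.
escaped = repr(s)
print(escaped)

# Parsing the escaped literal yields the very same string.
round_tripped = ast.literal_eval(escaped)
print(round_tripped == s)  # True
```

Only the rendering differs between the two forms; the code point sequence is unchanged.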

I believe the intuitive expectation is that grapheme clusters should not break apart. The reasoning for escaping non-printable characters is to avoid having them go unnoticed because they are not visible. That does not apply when they are part of a larger grapheme cluster, because there they typically have a visible effect. So we should only escape non-printable characters that stand alone.

That's assuming the employed text renderer supports the respective grapheme cluster, which is impossible to detect or infer. But it's probably okay to assume grapheme cluster support? 🤔 A problem with that is that some grapheme clusters don't actually have a visual representation. For example, two consecutive Zero Width Joiners are considered a grapheme cluster. Escaping that seems like a good idea:

# Crystal master (since #11452)
"‍‍" # => "\u200D\u200D"
# Crystal 1.2.2
"‍‍" # => "‍‍"

A grapheme cluster consisting of only non-printable code points would be relatively easy to detect. Zero Width Joiner also attaches to most other code points to form a grapheme cluster, even if it has no meaning or effect there. That, too, should be relatively easy to detect. But realistically I expect other problematic combinations of code points. It's really a complex matter.
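A rough sketch of the simplest rule (escape a cluster only when none of its code points is printable) could look like the following. It is written in Python for illustration and rests on several assumptions: the input is already split into grapheme clusters by some UAX #29 segmenter (not part of Python's standard library), `str.isprintable()` stands in for the printable check and may not match Crystal's definition, and the names `escape_code_point` and `inspect_clusters` are hypothetical. Quotes and backslashes are not escaped here.

```python
from typing import List


def escape_code_point(ch: str) -> str:
    """Return a \\uXXXX or \\UXXXXXXXX escape for a single code point."""
    cp = ord(ch)
    return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"


def inspect_clusters(clusters: List[str]) -> str:
    """Escape a cluster only if it contains no printable code point.

    `clusters` is assumed to be pre-segmented grapheme clusters.
    """
    parts = []
    for cluster in clusters:
        if any(ch.isprintable() for ch in cluster):
            # At least one visible code point: keep the cluster intact.
            parts.append(cluster)
        else:
            # Nothing visible: escape every code point in the cluster.
            parts.append("".join(escape_code_point(ch) for ch in cluster))
    return '"' + "".join(parts) + '"'


man_zwj_woman = "\U0001F468\u200D\U0001F469"
print(inspect_clusters([man_zwj_woman]))   # cluster kept intact
print(inspect_clusters(["\u200D\u200D"]))  # "\u200D\u200D"
print(inspect_clusters(["a\u200D"]))       # ZWJ after "a" stays unescaped
```

The last call shows the weakness of this rule: a ZWJ attached to an ordinary letter has no visible effect, yet it goes unescaped. A stricter rule would need to handle that case too.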

So I suppose we have the option to prefer either readability or sanity with regard to non-printable characters in grapheme clusters. We could also try to find a middle ground that draws a more precise line between the two. Not sure how far we can get with that.

String#dump is an alternative for escaping all non-ASCII characters if you need that.
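For comparison only, Python's closest analogue is the built-in `ascii()`, which escapes every non-ASCII code point. This is not Crystal's `String#dump`, and the escape syntax differs, but the idea is the same:

```python
# Man, Zero Width Joiner, Woman
s = "\U0001F468\u200D\U0001F469"

# ascii() behaves like repr() but escapes all non-ASCII characters.
dumped = ascii(s)
print(dumped)  # '\U0001f468\u200d\U0001f469'
```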

We could consider adding a configuration option that determines the handling of grapheme clusters. But I doubt that would be very useful, and I would definitely defer it to a future enhancement discussion. We should find a good default behaviour first.

A similar challenge exists with formatting literals (#11478). A change to escape non-printable characters has been reverted for now because of missing grapheme support (#11603).

straight-shoota commented 2 years ago

The peers seem to be undecided.

Ruby:

"👨‍👩".inspect # => "👨‍👩"
"‍‍".inspect    # => "‍‍"

Python:

repr('👨‍👩') # => '👨\u200d👩'
repr('‍‍')    # => '\u200d\u200d'

Swift:

dump("👨‍👩") // => "👨‍👩"
dump("‍‍")    // => "‍‍"

(dump seems to escape only ASCII control characters, but I'm not very familiar with Swift. Any hints are appreciated.)

Julia:

repr("👨‍👩") # => "👨\u200d👩"
repr("‍‍")    # => "\u200d\u200d"

asterite commented 2 years ago

Let Elixir break the tie :-)

HertzDevil commented 2 years ago

They form a grapheme cluster that renders two persons as a single grapheme

This is misleading. A grapheme cluster may or may not be represented by a single glyph, and the purpose of text segmentation is not to determine glyph boundaries. For emojis, whether a single glyph or multiple glyphs should be rendered is determined by the list of emoji ZWJ sequences, and since 1F468 200D 1F469 is not one of them, that string is expected to be displayed as two glyphs regardless of whether the ZWJ is present. In contrast, 1F468 200D 1F469 200D 1F466 👨‍👩‍👦 should be rendered as one glyph.

straight-shoota commented 2 years ago

@asterite Okay, here's Elixir:

inspect "👨‍👩" # => "👨‍👩"
inspect "‍‍"    # => "‍‍"

straight-shoota commented 2 years ago

@HertzDevil Thanks for pointing that out. However, the definition of ZWJ sequences is independent of the grapheme cluster algorithm. And the form of visualization is ultimately driven by the text rendering engine. Most implementations I've seen show 1F468 200D 1F469 as a single glyph.