alvinlindstam / grapheme

A python package for grapheme aware string handling
MIT License

Dealing with NULL character #15

Closed worldmaker18349276 closed 3 years ago

worldmaker18349276 commented 3 years ago

The NULL character chr(0) is not a grapheme and does not separate a combining character from its base when printed, but grapheme.graphemes splits them:

>>> print("A\0B")
AB
>>> list(grapheme.graphemes("A\0B"))
['A', '\x00', 'B']
>>> print("A\0\u0300B")
ÀB
>>> list(grapheme.graphemes("A\0\u0300B"))
['A', '\x00', '̀', 'B']
alvinlindstam commented 3 years ago

Hi

I'm not sure what you mean by "NULL character chr(0) is not a grapheme". As I understand it, characters (Unicode scalar values, aka code points) are not graphemes themselves, but may form grapheme clusters based on their relation to the surrounding characters/code points.

From what I can gather, from a grapheme property point of view U+0000 belongs to the Control group of the [Grapheme_Cluster_Break property](https://unicode.org/reports/tr29/#Grapheme_Cluster_Break_Property_Values), and is to be treated like any other control character for grapheme purposes according to Annex 29. See https://www.unicode.org/Public/UCD/latest/ucd/auxiliary/GraphemeBreakProperty.txt for the source:

0000..0009    ; Control # Cc  [10] <control-0000>..<control-0009>
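The `Cc` in that data line is the code point's General_Category. Python's standard library does not expose Grapheme_Cluster_Break itself, so the following is only a rough cross-check of the category, not of the break property. The `categories` name is made up for this illustration.

```python
import unicodedata

# General_Category for a few code points in the 0000..0009 range quoted above.
# Note: this is General_Category, not Grapheme_Cluster_Break, which the
# standard library does not provide.
categories = {
    f"U+{ord(ch):04X}": unicodedata.category(ch)
    for ch in ("\0", "\b", "\t")
}

for code_point, category in categories.items():
    print(code_point, category)
```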

Many platforms handle that character in a special manner though. In HTML, it's completely stripped. In UNIX, it may represent the end of the text.

In a Python REPL, printing the string "A\0\u0300B" renders the combining mark on the A, as you said. I think that is because Python, or the REPL, strips the null character from the rendered text completely to protect against those "end of text" scenarios, so the string actually rendered is "A\u0300B". This can be seen by copy-pasting the printed string and inspecting its contents:

[Screenshot, 2021-01-07 23:11:30]
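The same inspection can be approximated without copy-pasting. This is a minimal sketch that assumes the rendering layer simply drops U+0000, which is a guess about the terminal rather than something Python guarantees. The variable names are illustrative.

```python
import unicodedata

original = "A\0\u0300B"

# Assumption: the rendering layer drops the null character before display.
stripped = original.replace("\0", "")

# With the null gone, the mark directly follows the A and can compose with it.
composed = unicodedata.normalize("NFC", stripped)

print([f"U+{ord(c):04X}" for c in original])  # four code points, null included
print([f"U+{ord(c):04X}" for c in stripped])  # three code points
print([f"U+{ord(c):04X}" for c in composed])  # two code points after NFC
```

If the assumption holds, the rendered text and the string handed to the library are different strings, which would explain why the two disagree about where the mark belongs.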

In my browser terminal, the presence of the null character also breaks the connection to the A:

[Screenshot, 2021-01-07 23:27:02]

Given that I intend this library to be an implementation of the default grapheme clusters as defined by Annex 29, I'm not sure I'd want to introduce any special casing for that code point. I think it may commonly mess up whatever this library is used for in terminal or web scenarios, though, since it is often stripped automatically in those contexts before rendering.

Do you have ideas for how to handle it (API-wise) without breaking spec compatibility by default?

worldmaker18349276 commented 3 years ago

I think the control characters are processed by the terminal, because a bash script produces the same result. I also found that other control characters (such as \b, \t, \r, \n, \v, \f) have the same problem:

>>> print("A\u0300B") # 'A\u0300' 'B'
ÀB
>>> print("AB\b\u0300") # 'A\u0300' 'B'
ÀB
>>> print("中\u0300文") # '中\u0300' '文'
中̀文
>>> print("中文\u0300") # '中' '文\u0300'
中文̀
>>> print("中文\b\u0300") # '中' '文\u0300'
中文̀
>>> print("中文\b\b\u0300") # '中\u0300' '文'
中̀文
>>> print("A\0\u0300") # 'A\u0300'
À
>>> print("A\u200b\u0300") # 'A\u200b\u0300', but looks like À in terminal (it rendered like 'A\u0300' although there is a control character in the middle)
A​̀
>>> print(" \u0300") # ' \u0300', the position of the marker is further left than À in the terminal
 ̀
>>> print(" \u200b\u0300") # ' \u200b\u0300', but looks like ' \u0300' in terminal (it isn't an overlap between ' ' and '\u0300', otherwise the marker position would be the same as 'A\u200b\u0300')
 ​̀
>>> print("A\b\u0300") # 'A'
A

The terminal seems to process those control characters according to their visual position rather than by code points or graphemes, and tries to combine them graphically. I tested them in GNOME Terminal on Ubuntu 16.04, but I think other terminals work the same way.
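The position-based behaviour described above can be mimicked with a toy model. This is only a sketch under stated assumptions, not how any real terminal is implemented: a backspace moves the cursor one column left, a combining mark attaches to whatever occupies the column left of the cursor, other control and format characters take no column, and East Asian wide characters take two columns. `render_cells` and `CONTINUATION` are hypothetical names.

```python
import unicodedata

CONTINUATION = None  # marks the second column of a double-width character


def render_cells(text):
    """Toy terminal line: return the contents of each occupied cell."""
    cells = []
    cursor = 0
    for ch in text:
        if ch == "\b":
            cursor = max(0, cursor - 1)
        elif unicodedata.combining(ch):
            # Attach to the character owning the column left of the cursor.
            col = cursor - 1
            while col >= 0 and cells[col] is CONTINUATION:
                col -= 1
            if col >= 0:
                cells[col] += ch
            # With nothing to the left, the mark is dropped.
        elif unicodedata.category(ch) in ("Cc", "Cf"):
            continue  # assumption: no column, no cursor movement
        else:
            wide = unicodedata.east_asian_width(ch) in ("W", "F")
            width = 2 if wide else 1
            for offset in range(width):
                value = ch if offset == 0 else CONTINUATION
                if cursor + offset < len(cells):
                    cells[cursor + offset] = value  # overwrite in place
                else:
                    cells.append(value)
            cursor += width
    return [cell for cell in cells if cell is not CONTINUATION]


print(render_cells("AB\b\u0300"))
print(render_cells("中文\b\b\u0300"))
print(render_cells("A\b\u0300"))
```

Under these assumptions the model groups the examples above the same way the comments next to them do, for instance attaching the mark to the A in "AB\b\u0300" and dropping it in "A\b\u0300".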

worldmaker18349276 commented 3 years ago

I'm looking for a way to manually handle such position control characters when printing to the terminal, but the way terminals process graphemes seems to differ from the standard specification. I misunderstood the goal of this API: the problem is caused by the implementation of the terminal emulator and has nothing to do with the implementation of this package. The best way to solve the positioning problem in the terminal is to use curses; otherwise you need to process the characters manually after decomposition, which requires information about grapheme width. That is why I opened another issue.