akinomyoga / ble.sh

Bash Line Editor: a line editor written in pure Bash with syntax highlighting, auto suggestions, vim modes, etc. for Bash interactive sessions.
BSD 3-Clause "New" or "Revised" License
2.36k stars 77 forks

Grapheme clusters and emoji sequences #117

Open ghost opened 3 years ago

ghost commented 3 years ago

ble version: 0.4.0-devel3+301d40f
Bash version: 5.1.8(1)-release (x86_64-pc-linux-gnu)
Emoji font: ttf-twemoji 13.0.1-1

This issue is a bit different depending on the terminal and font patches, but I'll try to explain it; unfortunately, I couldn't get my recordings attached to this post.

After an emoji is typed (or pasted), typing more characters (or backspacing) causes the first character of the current word to be reprinted and the typed character to be swapped with the previous one. For instance, if one types hello world 🎃, moves the cursor to the r and types a, the w is reprinted and the line shows hello wwoarld 🎃. Typing at the beginning of a word treats the previous space-separated field as its word, meaning that if one types , before the w in hello world 🎃 (← 3 spaces between hello and world), it shows hhello , world 🎃.

The reprinted character is not editable; it is only printed. As said before, typing or backspacing while the emoji is present keeps causing the behavior, but once the emoji is deleted the issue no longer occurs. If the statement with the emoji is executed and is now in ble's history, the issue occurs again when the autocompletion for that statement shows up.

Also, as soon as another word is detected after the word with the emoji, the issue disappears. So if one types 🎃 bye, the issue disappears after typing the b: one sees 🎃🎃🎃 bye and can keep typing normally. A single quote after the emoji (🎃') does not make a following space count as the start of a new word. A double quote after the emoji (🎃") makes the problem stop until the closing double quote is typed. If the emoji is preceded by an opening double quote, however, moving the cursor also causes the reprinting, so if one types echo "🎃 and then moves the cursor backwards, one sees echo """"""""🎃, where the echo "" is only printed and the actual characters were overwritten with """""

Now, this is the behavior with most emojis, but with some like ♠️♥️♦️♣️♟️, the typed character gets moved forward and the preceding part of the word gets printed behind it (and the cursor gets moved too). So from hello world ♟️, typing a after the r results in helloworlad ♟️, and so on. This doesn't happen with their ♠♥♦♣♟ counterparts.

Finally, flag emojis (e.g. 🇧🇬) don't have this problem in most terminals I tested, which is interesting since the 2 emojis that compose a flag emoji (e.g. 🇧 🇬) do have the problem individually.

I'll leave the following outputs from cat -A <<< "[EMOJI]" and bat -A <<< "[EMOJI]":

🎃 M-pM-^_M-^NM-^C$ \u{1f383}␊
🙄 M-pM-^_M-^YM-^D$ \u{1f644}␊
😱 M-pM-^_M-^XM-1$ \u{1f631}␊
👻 M-pM-^_M-^QM-;$ \u{1f47b}␊
♠️ M-bM-^YM- M-oM-8M-^O$ \u{2660}\u{fe0f}␊
♥️ M-bM-^YM-%M-oM-8M-^O$ \u{2665}\u{fe0f}␊
♦️ M-bM-^YM-&M-oM-8M-^O$ \u{2666}\u{fe0f}␊
♣️ M-bM-^YM-#M-oM-8M-^O$ \u{2663}\u{fe0f}␊
♟️ M-bM-^YM-^_M-oM-8M-^O$ \u{265f}\u{fe0f}␊
♠ M-bM-^YM- $ \u{2660}␊
♥ M-bM-^YM-%$ \u{2665}␊
♦ M-bM-^YM-&$ \u{2666}␊
♣ M-bM-^YM-#$ \u{2663}␊
♟ M-bM-^YM-^_$ \u{265f}␊
🇧🇬 M-pM-^_M-^GM-'M-pM-^_M-^GM-,$ \u{1f1e7}\u{1f1ec}␊
🇧 M-pM-^_M-^GM-'$ \u{1f1e7}␊
🇬 M-pM-^_M-^GM-,$ \u{1f1ec}␊
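For readers unfamiliar with the cat -A notation, here is a minimal sketch of how a code point maps to the M-… strings in that dump. The function name cp_to_cat_A is hypothetical (it is not part of ble.sh, cat, or bat), and it takes the code point as an integer so that it does not depend on the locale.

```bash
# Hypothetical helper (not part of ble.sh, cat, or bat): encode one code
# point to UTF-8 and print the bytes the way `cat -A` displays them.
# Limitation: a line feed is printed as ^J here, whereas cat -A shows "$".
cp_to_cat_A() {
  local cp=$(( $1 )) b c ch out=
  local -a bytes
  if (( cp < 0x80 )); then
    bytes=( "$cp" )
  elif (( cp < 0x800 )); then
    bytes=( $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) )) )
  elif (( cp < 0x10000 )); then
    bytes=( $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) ))
            $(( 0x80 | (cp & 0x3F) )) )
  else
    bytes=( $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) ))
            $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) )) )
  fi
  for b in "${bytes[@]}"; do
    c=$b
    # Bytes with the high bit set are shown as "M-" plus the low 7 bits.
    if (( b >= 0x80 )); then out+='M-'; c=$(( b - 0x80 )); fi
    if (( c < 0x20 )); then
      # Control range: "^" plus the character 0x40 above it.
      printf -v ch '%b' "\\0$(printf '%03o' $(( c + 0x40 )))"
      out+="^$ch"
    elif (( c == 0x7F )); then
      out+='^?'
    else
      printf -v ch '%b' "\\0$(printf '%03o' "$c")"
      out+=$ch
    fi
  done
  printf '%s\n' "$out"
}

cp_to_cat_A 0x1F383   # M-pM-^_M-^NM-^C
cp_to_cat_A 0xFE0F    # M-oM-8M-^O
```

The second call shows why the sequences with a variation selector carry the extra M-oM-8M-^O suffix: it is U+FE0F encoded as the three bytes EF B8 8F.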

akinomyoga commented 3 years ago

This issue is a bit different depending on the terminal and font patches,

Yes. It depends on the terminal and its settings. Also, ble.sh doesn't support grapheme clusters.

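I have several questions:

* **Q1**: When you input the emoji in `ble.sh`, is the character shape correctly printed on the terminal?
* **Q2**: Does the emoji occupy two cells of the terminal?
* **Q3**: What is your terminal?
* **Q4**: What is the output of the following commands?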

$ bleopt emoji_@ char_width_mode
$ declare -p _ble_util_c2w_auto_width
$ ble/util/s2chars 🎃
$ echo "${ret[*]}"
$ for c in "${ret[@]}"; do ble/util/c2w "$c"; echo "w=$ret"; done

♠️ M-bM-^YM- M-oM-8M-^O$ \u{2660}\u{fe0f}␊
♥️ M-bM-^YM-%M-oM-8M-^O$ \u{2665}\u{fe0f}␊
♦️ M-bM-^YM-&M-oM-8M-^O$ \u{2666}\u{fe0f}␊
♣️ M-bM-^YM-#M-oM-8M-^O$ \u{2663}\u{fe0f}␊
♟️ M-bM-^YM-^_M-oM-8M-^O$ \u{265f}\u{fe0f}␊
🇧🇬 M-pM-^_M-^GM-'M-pM-^_M-^GM-,$ \u{1f1e7}\u{1f1ec}␊

ble.sh currently doesn't support these grapheme clusters and emoji sequences because it's technically involved. Even if 🇧🇬 seemed to work, I think it still causes problems with line wrap. Maybe I will try to support variation selectors by setting their width to 0, but for now I'm not sure whether that would cause other problems.
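To illustrate the "width 0 for variation selectors" idea, here is a minimal sketch that sums cell widths over a list of code points. It is not ble.sh's implementation: seq_width is a hypothetical name, and the range used to decide what counts as an emoji is a crude assumption made only for this example.

```bash
# Hypothetical helper, not ble.sh's ble/util/c2w: sum the cell widths of a
# sequence of code points given as integers.
#   $1     width to use for emoji (plays the role of the emoji_width option)
#   $2...  code points
seq_width() {
  local emoji_width=$1 cp total=0
  shift
  for cp in "$@"; do
    if (( cp >= 0xFE00 && cp <= 0xFE0F )); then
      :                                    # variation selector: 0 cells
    elif (( cp >= 0x1F000 && cp <= 0x1FAFF )); then
      total=$(( total + emoji_width ))     # crude emoji test (assumption)
    else
      total=$(( total + 1 ))
    fi
  done
  printf '%s\n' "$total"
}

seq_width 2 0x1F383          # 2
seq_width 2 0x2660 0xFE0F    # 1
```

Under this rule the sequence U+2660 U+FE0F is counted as one cell, which would only agree with terminals that draw it in one cell; terminals that widen the glyph to two cells would still disagree with the computed cursor position.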

ghost commented 3 years ago
* **Q3**: What is your terminal?

I'm testing in Gnome-terminal, Konsole, Terminator, Alacritty and Kitty

* **Q1**: When you input the emoji in `ble.sh`, is the character shape correctly printed on the terminal?

I assume you're not exactly asking if the font looks like an emoji, but I have the CBDT/CBLC ttf-twemoji font, which renders most emojis in terminals correctly out of the box with ble detached (only Kitty prints flag emojis correctly, but maybe it's my configuration). Now, when ble is attached and the emoji is pasted, in Konsole, Kitty and Alacritty the character shape is correctly printed after pasting and after typing. In Gnome-Terminal and Terminator, the shape is also printed correctly if it's not preceded by a single quote ' or double quote "; otherwise the character shape is correctly printed after pasting but disappears after typing (so in echo "🙄", the emoji disappears at the closing "). When a statement like echo 🙄 is executed, the emoji shape is always correctly printed in stdout.

The character shapes of the ♠️♥️♦️♣️♟️ emojis are also correctly printed in Konsole, Kitty and Alacritty, but in Gnome-Terminal and Terminator they don't appear after pasting and after typing if preceded by ' or ". Flag emojis, either composed 🇧🇬 or separated 🇧 🇬, are always correctly shaped in all terminals (but as said, the font is only rendered correctly in Kitty).

* **Q2**: Does the emoji occupy two cells of the terminal?

In Konsole, Alacritty and Kitty yes; in Gnome-Terminal and Terminator no.

* **Q4**: What is the output of the following commands?
$ bleopt emoji_@ char_width_mode
bleopt emoji_version=13.1
bleopt emoji_width=1
bleopt char_width_mode=auto
$ declare -p _ble_util_c2w_auto_width
declare -- _ble_util_c2w_auto_width="1"
$ ble/util/s2chars 🎃
$ echo "${ret[*]}"
127875
$ for c in "${ret[@]}"; do ble/util/c2w "$c"; echo "w=$ret"; done
w=1

Same output in all terminals; when doing it with ble/util/s2chars 🇧🇬, there are 2 characters of course, so 127463 127468 and w=1 w=1 are the outputs of the other commands.
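The numbers printed by echo "${ret[*]}" are decimal code points. As a small aside, this sketch converts them to the U+XXXX notation that matches the \u{…} values in the bat output earlier in the thread. cp_hex is a hypothetical helper name, not a ble.sh function.

```bash
# Hypothetical helper: print decimal code points in U+XXXX notation.
cp_hex() {
  printf 'U+%04X\n' "$@"
}

cp_hex 127875 127463 127468
# U+1F383
# U+1F1E7
# U+1F1EC
```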

ble.sh currently doesn't support these grapheme clusters and emoji sequences because it's technically involved. Even if 🇧🇬 seemed to work, I think it still causes problems with line wrap.

Line wrapping problems with flag emojis only occur in Konsole, but not in the other terminals because Konsole is the only terminal with the reprinting problem with flag emojis. Btw, line wrapping problems also don't occur with ♠♥♦♣♟; they only appear along with the reprinting problem.

akinomyoga commented 3 years ago

OK! Thank you for your answers!

* **Q2**: Does the emoji occupy two cells of the terminal?

In Konsole, Alacritty and Kitty yes; in Gnome-Terminal and Terminator no.

Does it mean an emoji occupies one cell in GNOME Terminal and Terminator? If so, you need to set bleopt emoji_width=2 in Konsole, Alacritty and Kitty, and bleopt emoji_width=1 in GNOME Terminal and Terminator.

Edit: I've tried GNOME Terminal and Terminator, but they also behave as with bleopt emoji_width=2. I think we should always use emoji_width=2 for terminals with emoji support.

* **Q1**: When you input the emoji in `ble.sh`, is the character shape correctly printed on the terminal?

I assume you're not exactly asking if the font looks like an emoji,

Ah, yes. I actually wanted to confirm that ble.sh receives the emoji correctly. If ble.sh fails to decode emoji in the user input, it will insert different characters in the command line string and print the different characters to the terminal. From your description and the output of the commands you provided, I think ble.sh correctly receives the emoji characters. So the problem is solely in the cursor position calculation of the output phase.

* **Q4**: What is the output of the following commands?
$ bleopt emoji_@ char_width_mode
bleopt emoji_version=13.1
bleopt emoji_width=1
bleopt char_width_mode=auto

The outputs are the expected ones except for emoji_width. As I mentioned above, you need to set emoji_width to the value corresponding to the terminal behavior.

Optionally, you may set emoji_version=13.0 since you seem to use ttf-twemoji 13.0.1-1 which is a font based on "Unicode 13.0 Emoji". Or, you may update the font to 13.1. It seems twemoji 13.1 was released just two days ago. Of course, the terminals you use also need to support 13.0 or 13.1.
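Put together, the settings discussed in this comment could look like the following blerc fragment. The option names and values are the ones mentioned in this thread; the surrounding guard is only an assumption added so that the fragment does nothing when sourced outside ble.sh.

```bash
# ~/.blerc (sketch): values taken from the discussion in this thread.
# The guard is an assumption added here so that the fragment is harmless
# when ble.sh (and therefore bleopt) is not loaded.
if type bleopt &>/dev/null; then
  bleopt emoji_width=2        # the terminal draws emoji such as U+1F383 in two cells
  bleopt emoji_version=13.0   # match the emoji version of the font and terminal
fi
```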

ble.sh currently doesn't support these grapheme clusters and emoji sequences because it's technically involved. Even if 🇧🇬 seemed to work, I think it still causes problems with line wrap.

Line wrapping problems with flag emojis only occur in Konsole, but not in the other terminals because Konsole is the only terminal with the reprinting problem with flag emojis. Btw, line wrapping problems also don't occur with ♠♥♦♣♟; they only appear along with the reprinting problem.

Hmm, OK. I think it is also related to the terminal behavior. ble.sh doesn't recognize any grapheme clusters composed of multiple Unicode code points, so when the flag emoji is printed at the last column of the terminal, the two constituent code points XY may be placed differently in the internal ble.sh logic and in the actual terminal.

(A) Assumption in ble.sh (which treats X and Y as independent characters)
+--------------------+
|                   X|
|Y                   |
+--------------------+

(B) Actual terminal that treats XY as a grapheme cluster in the layout phase
+--------------------+
|                    |
|XY                  |
+--------------------+

But some terminals may behave as (A) in the layout phase and only resolve emojis in the rendering phase. In that case, the problem doesn't occur since the behavior matches ble.sh's assumption.
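The difference between (A) and (B) can be reproduced with a toy wrapping model. This is only a sketch under simplifying assumptions (a unit that does not fit in the remainder of the row moves to the next row); layout is a hypothetical function and not code from ble.sh or any terminal.

```bash
# Hypothetical model of line wrapping (not ble.sh or terminal code).
#   $1    number of columns
#   $2    starting column (0-based) on row 0
#   $3... width in cells of each unit to place, in order
# Prints "row,col" for the position where each unit starts.
layout() {
  local cols=$1 x=$2 y=0 w
  shift 2
  for w in "$@"; do
    # A unit that does not fit in the rest of the row goes to the next row.
    if (( x + w > cols )); then
      y=$(( y + 1 ))
      x=0
    fi
    printf '%d,%d\n' "$y" "$x"
    x=$(( x + w ))
  done
}

layout 20 19 1 1   # (A) X and Y placed as two independent 1-cell characters
layout 20 19 2     # (B) XY placed as a single 2-cell cluster
```

In case (A) the first unit starts at 0,19 and the second at 1,0, while in case (B) the single unit starts at 1,0. The line editor and the terminal therefore disagree about where X is, which is the mismatch described above.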

ghost commented 3 years ago
* **Q2**: Does the emoji occupy two cells of the terminal?

In Konsole, Alacritty and Kitty yes; in Gnome-Terminal and Terminator no.

I rechecked, and actually most emojis like 🎃 are 2 cells long in all terminals. I was trying with the ♠️♥️♦️♣️♟️ 🇧 🇬 emojis, and those are the ones that are 2 cells long in Konsole, Alacritty and Kitty and 1 cell long in Gnome-Terminal and Terminator.

If so, you need to set bleopt emoji_width=2 in Konsole, Alacritty and Kitty, and bleopt emoji_width=1 in GNOME Terminal and Terminator.

Yeah!! That solved the reprinting problem with most emojis, thanks!! I forgot about it in my blerc. The issues that still persist are the reprinting and line wrapping of grapheme clusters ♠️♥️♦️♣️♟️ (and 🇧 🇬 🇧🇬 in Konsole), and the disappearance of emojis after quotes with Gnome-Terminal and Terminator.

Optionally, you may set emoji_version=13.0 since you seem to use ttf-twemoji 13.0.1-1 which is a font based on "Unicode 13.0 Emoji". Or, you may update the font to 13.1. It seems twemoji 13.1 was released just two days ago. Of course, the terminals you use also need to support 13.0 or 13.1.

Oh thanks for the suggestion, but changing the version didn't seem to solve anything itself. I'll keep it up to date in any case.

Hmm, OK. I think it is also related to the terminal behavior. ble.sh doesn't recognize any grapheme clusters composed of multiple Unicode code points, so when the flag emoji is printed at the last column of the terminal, the two constituent code points XY may be placed differently in the internal ble.sh logic and in the actual terminal.

(A) Assumption in ble.sh (which treats X and Y as independent characters)
+--------------------+
|                   X|
|Y                   |
+--------------------+

(B) Actual terminal that treats XY as a grapheme cluster in the layout phase
+--------------------+
|                    |
|XY                  |
+--------------------+

Oh, maybe I was referring to line wrapping of the autocompletion. When an emoji has the reprinting problem and the autosuggestion exceeds the last column, it reprints it below the current line and messes up the cursor position as well, something like

 +--------------------+
 |🎃aaaaaaaaaaaaaaaaaa|
 |a  ▮                |
 |a                   |
 +--------------------+

As bleopt emoji_width=2 solves the reprinting problem for most emojis, it doesn't happen anymore for those, just for the grapheme clusters. As for the example you mentioned, it is indeed what happens for most terminals, just Kitty does the line wrapping like this:

+--------------------+
|                    |
|X                   |
|Y                   |
+--------------------+

Thanks again

akinomyoga commented 3 years ago

I rechecked, and actually most emojis like 🎃 are 2 cells long in all terminals. I was trying with the ♠️♥️♦️♣️♟️ 🇧 🇬 emojis, and those are the ones that are 2 cells long in Konsole, Alacritty and Kitty and 1 cell long in Gnome-Terminal and Terminator.

Yeah, the treatment of grapheme clusters and their components is the area where the behavior of terminals and applications differs the most. The different levels of conformance to the Unicode standard come from the technical difficulty of implementing the full Unicode specification.

The issues that still persist are the reprinting and line wrapping of grapheme clusters ♠️♥️♦️♣️♟️ (and 🇧 🇬 🇧🇬 in Konsole), and the disappearance of emojis after quotes with Gnome-Terminal and Terminator.

Well, they are all related to the grapheme clusters that ble.sh doesn't support.

Oh, maybe I was referring to line wrapping of the autocompletion. When an emoji has the reprinting problem and the autosuggestion exceeds the last column, it reprints it below the current line and messes up the cursor position as well, something like

 +--------------------+
 |🎃aaaaaaaaaaaaaaaaaa|
 |a  ▮                |
 |a                   |
 +--------------------+

As bleopt emoji_width=2 solves the reprinting problem for most emojis, it doesn't happen anymore for those, just for the grapheme clusters. As for the example you mentioned, it is indeed what happens for most terminals, just Kitty does the line wrapping like this:

+--------------------+
|                    |
|X                   |
|Y                   |
+--------------------+

Hmm, I think that is kitty's glitch. Maybe I can support grapheme clusters someday, but I will never support kitty's behavior...

I also checked the behavior of other shells' line editors. It seems that readline recognizes the grapheme clusters and works well in GNOME Terminal (but not in kitty). Zsh avoids handling the grapheme clusters directly but instead shows an ASCII representation of variation selector as <fe0f>. Fish 2.7.1 doesn't work at all in my environment both in kitty and GNOME terminal. I also tried set fish_emoji_width 2 but it didn't change anything. I found this issue https://github.com/fish-shell/fish-shell/issues/5583, so it's just because my fish (in Ubuntu 18 LTS) is too old.

ghost commented 3 years ago

The issues that still persist are the reprinting and line wrapping of grapheme clusters ♠️♥️♦️♣️♟️ (and 🇧 🇬 🇧🇬 in Konsole), and the disappearance of emojis after quotes with Gnome-Terminal and Terminator.

Well, they are all related to the grapheme clusters that ble.sh doesn't support.

Even all emojis inside quotes disappearing? I also found that when that happens, if an autosuggestion appears inside those quotes, the emoji reappears. But well, if it's as you say, there's not much to do.

I also checked the behavior of other shells' line editors. It seems that readline recognizes the grapheme clusters and works well in GNOME Terminal (but not in kitty). Zsh avoids handling the grapheme clusters directly but instead shows an ASCII representation of variation selector as <fe0f>. Fish 2.7.1 doesn't work at all in my environment both in kitty and GNOME terminal. I also tried set fish_emoji_width 2 but it didn't change anything. I found this issue fish-shell/fish-shell#5583, so it's just because my fish (in Ubuntu 18 LTS) is too old.

I did notice that <fe0f> in zsh inside Konsole, Alacritty and Terminator but not in my GNOME Terminal; there zsh actually renders it correctly, but the following character is a bit buggy. In fish I noticed grapheme clusters seem to mess up fish's fish_right_prompt function. It seems that in those shells Konsole is the most buggy terminal: it causes reprinting issues in zsh similar to what I described previously, and reprints new lines of the prompt in fish. If I remember correctly, Konsole got emoji support not so long ago, and font rendering is not perfect: it doesn't show my twemoji font and instead prints some other font. Just something to keep in mind if support ever comes in ble.

akinomyoga commented 3 years ago

The issues that still persist are the reprinting and line wrapping of grapheme clusters ♠️♥️♦️♣️♟️ (and 🇧 🇬 🇧🇬 in Konsole), and the disappearance of emojis after quotes with Gnome-Terminal and Terminator.

Well, they are all related to the grapheme clusters that ble.sh doesn't support.

Even all emojis inside quotes disappearing? I also found that when that happens, if an autosuggestion appears inside those quotes, the emoji reappears. But well, if it's as you say, there's not much to do.

OK. Actually, I cannot reproduce this behavior in my GNOME Terminal. What is the version of your GNOME Terminal? Maybe I'll also try Terminator later.

I did notice that <fe0f> in zsh inside Konsole, Alacritty and Terminator but not in my GNOME Terminal; there zsh actually renders it correctly, but the following character is a bit buggy.

Hm, OK. Zsh is clever enough to switch the behavior depending on the terminal. My naive guess is that Konsole, Alacritty and Terminator implement their own width determination of emoji characters and sequences, but GNOME Terminal uses the system wcwidth, the same as zsh.

In fish I noticed grapheme clusters seem to mess up fish's fish_right_prompt function. It seems that in those shells Konsole is the most buggy terminal: it causes reprinting issues in zsh similar to what I described previously, and reprints new lines of the prompt in fish. If I remember correctly, Konsole got emoji support not so long ago, and font rendering is not perfect: it doesn't show my twemoji font and instead prints some other font.

OK, thanks for the information. Yeah, this is one of the messiest areas in terminals. I remember the discussion at Terminal WG #9.

Just something to keep in mind if support ever comes in ble.

I currently have two different approaches in mind. (a) One approach is to treat clusters as one character in text editing. For example, pressing delete after a grapheme cluster deletes the entire cluster. (b) Another approach is to leave text editing unchanged and only change how clusters are laid out in terminals. In this case, pressing delete after e.g. ♥️ will just delete the variation selector and turn it into a plain ♥.

Also, I need to support grapheme clusters and emoji sequences in prompts separately. The layout of prompts is handled by different logic because they are static texts, unlike the command line strings.
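The two approaches (a) and (b) differ in what a single deletion removes. The following is a minimal sketch of that difference on a buffer of code points. The function names are hypothetical, and the only cluster rule modeled is "a base character followed by variation selectors", which is far simpler than full Unicode grapheme segmentation.

```bash
# Hypothetical editing buffer: an array of code points (integers), with the
# cursor assumed to be at the end. Only trailing variation selectors
# (U+FE00..U+FE0F) are treated as part of a cluster in this sketch.
buf=()

is_vs() { (( $1 >= 0xFE00 && $1 <= 0xFE0F )); }

# Approach (a): delete the whole cluster before the cursor.
delete_cluster() {
  local n=${#buf[@]}
  (( n > 0 )) || return 0
  # Skip back over trailing variation selectors to reach the base character.
  while (( n > 1 )) && is_vs "${buf[n-1]}"; do
    n=$(( n - 1 ))
  done
  buf=( "${buf[@]:0:n-1}" )
}

# Approach (b): delete one code point; only the layout knows about clusters.
delete_codepoint() {
  local n=${#buf[@]}
  (( n > 0 )) || return 0
  buf=( "${buf[@]:0:n-1}" )
}

buf=( 0x68 0x2665 0xFE0F ); delete_cluster;   echo "${buf[*]}"   # 0x68
buf=( 0x68 0x2665 0xFE0F ); delete_codepoint; echo "${buf[*]}"   # 0x68 0x2665
```

With (a) the heart and its variation selector disappear together; with (b) only U+FE0F is removed, leaving the plain U+2665 in the buffer.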