VT Unicode Core Specification

contour-terminal / contour

Modern C++ Terminal Emulator

http://contour-terminal.org/

Apache License 2.0

2.42k stars 105 forks source link

VT Unicode Core Specification #404

Closed christianparpart closed 1 year ago

christianparpart commented 3 years ago

EDIT: Title changed to reflect the discussion we ended up with. Also see https://github.com/contour-terminal/terminal-unicode-core for the outcome.

Currently all (or almost all?) terminals which do handle VS15/VS16 do not change the width to narrow/wide but leave it to their base value or the preceding Unicode codepoint.

I think that is wrong. I.e. if the visual is changing, then the cursor should be moved accordingly.

But to keep that backwards compatible to all other applications / TEs, I think a soft migration should take place and I think that could be best done via DECRQM / DECSM / DECRM and a free DEC mode number to query/set/unset proper unicode width handling (including VS15/VS16).

I wonder what @jerch (or others) think about this idea? :-)

jerch commented 3 years ago

My idea on those things? Broken for our terminal world, because a "maybe render it like this" never gonna work for appside (or a multiplexer). It breaks with the promised runwidth idea of wcwidth, thus appside cannot reliable "paint" its screen anymore.

How to fix? Only way I see here is to extend the wcwidth promise by further speccing out, how these shall behave. Since appside will not know on its own, whether the terminal can render the shiny big glyph or just the ugly text representation, this render caps dependency needs to be addressed either by:

extend terminal interface, so a terminal can announce / appside can ask, how a certain (compound) character will be rendered, or
make fixated promises like switching the output representation never will chance the runwidth (thats what you think is wrong)

For 1. I see the problem, that it would involve alot of forth and back communication between terminal and app, which is always cumbersome. Thus I tend more towards 2., but thats more restrictive and will lead to poor output in edge cases.

Btw all new shiny compound features of unicode have that problem, they simply left the width issue to the renderer to decide. It totally screws up the separated render idea of cmdline apps, where the terminal works as the screen.

Closing the rant above: In long term imho the only solution is to give up the wcwidth promise, and let things flow more freely. But thats hard to swallow at least for canvas like curses apps, because it questions the grid mechanics per se.

j4james commented 3 years ago

Currently all (or almost all?) terminals which do handle VS15/VS16 do not change the width to narrow/wide but leave it to their base value or the preceding Unicode codepoint.

I think that is wrong. I.e. if the visual is changing, then the cursor should be moved accordingly.

Lets say the cursor is positioned in the bottom right corner of the screen, and the next character you receive is wide by default. It doesn't fit in that cell, so you're forced to wrap to the next line and scroll the screen. Then you receive another character, which is a variant selector, indicating that the previous character was actually meant to be narrow. How do you recover from that situation?

Technically the character is meant to be narrow, and thus should have fit on the previous line, and the screen should never have scrolled. But there's no way you now can go back and undo all of that. So this leaves you with a situation where the character looks narrow, but the number of cells it occupies may be 1 or 2, depending on where/when it is output.

I can't seem to access gitlab now, but I'm sure this has all been discussed before in terminal-wg, and I don't think anyone proposed a workable solution to the problem. It wasn't a matter of getting everyone to agree - it's just that there wasn't any solution that could be agreed to (at least that was my recollection).

jerch commented 3 years ago

@christianparpart Further note that those runwidth decisions cannot be made from an unicode version flag alone, as unicode explicitly left that up to the output system. E.g. take a compound family emoji like 👨‍👩‍👧‍👦 - both the multi glyph and the compound glyph representation are valid from unicode perspective, but for a terminal it means it can either render it as 👪 spanning 2 cells, or as 👨👩👧👦 spanning 8 cells.

Since we have no proposed solution for this yet, lets make one. Rough idea:

extend runwidth expectations with a default behavior Ideally here we find a least denominator, which will work across most terminals / output systems. For the example above I suggest to go with 8 cells as default. Note that this is really hard work to layout, as it basically means to go through all the compound and variation rules and find & define that default behavior fitting most systems.
extend terminal interface with a "render as you please" mode in DECSET or SM, plus a sequence to ask for runwidths By default a terminal would do what should be described under 1. With the new mode an app can tell the terminal to freely render things. Now it is up to the terminal to use the compound or the sequential glyphs, whatever it is capable of. The additional sequence allows the the app to ask the terminal, how wide it would render that emoji above, thus it can calculate with that in its screen layouts. The sequence needs abit thinking, prolly asking multiple chars at once is a good idea to reduce the request-response communication needs between app and terminal.

Edit: The sequence could be shaped like DECRQSS request-response cycling:

request:    DCS <TBD>  👨‍👩‍👧‍👦  ;  🏁  ;  ...  ST       // up to 16 (32|64?) entries
response:   CSI         8  ;   2  ; ...  <TBD>    // runwidths in CSI params

j4james commented 3 years ago

The sequence could be shaped like DECRQSS request-response cycling:

+1 to this. The only thing I'd add, is that it might be useful to follow the pattern of some of the other string-based queries that have a parameter to choose between pure text and hex encoding (for the characters you're wanting to test). See for example DECLBAN and DECDMAC. Although as long as you don't need to test the semicolon separator, or any of the control characters, that's probably not necessary. Worst case you could add that later.

jerch commented 3 years ago

The only thing I'd add, is that it might be useful to follow the pattern of some of the other string-based queries that have a parameter to choose between pure text and hex encoding (for the characters you're wanting to test). See for example DECLBAN and DECDMAC.

Yes thats indeed important, as not all parsers might allow UTF8 within sequence payload.

Although as long as you don't need to test the semicolon separator, or any of the control characters, that's probably not necessary. Worst case you could add that later.

My thinking with the separator was, that it would allow to handle arbitrary unicode content, thus also multiple chars from complicated scripting systems at once. The returned value would then rather denote the whole runwidth of that "phrase". (Works abit like a wswidth DB request and would allow to skip those half way broken wcwidth impls seen in same clibs.) For this I think we cannot make it collision-free without an explicit separator (maybe I miss something).

j4james commented 3 years ago

My thinking with the separator was, that it would allow to handle arbitrary unicode content, thus also multiple chars from complicated scripting systems at once.

Yeah, that seemed like a sensible approach to me. And as I said, I don't think the choice of separator is likely to be a problem because it's assumedly not something anyone would want to measure. But if they did, and we had the hex option, then they could still use the hex representation for it.

jerch commented 3 years ago

Some more thinking about such a mode extension:

Apps operating on the normal scroll buffer prolly dont care about individual runwidths, maybe this new mode could be set as default on that buffer. Not so for apps on alternate buffer, they normally have a strict idea about the screen layout, I think that mode cannot be set as default there without breaking many "canvas apps". They still can use it to their advantage, if they set the mode explicitly and do the sequence belly dance.

Roughly this leads to this scheme for default settings:

normal buffer - left to decide by the terminal itself (not sure, might cause frictions, then unset)
alternate buffer - always unset at beginning
buffer switching - always flip to the target buffer's default
multiplexer detached - only unset can give reliable results
multiplexer attached - might be able to derive a common setting from sub terminals, fallback to unset

Note that this mode doesnt care about the rendered glyph in the end. It is just a promise about the taken runwidth. Technically a terminal capable to do compound glyphs can render the family emoji from above in unset mode like:

compound/left-aligned:   |👨‍👩‍👧‍👦 | | | | | | |
compound/centered:       | | | |👨‍👩‍👧‍👦 | | | |
compound/right-aligned:  | | | | | | |👨‍👩‍👧‍👦 |
single glyphs:           |👨 |👩  |👧 |👦 |

Wow thats really hard to align here, hope you get the idea. To me only single glyphs make sense in unset mode, as it does not screw up output too much. But I think that should be left to terminal devs to decide.

Another problem directly arising from this is grapheme segmentation, and how to deal with those "spreaded compounds" at line ends. Currently I tend to treat them as non-breakable, if the segmentation algo says so, means we would get "ragged-right" line ends with early wrap-around, where more than 2 cells can stack up to one perceived character (flags for example must not break and would take 4 cells). Note that this is really hard to achieve, it means that all terminals would have to revamp their combining cell logic to support more than 2 at once, same with the cursor advance. So while this feels "more natural" to me, it might be way over what can be done / asked for.

(@christianparpart I hope I didnt go too much offtopic, as you only asked for the variation selectors. Imho to properly handle those we need to talk about fundamental handling of newer unicode concepts in the terminal first.)

j4james commented 3 years ago

normal buffer - left to decide by the terminal itself (not sure, might cause frictions, then unset)

I suppose it's up to the terminal to decide, but personally I'd expect this to be unset by default. Even in a normal buffer, people tend to do all sorts of fancy things with emojis in their prompts, and if you get the width wrong, those layouts are likely to break.

Currently I tend to treat them as non-breakable, if the segmentation algo says so, means we would get "ragged-right" line ends with early wrap-around

That seems sensible to me. You're already getting ragged-right when using emojis and ideographic languages, so this isn't much different is it? I don't know about the technical side of things, so maybe it's more complicated than I'm imagining, but I expect this issue will come up anyway if/when terminals try to support some of the more complicated writing systems, where characters can't reasonably fit in 2 cells.

jerch commented 3 years ago

That seems sensible to me. You're already getting ragged-right when using emojis and ideographic languages, so this isn't much different is it? I don't know about the technical side of things, so maybe it's more complicated than I'm imagining, but I expect this issue will come up anyway if/when terminals try to support some of the more complicated writing systems, where characters can't reasonably fit in 2 cells.

@j4james True, it basically extends the current CJK behavior with 2 cells to n cells for grapheme clusters. I am not deep enough into unicode to tell, whether this will help peeps in foreign scripting systems or not. Still I see a chance here to fix most of the problems above while sticking to the grid idea (until unicode brings the next absurd shenanigans haha). So yes, I am up for that.

But will we convince term devs to adopt to a much more complicated grid model? Grapheme clusters? Arbitrary "long" cells? There it is again - the hen-egg problem of serious everyday infrastructure. :smile_cat:

Edit: Ok, lets bring more problems attached to clusters on the table. What about the cursor? How shall a cluster be addressable from cursor movements? As one single char always? How to construct/edit during input? Some ideas:

A cluster spanning multiple cells is treated as one big cell during cursor marking, but leaves the individual movement underneath intact (you'd have to go 8 cells backwards to get to the cell right before the family emoji). Again similar to CJK handling.
Input prolly already needs IME helper in the first place. So clusters are likely to get added at once, not by individual codepoint input. If still things flow in as separated codepoints, the terminal would have to revise the old output after adding the incoming chars that (prolly new!) cluster. This only works reliable in consecutive input mode, not for overwriting existing cells after cursor jumps. This is also directly related to the question, where "to park" a cluster in the buffer (all in the first cell? spread across taken cells?). To me putting a cluster into the first cell it occured, seems easier to grasp and handle for follow-up writes and cursor moves (but this also means, that the terminal might have to do late wrapping, when the cluster starts to overflow at margin).
Edit though is more tricky to solve - what should be rendered after one BS+erase, or something got removed from the middle, or makes the cluster break apart? I have no good answer to that yet, maybe the terminal wants to blackout the cluster in question in normal output, and brings instead the IME helper up again. Idk if such a behavior is feasible at all. It touches all those problems we already have in richtext editors for more complicated char input. Maybe we can learn from them to some degree. A simple straight forward implementation would do the same as with CJK (cluster in first cell, touching the cell area with erase actions deletes the cluster).

christianparpart commented 3 years ago

Closing the rant above: In long term imho the only solution is to give up the wcwidth promise, and let things flow more freely.

My stance is that wcwidth should be deprecated (recommended not to be used) and wcswidth be used instead (assuming that wcswidth is aware of grapheme clusters and VS15/VS16 as maybe-future-defined of course :) ).

Lets say the cursor is positioned in the bottom right corner of the screen, and the next character you receive is wide by default. [...] How do you recover from that situation?

Okay, @j4james here you got me on the cold foot-step. I did not think of that case just yet, or in other words: I assumed that the base character will cause an auto-wrap and a following VS15 would indeed change width from X (say 2) back to 1 but not changing back to the prior cursor position (even though, now that I think of, that is technically possible and should not be too expensive, because at least in my case, I do keep track of the previous grid coordinate, so that unwrapping would be possible with some additional logic when U+FE0E is received and previous text character changed coordinate due to auto-wrap. I think that's not a deal breaker as long as it's well defined somewhere).

I can't seem to access gitlab now, but I'm sure this has all been discussed before in terminal-wg

OT: thinking positive here: we have a non-prejudicial clean-room discussion then :-)

[... ...] but for a terminal it means it can either render it as family spanning 2 cells, or as manwomangirlboy spanning 8 cells.

I think with a well defined DECRQM-mode (or similar) this can be very well defined, too. I may be dreaming a little bit too much of a perfect world here. When I was reading the TR#51 it was indeed stating that ZWJ emoji can "alternatively" be rendered with each emoji individually. Now, to me, that reads like a compromise to ZWJ-non-supporting unicode-rendering implementations still "conforming". Alacritty for example does render that emoji 👪 as 4 individual even (non-colored) emoji. But Alacritty gets a lot of emoji wrong, so I'd consider that TE as non-supporting and can be easily distinguished from those that do expose support for proper ZWJ emoji rendering (for example via DECRQM (not saying it has to be that way, it's convenient though :) )).

WRT your proposals, I think I ruled out proposal number 1 with that idea. Number 2 is interesting and actually something I had in my mind already too. I neglected it due to impractical use (IMHO). Then one may ask when it would use that (I'm eyeing at notcurses here for example). I think notcurses (or other apps/libs) would then use that VT sequence just with one or at most a few sequences to determine whether or not ZWJ sequences are rendered in the legacy way or the (i'd call it) the proper way. That's binary and may again fit more efficient into a RQM-style test that (on success) would give the guarantee to follow some "spec" as being well defined somewhere publically on the net.

To your DECRQM-style VT sequence, continueing to think about that idea: I cannot remember out of my head, but I am sure there are DCS that do also respond with DCS, so no need to allocate another CSI-response.

See for example DECLBAN and DECDMAC.

I am sure there are other CSI sequences that expect a textual parameter in there decimal representation even. With a quick check I found DECFRA which I have recently implemented - I actually thought there were more than just DECFRA :) .

Some more thinking about such a mode extension:

I'm glad that idea was picked up by you at least :-D ... Well, I am not sure a TE needs to have that flag based on buffer. Because the canvas-style app can use DECRQM to read the current state before going to work and just restore that upon application exit.

Multiplexers are always a special case that I at least also like to not forget about either. I think they won't have any issue because they can test the connecting client TEs for support too, and if the connecting TEs do not expose support or even a multiplexer is having multiple client TEs connected with varying support, then the multiplexer can inject a space character after those emoji / double-width characters that do not support this spec.

Note that this mode doesnt care about the rendered glyph in the end. It is just a promise about the taken runwidth

I hope I'm not acting too blunt here. But I think with runwith you actually mean how many grid cells a given grapheme cluster will occupy, right? So yeah, such a spec would be about:

support detectability (e.g. DECRQM)
mode enable/disable (enabling gives the cursor-movement guarantees and proper rendering of emoji)

Mind, I am still strongly against the idea of supporting the alternate emoji representation (e.g. 4 individual emoji instead of a family emoji).

Another problem directly arising from this is grapheme segmentation, and how to deal with those "spreaded compounds" at line ends. Currently I tend to treat them as non-breakable, if the segmentation algo says so

In my implementation I am strictly adhering to the grapheme cluster segmentation algorithm. So consecutive characters that are by definition unbreakable will always end up into the same grid cell. That is a guarantee that should then be made by such a fictional spec, too (if that mode is enabled).

@christianparpart I hope I didnt go too much offtopic, as you only asked for the variation selectors

No you didn't. It's kinda connected topic anyways. However, my whole point here was (and is) to take care of complex emoji with ZWJ and VS15/VS16 overrides, their cursor movement implications and display representations. I think that's small enough to not fear the talked-to-death syndrome.

p.s.: I didn't manage to process ALL posts yet, will resume later, but maybe we end up productively? ;-)

christianparpart commented 3 years ago

but I expect this issue will come up anyway if/when terminals try to support some of the more complicated writing systems, where characters can't reasonably fit in 2 cells.

Isn't it that all codepoints are mapped with an east asian width that can at most be Wide, which is interpreted as 2 grid cells? So it can AFAIU never exceed 2 grid cells.

What about the cursor? How shall a cluster be addressable from cursor movements?

A grapheme cluster occupies one grid cell, so no need to change cursor semantics. A grapheme cluster may trigger the cursor to move 2 columns instead of just 1 column forward. So technically you can address the empty grid cell that was jumped over earlier. I do not see an issue here because it's not simpler (nor more complex) as with DECDHL / DECDWL (double width/height characters).

How to construct/edit during input?

That's the job of the application, no need to specc that out.

jerch commented 3 years ago

My stance is that wcwidth should be deprecated (recommended not to be used) and wcswidth be used instead (assuming that wcswidth is aware of grapheme clusters and VS15/VS16 as maybe-future-defined of course :) ).

Agreed. To my understanding the whole wcwidth idea is flakey, if provided from standard system libs. I have a small hope, that a very fundamental definition, in combination with a sequence as described above for the more complicated things, would do in the end. Makes all those wrong wcwidth table issues obsolete. If in doubt, ask the terminal. A dream would come true :lollipop:

I think with a well defined DECRQM-mode (or similar) this can be very well defined, too. I may be dreaming a little bit too much of a perfect world here.

Agreed.

When I was reading the TR#51 it was indeed stating that ZWJ emoji can "alternatively" be rendered with each emoji individually. Now, to me, that reads like a compromise to ZWJ-non-supporting unicode-rendering implementations still "conforming".

Yes, there are many parts in unicode phrased either vaguely as maybe's, or directly as "left to the output system". Thats is not helpful for us at all, thus we have to do the dirty job of some "after-speccing".

To your DECRQM-style VT sequence, continueing to think about that idea: I cannot remember out of my head, but I am sure there are DCS that do also respond with DCS, so no need to allocate another CSI-response.

Well, I dont really care, if the response changes the sequence realm or not, a DCS ofc would be more free in its payload format. Problem I see here - its abit more involved to parse DCS correctly, while CSI is pretty darn simple (note we are talking here about appside digesting those responses, not terminal that should get that right in the first place). Furthermore most apps do a lousy job in reading back data from the terminal, I think we should make sure, that the response never exceeds POSIX's minimal PIPE_BUF size (imho defined as 512 bytes), otherwise the OS might chunkify things and the app/script goes bonkers.

I'm glad that idea was picked up by you at least :-D ... Well, I am not sure a TE needs to have that flag based on buffer. Because the canvas-style app can use DECRQM to read the current state before going to work and just restore that upon application exit.

I dont think so either, but I think we should get the defaults straight to not break half of the curses world, just for proper emojis. By making unset the default on alternate buffer, a canvas app not supporting the new mode, does not have to ask the terminal (it prolly isnt even aware, that it could ask for that mode), and will just keep working as before.

I hope I'm not acting too blunt here. But I think with runwith you actually mean how many grid cells a given grapheme cluster will occupy, right?

Yep, lol idk if there is a better english term for it, its the german "Laufweite".

Mind, I am still strongly against the idea of supporting the alternate emoji representation (e.g. 4 individual emoji instead of a family emoji).

Well, thats what I was trying to depict with those different alignment ideas above. I also dont think terminals should be pushed into those single glyphs, if they can just render the compound thingy well. Still the cell/cursor advance should follow the basic promise, if the new mode is unset. If the new mode is set, do as you please. Thats the idea.

In my implementation I am strictly adhering to the grapheme cluster segmentation algorithm. So consecutive characters that are by definition unbreakable will always end up into the same grid cell. That is a guarantee that should then be made by such a fictional spec, too (if that mode is enabled).

Yepp, sounds good to me. I dont think such a spec would need any claims about where to store things up, still the behavior must be clearly laid out. E.g. consecutive data, even across several chunks, will feed to the same cluster, if the segmentation algo says so. After cursor jumps or any other in between data I think that should not be the case, but instead treat cluster data as "broken" effectively overwriting previous cell content (which makes sense in terms of unicode data stream, a cluster should not magically continue after some terminal sequence in between following default unicode breaking rules). I am stating this explicitly here, as this is quite easy to be overlooked during "print handling" in the terminal.

Isn't it that all codepoints are mapped with an east asian width that can at most be Wide, which is interpreted as 2 grid cells? So it can AFAIU never exceed 2 grid cells.

To my understanding in some scripting systems like indian languages there are clustering constellations, that might lead to weird quarter/half width, stacking up to some bigger thingy in the end. In a normal wordprocessor those are dealt with by the font renderer from the font glyphs and their composition/ligature hints. We kinda have no easy way to do that in an (offscreen) terminal, plus we really dont want that (depending on font caps? haha). Therefore I think we need to get them specced in a certain way. Here it would be good to have someone with more experience in those scriptings onboard. I am not that one, so read my comment as hearsay.

christianparpart commented 3 years ago

After cursor jumps or any other in between data I think that should not be the case, but instead treat cluster data as "broken" effectively overwriting previous cell content (which makes sense in terms of unicode data stream, a cluster should not continue after some terminal sequence in between following default unicode breaking rules).

Exactly. That is what I meant with consecutive and that is how I implemented it.

I do not think that s minimal spec must include any definitions of the day mode is disabled (or better: not enabled). Because IMHO, that's the whole point, to get a well defined environment that you can access with this mode being enabled. If it is not enabled then the app must not expect anything as the behavior is as undefined as it is today. I care about a well defined environment that is surely not enabled by default (backwards compatibility.....) but if enabled we have all those guarantees we talked about so far.

I do not know such a minimal spec would need to take care of weird scripts with regards to east Asian width, as in the end we talk about this here in order to get emoji trending and it's cursor positioning right .

Sure, more could be defined and included into such a mode. But i fear that we then run into the rabbit-hole where we will not finish the idea.

What do you think?

j4james commented 3 years ago

I do keep track of the previous grid coordinate, so that unwrapping would be possible with some additional logic when U+FE0E is received and previous text character changed coordinate due to auto-wrap.

But how do you "unscroll" the screen when the wrapping happens on the last line? And bear in mind that the scrolling may have occurred within margins, in which case the line that scrolled off the top would have been erased completely, so it's not like you can just go back in the scrollback buffer.

And even if you did something where you kept a record of the last line scrolled, so you could unwind that as well, this doesn't seem to me like a workable solution, because whenever the unwind occurs, the screen is going to jump as it scrolls up and down.

The only solution that I thought might be reasonable, was something like a delayed wrap. So if you write a wide character on the last column of the page (and assuming it was capable of being narrowed), then you don't actually wrap immediately, but just display half the character (or maybe the narrow version, or nothing at all). Then when you receive the next character, either it's going to shrink and can be left where it is, or it's definitely wide and you can then safely trigger the wrap.

I don't particularly like that solution either, but if I absolutely had to support width-changing variants, that seemed like the least worst option to me.

Also note that wrapping is only one example of the problems you get with width-changing variants. Another case to consider is when Insert/Replace Mode is set (i.e. you're inserting) and you write out a wide character that pushes two cells off the right edge of the screen. Then you receive a variant selector which narrows that character, so now you need to undo one column of the insert, and somehow recover one of the characters that had been pushed off screen.

christianparpart commented 3 years ago

@j4james ooh right. I forgot about margins. Sorry.

I remember i once checked against web browsers and it turns out that (i tested with Chrome) VS15 does indeed change the presentation to text (try with any emoji) but it keeps the width of "Wide", which would be great for us. I forgot that I just wanted to have is a user experience convergence, so web emoji and terminal emoji should behave equally. I think we all may have had a temporary misunderstanding? IIRC VS16/VS16 is about changing presentation (colored vs text of emoji).

With that in mind, the only problem i see might be the copyright symbol. I think that by default has width 1 but can have VS16 applied too, so it does grow. There may be other symbols like that. But for the grow case i think we all can agree on a workable solution.

Did Imiss anything?

Trying to recap a small checklist of potential spec requirements:

consecutively (!) written non-breakable Codepoints will always end up in the same grid cell, leading to a grapheme cluster aware TE.
emoji symbols are always rendered in square (as required by TR51), implying a East Asian Width of Wide (2 grid cells), and requiring compound (ZWJ) emoji to always be rendered as compound emoji. The alternate rendering of ZWJ emoji therefore is considered invalid / not supported.
VS16 upgrades symbols to emoji presentation, leading to width 2, and potentially reflowing that symbol to the next line if on right margin with AutoWrap on
VS15 changes emoji presentation from emoji emoji to text emoji but retains width of 2. (This matches web Browser behaviors too)
emoji symbols regardless of Variation selectors (15/16) will move the cursor visually next to it, so move it by 2 columns instead of 1.
emoji written with the cursor at the right margin and with AutoWrap on will first trigger AutoWrap and then write the emoji character into the grid (aligns with CJK)
emoji written at right margin and with AutoWrap OFF will yield that character to be rendered only it's first half.
All of the above must be adhered to if the TBD (DEC) mode is on. Otherwise the behavior is as undefined as it is today.

Does this sound like it could convince other TE devs?

What did we miss in this list? What do you think? :)

jerch commented 3 years ago

@j4james ooh right. I forgot about margins. Sorry.

I think we should at least try to find/define a default behavior for the margin issues, that terminals should follow. Maybe something that respects late wrapping (in case a follow-up code leads to margin overflow -> wrap whole cluster), but ignores late "shrinking" (move things a row back with all the nasty scroll/row adjustments sounds really bad to me). This furthermore needs one escape rule - if the scroll region is smaller than the resulting cluster cell width, then things must be handled special.

I remember i once checked against web browsers and it turns out that (i tested with Chrome) VS15 does indeed change the presentation to text (try with any emoji) but it keeps the width of "Wide", which would be great for us. I forgot that I just wanted to have is a user experience convergence, so web emoji and terminal emoji should behave equally. I think we all may have had a temporary misunderstanding? IIRC VS16/VS16 is about changing presentation (colored vs text of emoji).

To me this was pretty clear, the problem remains with those very early symbols (idk - from unicode 1-3?). Some of them got relabeled as emojis and would fall under the 2-cell width rule with unicode 9+, others not and remain as 1 cell in text representation, and 2 cells in pictogram variant. Here I think we should spec all as 2 cells to lower the output ambiguity (as you indicate below).

With that in mind, the only problem i see might be the copyright symbol. I think that by default has width 1 but can have VS16 applied too, so it does grow. There may be other symbols like that. But for the grow case i think we all can agree on a workable solution.

Have not looked it up, I think there are like 2 or 3 tiny codepage areas with these early symbols, that show this (toxic) behavior.

Did Imiss anything?

Trying to recap a small checklist of potential spec requirements:

consecutively (!) written non-breakable Codepoints will always end up in the same grid cell, leading to a grapheme cluster aware TE.

I'd go one step further here and would state in a spec, that proper grapheme clustering handling is mandatory. And maybe even refer to the unicode version about the rules (once we have that unicode flag as discussed in terminal-wg). We cannot really make this a maybe, unless we want to ignore unicode progression again.

emoji symbols are always rendered in square (as required by TR51), implying a East Asian Width of Wide (2 grid cells), and requiring compound (ZWJ) emoji to always be rendered as compound emoji. The alternate rendering of ZWJ emoji therefore is considered invalid / not supported.

I am not sure if thats a good idea. Some output systems simply might not be able to construct the compound symbols, what should they output here, an "oops" placeholder? Why not leave that to the TE in combination with the sequence above? A TE capable to do compounds would report them as 2 cells, a TE without that feature prolly longer.

VS16 upgrades symbols to emoji presentation, leading to width 2, and potentially reflowing that symbol to the next line if on right margin with AutoWrap on

VS15 changes emoji presentation from emoji emoji to text emoji but retains width of 2. (This matches web Browser behaviors too)

emoji symbols regardless of Variation selectors (15/16) will move the cursor visually next to it, so move it by 2 columns instead of 1.

Yes to all, and as indicated above, we prolly should do the same for those older symbols as well.

emoji written with the cursor at the right margin and with AutoWrap on will first trigger AutoWrap and then write the emoji character into the grid (aligns with CJK)

Yes (note there might be an intermediate "not yet wrapped" state with later wrapping though, in case the cluster algo could not yet determine it as "emoji 2-cells wide")

emoji written at right margin and with AutoWrap OFF will yield that character to be rendered only it's first half.

This gets a downvote from my side. The DECAWM "off" state is kinda already broken for the last cell + its cursor handling for CJK, Why not simply clear the last cell, but refuse to draw the new "half" content?

All of the above must be adhered to if the TBD (DEC) mode is on. Otherwise the behavior is as undefined as it is today. What does that mean? Does this sound like it could convince other TE devs? What did we miss in this list? What do you think? :)

Few more things come my mind:

should mention: Other typical TE features stay intact, as long they where not altered by the "spec" explicitly. Means scroll regions, margins, normal cursor advance etc. pp works as before.
cursor advance over grapheme clusters prolly needs some in detail explanation
Since such a spec would introduce emoji picogram output as "officially" supported, maybe this needs some notes about SGR rules applied to those.
Imho we need a stance regarding text repr vs. picogram repr. A terminal is still a text thingy in the first place, so ppl might wonder, why not making that the default repr. Maybe at least something along this "TEs are free to choose repr variant based on their output caps".
needs an upgrade path: The grapheme thingy is bold, would be good to have that marked from TE side somehow, thus my idea with that "new mode". A new mode furthermore allows us to be more free in what to spec, as we can explicitly deviate from old behavior where needed. Trying to do that as the new default behavior for all TEs will let us fail, as the existing infrastructure entanglement will/cannot follow that easy, and we have created just another failed spec proposal.

christianparpart commented 3 years ago

To me this was pretty clear, the problem remains with those very early symbols (idk - from unicode 1-3?). Some of them got relabeled as emojis and would fall under the 2-cell width rule with unicode 9+,

With a new spec, I'd suggest to only care about recent Unicode (13+ / 14+ as of today), because this spec won't be fully implemented by any TE that does still want to stick to pre Unicode 9. (I mean, a TE can be pre-9 if that mode is not enabled, but once enabled, rules are as strictly defined).

Here I think we should spec all as 2 cells to lower the output ambiguity (as you indicate below).

I think TR51 is pretty clear on the presentation of emoji regardless of their codepoint: square. (i.e. 2 cells). There is nothing ambiguous about that. If a codepoint sequence is emoji and to be rendered in emoji presentation, there is an algorithm that I cannot give you a reference for (yet!). I did implemented based off the Google's Blink source code (yes, really :-D) and also adapted to their unit tests, and it all seems to work. I'm sure I can find the spec reference given some more time and space in my head :-)

With that in mind, the only problem i see might be the copyright symbol. I think that by default has width 1 but can have VS16 applied too, so it does grow. There may be other symbols like that. But for the grow case i think we all can agree on a workable solution.

Have not looked it up, I think there are like 2 or 3 tiny codepage areas with these early symbols, that show this (toxic) behavior.

I think it can still be well specced. promoting any grapheme cluster to emoji presentation via VS16 - based on the algorithm - will ensure its width is 2 instead of what it was before (1 or 2). For non-emoji graphemes the width algorithm will decide its length.

I'd go one step further here and would state in a spec, that proper grapheme clustering handling is mandatory

This implication was actually my goal. But can give that description the other way around, yes, i.e.: Grapheme cluster handling is mandatory and is implemented by mapping every consecutively(!) written character that is non-breakable to the preceding one to the previous cell and not moving the cursor instead of to the current position and moving the cursor. (I'm sure we can find a better wording, I'll leave that for later :) )

And maybe even refer to the unicode version about the rules

Yes.

(once we have that unicode flag as discussed in terminal-wg).

No. I mean, sorry, I'm not positive that will progress anytime soon. the features spec is (in my own opinion!) way to non-trivial already that it'll most likely end up in hibernation state as many others - unless someone someone is stepping up and actively pushing for it. Sorry to be so pessimistic here, I'm open for surprises though ;)

emoji written at right margin and with AutoWrap OFF will yield that character to be rendered only it's first half.

This gets a downvote from my side. The DECAWM "off" state is kinda already broken for the last cell + its cursor handling for CJK, Why not simply clear the last cell, but refuse to draw the new "half" content?

I chose that behavior because it is in line with how CJK "seems" to be handled as of today on most TEs (correct me if I'm wrong). Now also taking care of CJK and whatnot I fear we'll suffer the same syndrome as many other attempts the TE-community had in the past. OTOH I don't mind with not displaying that one, but then I'd say, isn't it better to display half of it than nothing at all? People might be wondering where there grapheme is and why it magically went away.

cursor advance over grapheme clusters prolly needs some in detail explanation

Cursor movement semantics stays fully intact. Since a grapheme cluster fits into one grid cell, when traversing over each column, you also traverse from one grapheme cluster to another. That means: you cannot place the cursor in the middle of a grapheme cluster and that also implies if you want to modify a grapheme cluster you have to fully rewrite that grapheme cluster.

Since such a spec would introduce emoji picogram output as "officially" supported, maybe this needs some notes about SGR rules applied to those.

SGR applies to a grid cell and therefore is applied to the complete containing grapheme cluster. I hope that is intuitive. If you have any particular question in mind, please ask :-)

Imho we need a stance regarding text repr vs. picogram repr. A terminal is still a text thingy in the first place, so ppl might wonder, why not making that the default repr. Maybe at least something along this "TEs are free to choose repr variant based on their output caps".

My stance is to obey the Unicode spec (Currently 13, soon 14), and yes, I know we are talking about a terminal, but if a user decides to put in an emoji (without VS15), I am sure he'll be confused if that's by default going to be in text presentation (as if you'd have appended VS15 to it). I think that would definitely be wrong.

needs an upgrade path: The grapheme thingy is bold, would be good to have that marked from TE side somehow, thus my idea with that "new mode". A new mode furthermore allows us to be more free in what to spec, as we can explicitly deviate from old behavior where needed. Trying to do that as the new default behavior for all TEs will let us fail, as the existing infrastructure entanglement will/cannot follow that easy, and we have created just another failed spec proposal.

Yeah, using SM/DECSM here would make that feature binary. One might need to introduce a new mode number in order to point to a newer version of a future-spec. I personally don't even think that is a problem, but I am not sure how others feel about that. Another solution would be to use a different sequence. But that would imply to mimic SM/RM and DECRQM. That sounds more complicated than simply sticking to a mode number and introducing a new mode number in case of changes. Then again one might propose to use sub parameters, such as SM ? <mode_number> : <version> h to enable this mode at a given version. But I am sure other people then would dislike that, too. I personally like last proposal the most with : <version> being defaulting to whatever is latest (currently 13, almost 14) by the time we decide to stick to it. Note, Unicode 14 does not have any implications on our topic we are addressing here.

p.s.: With all the discussions above (grapheme cluster + emoji + VS15/16 treatments) is what I was once starting to write together so something I called "Terminal Unicode Core" with the goal to have it all well defined that we are by accident talking about here. So yeah :-)

j4james commented 3 years ago

I don't have the bandwidth to follow this discussion closely, but I think I'm broadly in agreement with most things you guys have been saying. One point I would suggest is to avoid overspecifying stuff that doesn't matter. Wide characters that are clipped at the end of the line is a case in point (i.e. when autowrap is off). Just leave that as undefined behaviour - neither choice is going to break the layout of the page. Cursor movement over grapheme clusters is another one - there should be no need for an app to do that, so just declare it undefined. If you make too many rules that aren't essential, then terminals will just end up ignoring them anyway.

I think we all may have had a temporary misunderstanding? IIRC VS16/VS16 is about changing presentation (colored vs text of emoji).

Yeah, but I think (and I may be wrong) that there is an expectation that the text presentation occupies one cell, and the colored presentation occupies two.

With that in mind, the only problem i see might be the copyright symbol. I think that by default has width 1 but can have VS16 applied too, so it does grow.

You can see the full list of affected character here: https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-variation-sequences.txt

Those that are text/narrow by default I think are fine. As you say, making them wider shouldn't be a problem. It's just the ones that are wide by default, if we want to make them narrower when using the text presentation.

And it's worth noting that Kitty already supports this. But if you output a "narrowed" character in the last column it just gets wrapped anyway, even though technically it should fit. This has some unfortunate side-effects, but I suppose it's not that much worse than any of the other options. I don't know. Maybe this is another thing that should just be left undefined.

christianparpart commented 3 years ago

I don't have the bandwidth to follow this discussion closely, but I think I'm broadly in agreement with most things you guys have been saying.

One point I would suggest is to avoid overspecifying stuff that doesn't matter. Wide characters that are clipped at the end of the line is a case in point (i.e. when autowrap is off). Just leave that as undefined behaviour - neither choice is going to break the layout of the page.

Agreed.

Cursor movement over grapheme clusters is another one - there should be no need for an app to do that, so just declare it undefined.

Since a grapheme cluster must be always located in a single grid cell and we do not want to change any cursor movement VT sequence demantoid i would therefore say that there is nothing undefined or can be accident misunderstood. From my view that means that you cannot out the cursor in the middle of a multi codepoint grapheme cluster, because the minimum cursor jump take use one grid cell distance. Maybe i am thinking/writing a little bit to theoretical though.

If you make too many rules that aren't essential, then terminals will just end up ignoring them anyway.

I want to have a as minimal as possible spec to increaee Agreement and Adoption rate on all ends.

I think we all may have had a temporary misunderstanding? IIRC VS16/VS16 is about changing presentation (colored vs text of emoji).

Yeah, but I think (and I may be wrong) that there is an expectation that the text presentation occupies one cell, and the colored presentation occupies two.

Those that are text/narrow by default I think are fine. As you say, making them wider shouldn't be a problem. It's just the ones that are wide by default, if we want to make them narrower when using the text presentation.

I was basing my judgement out of how the web browser does render them. That should be what most users should feel familiar with.

christianparpart commented 3 years ago

@j4james i will summarize and formalize. What needs to be done in order to get WT buying in?

j4james commented 3 years ago

What needs to be done in order to get WT buying in?

There's an issue in the WT tracker (https://github.com/microsoft/terminal/issues/8000) where they've been discussing support for more advanced features of Unicode, as well as complex scripts. Initially I didn't think it was a good idea, because I assumed they would just break existing applications, but this mode idea of yours seems like it would be a solution to that problem (and if not a mode, then possibly the cluster measuring sequence that was discussed earlier).

But the first thing would be to decide whether you think your ideas align with what they're planning. There's much detail in that issue, but broadly speaking you can probably tell if you're likely to be in agreement with them or not. If you are in agreement, then maybe leave a note there describing your plans for the mode, and see whether they'll be interested in collaborating. Personally I'm in favour of the idea, but I'm just a contributor there - I can't speak for the WT team.

On the plus side, there are people at MS that are genuine experts on the subject, which would be helpful in covering areas of Unicode that you may not know about. The down side is that you may have to wait some time before they're ready to agree to anything.

christianparpart commented 3 years ago

Thanks @j4james . I keep you posted.

christianparpart commented 3 years ago

@jerch @j4james I'd like to kindly ask you to read https://github.com/contour-terminal/terminal-unicode-core/releases/tag/v0.1.0_prerelease_1 and maybe give some feedback on it. I hope I did address it all. We can use this document (it's source code / git repo) as base of the current state of discussion.

I try to keep that up-to-date and that is the document I'd like to forward to https://github.com/microsoft/terminal/issues/8000 once we've found a consensus at least all of us are comfortable with so we can get more feedback from others.

j4james commented 3 years ago

That looks good to me. I was expecting it to be more complicated, but if everything else is covered by the linked Unicode documentation then that's brilliant.

Answers to some of the questions in the sidebar:

For the Unicode version issue, I'd be happy to ignore it until it becomes a problem. We may be worrying about something that never happens again.
For feature detection, I think it's better not to even mention DA1. While I'm in favour of DA1 for feature detection in general, I'd rather reserve it for features that can't be detected in any other way, so it doesn't get overloaded unnecessarily.
Regarding skipped grid cells in the emoji section, I'm really not sure whether that needs to be explicit. I'm happy to leave that an open question for now and see what others have to say.

Minor nit regarding references: when you say "as described in 9", it would be a little clearer if you referenced the actual document name, e.g. "as described in UTS 29" (with a [9] reference link following that).

I also think some of the wording could possibly be made clearer, but that's something that can be polished later, once you've got feedback on the actual substance of the spec.

jerch commented 3 years ago

Wow, that draft is pretty on point and I am impressed, that you got it sorted that short. And I like it for being that short and concise. :+1:

For the Unicode version issue, I'd be happy to ignore it until it becomes a problem. We may be worrying about something that never happens again.

Yepp, I feel the same way here. If you care about different unicode version rules, maybe just point out in an additional sentence, that this was made with rules for unicode 11-13 in mind. That way we know later on, where the relevance might get thin again, because unicode introduced some new fancy stuff with 14+ or so.

For feature detection, I think it's better not to even mention DA1. While I'm in favour of DA1 for feature detection in general, I'd rather reserve it for features that can't be detected in any other way, so it doesn't get overloaded unnecessarily.

Agreed. Currently I also would not mess with DA1, furthermore stating something important like feature detection as maybe again ("The DA1 could be extended to also indicate support") is not helpful for a spec like thingy (either tells peeps to do that, so apps can grow confidence to find it there, or dont mention at all. I lean towards "dont mention" for now). Furthermore doubled feature reporting is awkward and will just lead to implementation/request frictions later on, so I am good with "request it exactly this way, period".

Regarding skipped grid cells in the emoji section, I'm really not sure whether that needs to be explicit. I'm happy to leave that an open question for now and see what others have to say.

Agreed. And if in doubt, well TEs prolly gonna do what they already do for CJK. So most likely there is no issue from that at all.

Note: above I said something about picogram and SGR handling - what I meant there was to make clear, how SGR attributes would apply to picograms. Should a TE make attempts to underline a picogram? BG color applied? What about FG? Bold? Thin? While I have a personal stance here, I also think it is not needed to be specced out in detail, but maybe encourage TEs to apply them in a sensible way. What really wins here - idk yet myself. (Prolly color masking from FG is way too much, but BG/underline etc makes totally sense to me)

About the performance considerations: I would not put something like that into the main document, as thats not part of the "spec". If at all, maybe into some addendum for implementation hints/details.

christianparpart commented 3 years ago

Thx guys for the feedback. I will integrate that she hopefully can give news ASAP, currently short on time. :)

christianparpart commented 3 years ago

https://github.com/contour-terminal/terminal-unicode-core/releases/tag/v0.1.0_prerelease_2

This is now having integrated your feedback. - I hope I did not miss anything. - but ping me if so, or if we can improve on anything else. :)

jerch commented 3 years ago

I wonder, if regional indicator (RI, country flags) should also be standardized by this? When I was initially dealing with grapheme rules, I found them to be more tricky, but cannot remember why (was it because of stacking and right margin handling? Idk...)

Edit: Oh right, it was because of their 1+next rule, had kinda troubles to get their bounderies right for multiple flags in a row...

christianparpart commented 3 years ago

@jerch in the other hand, country flags are just working fine on my end with the above rul s Plus proper Text shaping (maybe that is what can mess some TE devs up).

Because most TEs don't do any proper Text shaping at all but only render per text character. Kitty she's some tricks manually to get for example ZWJ emoji working. I chose to trust harfbuzz more than my own code.

I can do some additional tests later though.

jerch commented 3 years ago

@christianparpart Hmm yeah, the rules prolly cover RI just fine. Well it was more an issue on my end, how I ended up building the carry for cluster additions during single codepoint input (choosing a subpar abstraction).

christianparpart commented 3 years ago

@christianparpart Hmm yeah, the rules prolly cover RI just fine. Well it was more an issue on my end, how I ended up building the carry for cluster additions during single codepoint input (choosing a subpar abstraction).

If of interest we could do that implementation Details / recommendations addendum that covers some helpful insights on how to implement

grapheme cluster segmentation
emoji presentation segmentation
properly text shaping in the context of a terminal

We could also propose a C API for the unicode (not the Text shaping part) and a reference implementation. I think you still remember my RFC to https://github.com/contour-terminal/libunicode/blob/master/src/unicode/capi.h

jerch commented 3 years ago

If of interest we could do that implementation Details / recommendations addendum that covers some helpful insights on how to implement

Yes that would be good, as it would be valuable information to get things done (to me the scattered resources were more of a problem than the spec stuff itself).

christianparpart commented 3 years ago

Yes that would be good,

Okay. I will create that addendum based off my terminal text stack document then. As soon as I have some more dedicated time tonight or next night and notify you guys then.

j4james commented 1 year ago

I came across an old discussion of the VS15/VS16 selectors in the VTE issue tracker the other day (see issue 2317), and they highlighted something in the Unicode spec which I hadn't noticed before: namely that it doesn't actually recommend VS15 changing the width.

Quoting from UAX11 East Asian Width:

UTS51 emoji presentation sequences behave as though they were East Asian Wide, regardless of their assigned East_Asian_Width property value.

And an emoji presentation sequence is defined as an emoji characters follower by VS16 (for the official definition see here and here).

So that recommendation is clearly suggesting that VS16 would make a narrow emoji wide, but there isn't an equivalent recommendation saying a text presentation sequence should be narrow. That implies that they don't expect VS15 to have any affect on the width.

I know we reached the same conclusion here anyway, but I thought it was nice to know that the Unicode specs are in agreement are on that point.

christianparpart commented 1 year ago

Thanks, @j4james. And sorry for the late response!

it doesn't actually recommend VS15 changing the width.

Yeah, I settled with that myself now. IIRC, I had some discussion on VS15 not changing width recently (past few months) with someone and it made sense to leave it, while VS16 should indeed increase the width (as in: ensure it's wide).

I will make sure the VS Unicode Core Spec I drafted is reflecting that ASAP. (and also make sure we finish this ticket here) :-)

Have a sunny day, Christian.