Add 'effective utf8' field to ncinput struct.

Spritetm commented 1 year ago

Proposed fix for https://github.com/dankamongmen/notcurses/issues/2695 . This code adds an utf8_eff field to the ncinput struct; it gets filled with the 'effective' keypress value, that is, the UTF8 that the keypress generates with all the meta keys taken into consideration.

Looking at the 'reader' widget, it uses the the ncinput.utf8 for this, and for non-Kitty-keyboard-disambiguation terminals, it effectively acts like this. However, the docs imply that this is non-intentional and simply a result of other terminals not returning the 'base' key, so notcurses cannot know what that was.

As such, I decided to add an extra member to the ncinput struct rather than re-using the ncinput.utf8 field. The extra advantage is that this is backwards compatible with any other program as the meaning of existing fields stays the same.

dankamongmen commented 1 year ago

ncinput is not a size-critical structure, so i don't mind adding a new field if it's valid
you seem to know what you're about, so i don't doubt that this is a good solution on kitty
holy shit, you had to fuck with the input automata? and did so? i bet that was an adventure. nicely done.
the entire input mechanism is erudite and definitely the most fragile part of notcurses. changing it can easily break things in terminals you're not actively testing, and it's difficult to meaningfully test everything related to input. with that said, you have a valid need (that i admittedly haven't yet fully grokked; i would need to context switch into notcurses and terminal land for a few hours to do so), and a solution that seems promising if similarly not yet fully understood

so.

i'd like to just merge this straight through, trusting that your intelligent writeups and solid code implies a thorough understanding of both the problem and notcurses. i would do so for anything but an input problem. i've just been burned there too many times.

so.

if you can provide me with evidence that you tested this with the other major terminals, and ncinput still ran correctly in them for a variety of input, i'll merge it after a quick readthrough. this evidence could be nothing more than "i tested these terminals with this methodology". i don't need screenshots or anything; i trust you.

or.

you can wait for me to take a day and read back through the input code, reacquaint myself with all the crap surrounding it, and perform such testing myself. this might happen in january, but i honestly doubt it. feels more like a mid-february thing.

let me know how you'd like to move forward. i really appreciate the obvious quality here. in fact, if you wanted to talk about taking over maintainership, i'd be very interested in that =].

Spritetm commented 1 year ago

I think you have more trust in my code than it's worth, it was written as a sort of proof-of-concept to see if you had any major issues with the architecture. To illustrate my point: there's actually a blunder in the design due to me misunderstanding Kittys protocol. I'll fix and then I'll answer your main points.

Spritetm commented 1 year ago

Okay, first of all: my derp was that I thought Kitty returned the effective codepoint for a key as one or more UTF8 bytes. It does not do that; what it does is that it returns 1 to n codepoints as utf32 values; seemingly it's possible that a keypress with some meta keys pressed gives multiple codepoints. Yay for weird international keyboards I guess?

Anyway, I'm okay with having this code being committed as I'm decently sure of the low-level C implementation. I would like to run some architectural decisions by you, which also gives me a nice way of rubber-duck-reviewing this code. (For reference: I found the issue above by doing just that.)

I added what is now an .eff_text (used to be utf8_eff) field to the ncinput struct. This is an array of up to 4 codepoints that are the 'effective' text, in other words: if I were to go into a wordprocessor and held the meta keys in a certain way before pressing the key, '.eff_text' contains the codepoints of the text that would appear. I set the max to 4 code points because... well... I just guessed that there likely aren't any keyboards that will return more codepoints than that? I'm no foreign keyboard expert :) eff_text is also an UTF32 array with no corresponding UTF8 equivalent in the ncinput struct; this means that users that need utf8 need to call notcurses_ucs32_to_utf8 to convert the codepoints themselves.
The Kitty keyboard level is now set to 1+2+8+16 rather than 1+2+8. This makes it return the codepoints described above as extra fields concatenated to the input event strings. Same limit here: I wrote input automaton matching strings for 1, 2, 3 and 4 code points, if some weird keyboard returns more it won't match. This also assumes that if some terminal doesn't understand level 16, it will silently ignore it, making behaviour revert to what it was before my patch. (Although it does mean eff_text will not take meta keys into account anymore.)
If the Kitty protocol is not used, I assume that ncinput.id is set to whatever keycode the key generated and set .eff_text[0] to that. This means that code using .eff_text degrades gracefully if no Kitty keyboard extension is used.

Wrt testing: do you have any good targets wrt terminals? I tried Kitty, screen and xterm, and my testing procedure mostly consisted of a program I'm developing that effectively prints out the eff_text codepoints: I tried entering '123!@#abcABC' and in all cases it got printed out correctly. I also ran notcurses-input in the same terminals and entered the same string and it did the same thing it did before my patch (return 123!@#abcABC for xterm and screen, return 123123abcabc for kitty)

Spritetm commented 1 year ago

Also, not sure what the issue with the input automata is supposed to be? It looked like a fairly standard string matching thing, I got the needed matching rules correct the first time around.

dankamongmen commented 1 year ago

Also, not sure what the issue with the input automata is supposed to be? It looked like a fairly standard string matching thing, I got the needed matching rules correct the first time around.

i'm pleased to be talking to someone who considers string-matching automata fairly standard; they seem to blow a lot of folks' minds

Spritetm commented 1 year ago

Fyi, I also tried this on uxterm, roxterm, and wezterm; behavior is as expected. I also tried this using non-ascii-based keyboard layouts (Arabic in this case) and it seemed to... well, not break? I don't have an Arabic keyboard nor am I fluent in the language, so I can't say there aren't bugs. Still haven't found a keyboard layout that sends multiple codepoints when shifted, so I can't test that, but as the code for that is fairly trivial, I assume there's a somewhat small chance for bugs there.

If this is enough testing for you and the architecture isn't detestable, feel free to merge this.

dankamongmen commented 1 year ago

Fyi, I also tried this on uxterm, roxterm, and wezterm; behavior is as expected. I also tried this using non-ascii-based keyboard layouts (Arabic in this case) and it seemed to... well, not break? I don't have an Arabic keyboard nor am I fluent in the language, so I can't say there aren't bugs. Still haven't found a keyboard layout that sends multiple codepoints when shifted, so I can't test that, but as the code for that is fairly trivial, I assume there's a somewhat small chance for bugs there.

If this is enough testing for you and the architecture isn't detestable, feel free to merge this.

i think i shall merge it now. thanks a lot for the great work!

dankamongmen / notcurses

Add 'effective utf8' field to ncinput struct. #2696