NatTupper / dline

A feature-rich terminal Discord client
GNU General Public License v3.0

Input fixes #6

Open brliron opened 6 years ago

brliron commented 6 years ago

The while loops in the gc.ui_thread.funcs blocks are there so that functions actually finish executing before the caller moves on. Leaving them out would cause problems.

I thought they weren't needed, but after thinking about it, I can see a lot of reasons why they may be. I'll put them back.
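A hypothetical sketch of one alternative (I haven't seen gc.ui_thread.funcs, so the queue name and helper here are my assumptions, not Discline's code): instead of spinning in a while loop until the UI thread has run the function, a threading.Event lets the caller block without burning CPU.

```python
import queue
import threading

def run_on_ui_thread(func_queue, func):
    """Queue func for the UI thread and block until it has run."""
    done = threading.Event()
    def wrapper():
        func()
        done.set()            # signal the caller once func has finished
    func_queue.put(wrapper)
    done.wait()               # returns only after the UI thread ran func
```

The UI thread would keep popping callables off the queue and calling them, exactly as before; only the waiting side changes.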

Wide characters take up more memory. Normal characters still work with Unicode, specifically UTF-8. Using wide characters may also cause locale issues, since that would be UTF-16, and mixing UTF-8 and UTF-16 is messy.

It messes up the cursor on my end. Take the é character as an example: https://www.fileformat.info/info/unicode/char/e9/index.htm. Its UTF-8 encoding is 0xC3 0xA9. With getch, you read 2 values: first 0xC3, then 0xA9, and they take 2 slots in the inputBuffer list. With the string "étude", inputBuffer ends up with a size of 6, and the cursor calculations use that size of 6, with the "é" counting as 2. If the cursor is right after the letter "d", pressing backspace removes the "u" instead of the "d". Every non-ASCII character you type makes it worse. I really don't want to try Japanese or Cyrillic...

About "more memory" and "mixing UTF-8 and UTF-16", here is how I understand things. The encoding of the Python 3 str type is an implementation detail. It may be UTF-8; it may be UTF-16. Either way it always works, and Python code should expect it to work for any Unicode character without thinking about implementation details. And the implementation has actually changed: up to Python 3.2, strings used UCS-2 (on narrow builds, e.g. Windows) or UCS-4 (on wide builds, common on Linux); since Python 3.3, each string uses Latin-1, UCS-2 or UCS-4 as needed. See https://stackoverflow.com/a/9079985 and https://www.python.org/dev/peps/pep-0393/.

Every time you specify an encoding, you specify an input encoding. For example, in the old version of the code, in bytearray(self.inputBuffer).decode("utf-8"), the "utf-8" is the encoding of the bytearray. The string it creates always uses whichever internal encoding Python decides on.

Now, about get_wch. The name of the Python get_wch function is misleading for a C programmer, because it comes straight from the C get_wch function (https://linux.die.net/man/3/get_wch). The C function does return a "wide" character (probably in UTF-32), but the Python 3 function returns a str object, using the same internal encoding as every other Python str.

And because I wanted to check that, I went into the Python source code. The implementation of get_wch is here: https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Modules/_cursesmodule.c#L1320. The character returned by the C get_wch is converted to a Python string with PyUnicode_FromOrdinal, which is documented here: https://github.com/python/cpython/blob/e42b705188271da108de42b55d9344642170aa2b/Include/unicodeobject.h#L1075.
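The backspace bug described above can be reproduced in plain Python; input_buffer here mirrors a byte-per-slot buffer fed by getch:

```python
# "é" is 0xC3 0xA9 in UTF-8, so a getch()-fed buffer stores it in 2 slots.
text = "étude"
input_buffer = list(text.encode("utf-8"))
print(len(input_buffer))   # 6 slots for 5 visible characters
print(len(text))           # 5

cursor = 4                                   # visually: right after the "d"
del input_buffer[cursor - 1]                 # backspace using character maths
print(bytes(input_buffer).decode("utf-8"))   # "étde": the "u" vanished, not the "d"
```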

I didn't think this wall of text would end up being that big.

The if ch == -1 is necessary because I've set the input window to nodelay. If it's omitted, execution continues and causes issues. You did remove the nodelay, and it may not be necessary now that Discline-curses uses threads and no longer relies on asyncio for everything. I'll run some tests to see if it's still required.

While the Python getch is documented to return -1 when no input is available in no-delay mode, the Python get_wch is documented to raise an exception instead (https://docs.python.org/3/library/curses.html#curses.window.getch). If I understand correctly, getch never returns -1 in blocking mode, and get_wch never returns -1 in any mode.
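Under that reading, a no-delay read with get_wch has to treat the exception as the "no input" signal. A minimal sketch (the helper name is mine, not Discline's):

```python
import curses

def read_wch_nonblocking(win):
    """Return the next character from a window in nodelay mode,
    or None when no input is pending.

    getch() signals "no input" by returning -1; get_wch() raises
    curses.error instead, so the exception replaces the -1 check.
    """
    try:
        return win.get_wch()
    except curses.error:
        return None
```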

The input buffer needs to be a list since characters need to be inserted in various places according to where the cursor is. Strings are immutable and therefore a new string would need to be created each and every time there is a change.

Yeah, I worked around that by creating a new string every time, but a list of characters (where each character is a string of size 1) would probably be more efficient. I'll change that.
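A list of 1-character strings keeps all the cursor arithmetic in character units; a quick sketch of backspace and insertion on such a buffer:

```python
input_buffer = list("étude")       # ['é', 't', 'u', 'd', 'e']: one slot per char
cursor = 4                         # right after the "d"

del input_buffer[cursor - 1]       # backspace now removes the "d", as expected
print("".join(input_buffer))       # étue

cursor -= 1
input_buffer.insert(cursor, "a")   # type "a" at the cursor position
print("".join(input_buffer))       # étuae
```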

The various time.sleep lines were originally necessary because of Discline-curses' dependence on asyncio. They are probably no longer required, as you noticed.

I guessed they were there to avoid eating 100% of the CPU by repeatedly asking "do you have a key?" "And now?" "Still no new key?" "Gimme a new key plz", which using blocking mode fixes.

NatTupper commented 6 years ago

Wide chars aren't required because multibyte characters are caught one segment at a time: if a character is 3 bytes long, getch() takes the first byte, and the next time it is run it grabs the second, then the third. The reason I used a list was that getch() outputs an integer equal to the value of a character, or of one section of a character. I had tried converting them to characters with chr() and then storing them, but they never combined correctly. So to store them correctly, you need some sort of iterable that holds each byte and combines them when they're needed.

Before I implemented this, non-ASCII characters could not be entered at all. However, I didn't take into account Cyrillic, extended Latin, Greek, etc. I did take into account CJK (Chinese, Japanese, Korean); those should work. There's a little function that gets the correct display width of each character and lets you enter and erase them correctly, but it only works for CJK. I'll implement extended Latin and Cyrillic next.
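The accumulate-then-combine approach described above can be sketched with a tiny helper (the name is mine, not Discline's): bytes from getch() are buffered until they decode to a complete character.

```python
def feed_byte(pending, byte):
    """Append one getch() byte to `pending`; return (pending, char).

    `char` is None until the buffered bytes form a complete UTF-8
    sequence. (A truly invalid sequence would accumulate forever;
    real input code would also reset after 4 undecodable bytes.)
    """
    pending += bytes([byte])
    try:
        return b"", pending.decode("utf-8")
    except UnicodeDecodeError:
        return pending, None

# "é" arrives as two getch() reads: 0xC3, then 0xA9.
buf, ch = feed_byte(b"", 0xC3)
print(ch)                  # None: sequence still incomplete
buf, ch = feed_byte(buf, 0xA9)
print(ch)                  # é
```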

There's another problem with wide chars: they may not work with characters more than 2 bytes in length. Wide chars are UTF-16, so they store 2 bytes. I'm not sure they'd be stored correctly if captured using get_wch(), which is why it's better to use getch().

You're right that nodelay mode is no longer required because Discline-curses now uses threads for most things. I can remove that and clean up some code.

brliron commented 6 years ago

There's another problem with wide chars: they may not work with characters more than 2 bytes in length. Wide chars are UTF-16, so they store 2 bytes. I'm not sure they'd be stored correctly if captured using get_wch(), which is why it's better to use getch().

Everything I found leads me to think this is wrong. I didn't find where the curses C library converts its characters, but the API returns a wint_t, which is 4 bytes wide. Then the Python wrapper uses PyUnicode_FromOrdinal. That function's parameter is an int (4 bytes) that "must be in range(0x110000)", which is the full Unicode range. It returns a Python string, which uses UCS-4 internally when needed (or a 1-byte representation when the string contains only simple characters, because that uses less memory).
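This is easy to check: a Python 3 str holds one element per code point regardless of how many UTF-8 bytes it takes, which is also what get_wch() hands back per keypress:

```python
# 2-, 3- and 4-byte UTF-8 sequences; each is a single code point,
# hence a single element of a str.
for ch in ("é", "語", "𝄞"):
    print(ch, len(ch), len(ch.encode("utf-8")))
```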

And I'm not talking about #2. Actually, my pull request has the #2 bug and I should fix it. But #2 is about fonts, with Asian characters being displayed twice as wide as other characters. The bug I tried to fix causes 2 issues:

If you really don't want to use get_wch to fix this, you can compute the width of a Unicode character yourself and do the maths yourself. The table at https://en.wikipedia.org/wiki/UTF-8#Description describes how to do it.
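A sketch of that approach, reading the sequence length off the lead byte as in the linked table (the function name is mine):

```python
def utf8_seq_len(lead):
    """Total bytes in the UTF-8 sequence that starts with this byte."""
    if lead < 0x80:
        return 1      # 0xxxxxxx: ASCII
    if lead < 0xC0:
        return 0      # 10xxxxxx: continuation byte, not a valid lead
    if lead < 0xE0:
        return 2      # 110xxxxx
    if lead < 0xF0:
        return 3      # 1110xxxx
    return 4          # 11110xxx

print(utf8_seq_len(0x41))   # 1 ("A")
print(utf8_seq_len(0xC3))   # 2 (lead byte of "é")
```

With this, the input code can group the next N getch() bytes into one logical character before doing any cursor arithmetic.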

NatTupper commented 6 years ago

I'd happily merge your pull request once I decide whether I want get_wch() to be in use. Once I comment here about that, if you could modify your pull request accordingly, I'll test your modifications and merge them once I'm satisfied with them.