TragicWarrior / libvterm

Based on libROTE, libvterm is a color terminal emulator. It mimics vt100, rxvt, xterm, and xterm 256-color mode.

Make support for UTF-8 configurable #155

Closed: lwi closed this 4 years ago

lwi commented 4 years ago

Automatic detection of UTF-8 is wrong for charsets like ISO 8859-1 (aka Latin-1). Because Latin-1 uses the 8th bit, its text would be misidentified as bogus UTF-8 when it just needs to be passed through. When dealing with legacy applications that use such Latin-1 charsets rather than UTF-8, one must be able to turn off UTF-8 handling in order to interpret their output correctly.
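Purely as an illustration of the misdetection (this is not libvterm code): in ISO 8859-1 the byte 0xE9 is 'é', but a UTF-8 decoder reads it as the lead byte of a 3-byte sequence and then finds no continuation bytes after it.

#include <stdio.h>

/* Illustration only, not libvterm code: a Latin-1 'é' (0xE9) looks like
 * a UTF-8 3-byte lead byte, but the byte after it is plain ASCII, so a
 * UTF-8 validator flags the text as bogus even though it is fine as
 * Latin-1.                                                             */
int main(void)
{
    unsigned char latin1[] = { 'c', 'a', 'f', 0xE9, '!' };  /* "café!" in ISO 8859-1 */

    if ((latin1[3] & 0xF0) == 0xE0 && (latin1[4] & 0xC0) != 0x80)
        printf("0x%02X announces a 3-byte UTF-8 sequence, "
               "but 0x%02X is not a continuation byte\n",
               latin1[3], latin1[4]);

    return 0;
}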

TragicWarrior commented 4 years ago

@lwi, is this still relevant? If so, can you rebase? There are currently conflicts.

lwi commented 4 years ago

@TragicWarrior, it is. However, while working on this topic I might extend it a little. The UTF-8 handling could be improved: we could introduce an option flag for a strict UTF-8 mode that drops invalid bytes (except maybe C1 controls, of course).

TragicWarrior commented 4 years ago

@lwi, you are correct. The UTF-8 decoder in the library is relatively primitive. It doesn't comprehend things like invalid codepoint blocks and will most certainly confuse a multibyte encoding with a C1 control. Have you given any thought to what the code might look like if fixed up a bit?
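For reference, here is a rough sketch (the helper name is assumed, this is not the library's actual decoder) of the byte classification a less primitive decoder would start from. Note that a raw C1 control such as 0x9B (CSI) has the same high bits as a UTF-8 continuation byte, which is exactly where the confusion comes from.

/* Rough sketch, not the actual libvterm decoder: classify each input
 * byte by its high bits.  A byte in the 0x80-0xBF range is only a
 * continuation byte if a multibyte sequence is already open; seen on
 * its own it may be a raw C1 control (e.g. 0x9B = CSI) coming from an
 * 8-bit application.                                                   */
enum byte_class
{
    BYTE_ASCII,          /* 0xxxxxxx                                   */
    BYTE_CONTINUATION,   /* 10xxxxxx - raw C1 (0x80-0x9F) also lands here */
    BYTE_LEAD2,          /* 110xxxxx - starts a 2-byte sequence        */
    BYTE_LEAD3,          /* 1110xxxx - starts a 3-byte sequence        */
    BYTE_LEAD4,          /* 11110xxx - starts a 4-byte sequence        */
    BYTE_INVALID         /* 0xF8-0xFF - never valid UTF-8              */
};

static enum byte_class classify_byte(unsigned char b)
{
    if (b < 0x80)            return BYTE_ASCII;
    if ((b & 0xC0) == 0x80)  return BYTE_CONTINUATION;
    if ((b & 0xE0) == 0xC0)  return BYTE_LEAD2;
    if ((b & 0xF0) == 0xE0)  return BYTE_LEAD3;
    if ((b & 0xF8) == 0xF0)  return BYTE_LEAD4;
    return BYTE_INVALID;
}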

TragicWarrior commented 4 years ago

@lwi, can you rebase?

lwi commented 4 years ago

@TragicWarrior, I rebased. Instead of only having an "on/off" flag for UTF-8, we could provide a level of enforcement/support:

none - UTF-8 decoding disabled; bytes are passed through untouched

strict - invalid UTF-8 bytes are dropped, including unencoded C1 controls

strict_except_c1 - invalid UTF-8 bytes are dropped, but unencoded C1 controls are passed through

lazy - invalid UTF-8 bytes are passed through unchanged

Instead of just dropping invalid characters, one could replace them with an "invalid character" symbol (e.g. U+FFFD).
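As a rough illustration (the names below are hypothetical, not an existing libvterm API), the levels and the replacement symbol might be expressed like this:

/* Hypothetical sketch of the proposed enforcement levels; none of
 * these identifiers exist in libvterm today.                           */
typedef enum
{
    VTERM_UTF8_NONE,              /* UTF-8 decoding off, pass bytes through   */
    VTERM_UTF8_STRICT,            /* drop invalid bytes, including raw C1     */
    VTERM_UTF8_STRICT_EXCEPT_C1,  /* drop invalid bytes, keep raw C1 controls */
    VTERM_UTF8_LAZY               /* pass invalid bytes through unchanged     */
} vterm_utf8_mode_t;

/* U+FFFD REPLACEMENT CHARACTER, the usual stand-in for dropped input */
#define VTERM_UTF8_REPLACEMENT 0xFFFDu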

lwi commented 4 years ago

@TragicWarrior, thanks. What do you think about my suggestions?

TragicWarrior commented 4 years ago

@lwi, sorry, I took a few days off for Christmas.

I went ahead and merged in this PR. Generally, I agree with the objectives above.

The current code does not check for the markers in continuation bytes; that is pretty much an artifact of me being hasty, so I think the check should be added. What to do with unexpected / non-compliant bytes is a different matter. I suspect you know the answer to this better than I do: how often does malformed UTF-8 actually occur? The second question would be: do we care? I think how these questions are answered drives the relevance of these cases.

none - I have zero objection to this, but it shouldn't be the default. I personally have zero use for it, so it's kind of a shrug for me.

strict - In my understanding, invalid UTF-8 (continuation bytes with no marker) is supposed to be treated as Windows-1252 (as opposed to being dropped). I'm not sure about unencoded C1 bytes being dropped; it seems like that's the right thing to do because, afaik, they're supposed to be encoded. Did you observe otherwise with Fedora 31, though?

strict_except_c1 - Does this really happen (per my comment above regarding Fedora 31)? Are you aware of common cases where it does?

lazy - On this one, again, afaik, non-compliant continuation bytes should be regarded as Windows-1252, so passing them straight through seems closer to the intended behavior.

RE: "Instead of just dropping invalid characters one can replace it with "invalid character" symbol." - Agreed

TragicWarrior commented 4 years ago

@lwi, I've wanted to implement a generic exception handler in the library for a while now. This might be an opportune time to do so. Something like:

vterm_exceptions.c

int vterm_exception(vterm_t *vterm, int type, void *anything)

Where type could be:

enum
{
    VTERM_EX_C1_RAW,
    VTERM_EX_UTF8_MISSING_MARKER,
    VTERM_EX_UNHANDLED_CSI,
    VTERM_EX_SETCHAR_FAILED,
}

etc, etc...

The original idea was to clean up and consolidate the error handling that is sprinkled throughout the code and make it all more uniform.

In the case of UTF-8 decoding, the logic could go in here.
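To make that concrete, here is a sketch (still hypothetical, building on the declarations sketched above) of how the UTF-8 decoder might report a bad continuation byte through the common handler:

/* Sketch only, building on the hypothetical vterm_exception() above:
 * the UTF-8 decoder reports a byte that lacks the 10xxxxxx continuation
 * marker and lets the handler decide whether to drop, replace, or pass
 * it through.                                                          */
static int utf8_decode_byte(vterm_t *vterm, unsigned char b)
{
    if ((b & 0xC0) != 0x80)
        return vterm_exception(vterm, VTERM_EX_UTF8_MISSING_MARKER, &b);

    /* ...otherwise accumulate the byte into the pending code point... */
    return 0;
}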

lwi commented 4 years ago

strict - In my understanding, invalid UTF-8 (continuation bytes with no marker) is supposed to be treated as Windows-1252 (as opposed to being dropped). I'm not sure about unencoded C1 bytes being dropped; it seems like that's the right thing to do because, afaik, they're supposed to be encoded. Did you observe otherwise with Fedora 31, though?

The Fedora bug report suggested that some terminal emulators in fact do "encode UTF-8 first" and therefore drop unencoded C1 control sequences. In that "UTF-8 mode" they would also drop other invalid bytes (those with the highest bit set, as in Windows-1252 or Latin-1). If we were to implement such a mode, it would mimic those emulators.

strict_except_c1 - Does this really happen (per my comment above regarding Fedora 31)? Are you aware of common cases where it does?

I was just playing with the theory of what's possible; I am not aware of specific implementations. However, it is quite easy (for us) to add as an option.

lazy - On this one, again, afaik, non-compliant continuation bytes should be regarded as Windows-1252, so passing them straight through seems closer to the intended behavior.

Well, yes and no. Normally one does not mix UTF-8 and non-UTF-8 (like Windows-1252). However, in this mode we could partly support 8-bit encodings while in UTF-8 mode (which might have been chosen incorrectly).
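As a concrete example of that mixing problem (illustration only): the same character arrives as different bytes depending on the source application's encoding, so no single fixed interpretation handles both.

#include <stdio.h>

/* Illustration only: 'é' from a UTF-8 application versus 'é' from a
 * Latin-1 application.  A lazy mode would pass the bare 0xE9 through as
 * Latin-1; a strict mode would drop or replace it, since 0xE9 on its
 * own is not valid UTF-8.                                              */
int main(void)
{
    unsigned char from_utf8_app[]   = { 0xC3, 0xA9 };  /* 'é' as UTF-8      */
    unsigned char from_latin1_app[] = { 0xE9 };        /* 'é' as ISO 8859-1 */

    printf("UTF-8:   0x%02X 0x%02X\n", from_utf8_app[0], from_utf8_app[1]);
    printf("Latin-1: 0x%02X\n", from_latin1_app[0]);
    return 0;
}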