Handling invalid UTF-8 bytes

alacritty / vte

Parser for virtual terminal emulators

https://docs.rs/vte/

Apache License 2.0


Handling invalid UTF-8 bytes #38

Open sunfishcode opened 4 years ago

sunfishcode commented 4 years ago

I'm looking at using vte for a use case where I want to translate invalid UTF-8 bytes into Unicode replacement characters. However, vte seems to silently swallow some invalid UTF-8 bytes. For example, if I feed it input consisting of the byte 0x90, it produces no events.

Would it make sense to add Execute rules to the Ground table for 0x90 and other formerly special C1 codes?

Would it make sense to introduce something like an InvalidUtf8 action, to fill in the Ground table in general?

chrisduerr commented 4 years ago

Non-utf8 8-bit C1 escapes should be passed to execute, so you should be able to handle C1 codes if that's your issue?

sunfishcode commented 4 years ago

Here's a more specific testcase:

$ echo -e '\x90' > test.txt
$ target/debug/examples/parselog < test.txt
[execute] 0a
$

The 0x90 byte is silently dropped with no execute or any other action.

chrisduerr commented 4 years ago

\x90 is an escape introducer, which is stripped for security based on my understanding of the code.

So escapes like \x85 will emit an execute, but the DCS (\x90), CSI (\x9b), and OSC (\x9d) 8-bit escapes are ignored.
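To make the distinction above concrete, here is a minimal standalone sketch that sorts bytes into the groups mentioned in this comment. The `ByteClass` enum and `classify` function are hypothetical names invented for illustration and are not part of vte; only the byte values (C1 range 0x80..=0x9f, DCS 0x90, CSI 0x9b, OSC 0x9d) come from the discussion and the C1 control set.

```rust
/// Hypothetical classification, for illustration only (not vte's API).
#[derive(Debug, PartialEq, Eq)]
enum ByteClass {
    /// 0x90 (DCS), 0x9b (CSI), 0x9d (OSC): the 8-bit escape introducers
    /// that, according to this thread, are currently ignored.
    C1Introducer,
    /// Any other byte in the C1 range 0x80..=0x9f, such as 0x85 (NEL).
    C1Other,
    /// Everything outside the C1 range.
    Other,
}

fn classify(byte: u8) -> ByteClass {
    match byte {
        // Checked first, because these also fall inside the C1 range below.
        0x90 | 0x9b | 0x9d => ByteClass::C1Introducer,
        0x80..=0x9f => ByteClass::C1Other,
        _ => ByteClass::Other,
    }
}
```

With this sketch, `classify(0x90)` yields `C1Introducer`, `classify(0x85)` yields `C1Other`, and `classify(0xfd)` yields `Other`.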

sunfishcode commented 4 years ago

I don't actually want to interpret C1 controls in my use case; I want to replace all non-UTF-8 bytes with replacement characters.

Right now, vte doesn't support that, either for bytes like 0x90 which are C1 controls, or bytes like 0xfd which are not. Is this a use case vte is interested in supporting?
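For reference, the replacement behavior described here is roughly what the Rust standard library's lossy decoder does on a plain byte slice. The sketch below uses only `String::from_utf8_lossy`; the wrapper name `replace_invalid` is hypothetical, and nothing here involves vte's parser or escape handling.

```rust
/// Hypothetical helper: decode `input`, substituting U+FFFD (the Unicode
/// replacement character) for every invalid UTF-8 sequence.
/// This bypasses any terminal parsing; it only illustrates the desired output.
fn replace_invalid(input: &[u8]) -> String {
    String::from_utf8_lossy(input).into_owned()
}
```

Under this approach, both `b"\x90"` and `b"\xfd"` decode to a single U+FFFD. Note that it would also replace bytes such as 0x85, so it cannot be combined as-is with C1 handling.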

chrisduerr commented 4 years ago

> Is this a use case vte is interested in supporting?

I'm not sure if it's possible to support that without removing existing functionality.

Take the NEL non-UTF-8 8-bit C1 escape \x85, for example. We trigger the execute function with this byte attached, so it is a valid escape that we propagate upstream for handling. It's not actually invalid at all.

You could just handle C1 escapes in your application by printing the missing glyph symbol, would that be reasonable? As far as I can tell, all that would be required then would be to make them all available appropriately.

sunfishcode commented 4 years ago

> You could just handle C1 escapes in your application by printing the missing glyph symbol, would that be reasonable? As far as I can tell, all that would be required then would be to make them all available appropriately.

Yes, that's what I want to do. It's ok if vte reports these bytes through execute or a new invalid hook or some other hook. I just want to know when these bytes happen so that I know when to emit replacement characters.

Specifically, I want to do this for both C1 codes like 0x90, and non-C1 codes like 0xfd. I can cope if these two cases are reported differently, and it's even ok if the API doesn't tell me what the actual bytes are, as long as it provides indications that such bytes were processed.
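As a rough illustration of the kind of notification being asked for, the sketch below splits a byte slice into valid text and invalid byte runs using `std::str::from_utf8` and the `Utf8Error` methods `valid_up_to` and `error_len`. The `Event` enum and `scan` function are hypothetical and independent of vte; a real parser hook would likely look different, and a streaming parser would buffer a truncated trailing sequence rather than report it immediately.

```rust
/// Hypothetical event type, for illustration only (not vte's API).
#[derive(Debug, PartialEq, Eq)]
enum Event {
    /// A run of valid UTF-8 text.
    Text(String),
    /// A run of bytes that is not valid UTF-8.
    Invalid(Vec<u8>),
}

/// Walk `input`, reporting valid text and invalid byte runs in order.
fn scan(mut input: &[u8]) -> Vec<Event> {
    let mut events = Vec::new();
    while !input.is_empty() {
        match std::str::from_utf8(input) {
            Ok(text) => {
                // The whole remainder is valid.
                events.push(Event::Text(text.to_owned()));
                break;
            }
            Err(err) => {
                // Everything before `valid_up_to()` is valid UTF-8.
                let (valid, rest) = input.split_at(err.valid_up_to());
                if !valid.is_empty() {
                    events.push(Event::Text(String::from_utf8_lossy(valid).into_owned()));
                }
                // `error_len()` is None when the input ends mid-sequence;
                // this sketch then treats the whole tail as invalid.
                let bad = err.error_len().unwrap_or(rest.len());
                events.push(Event::Invalid(rest[..bad].to_vec()));
                input = &rest[bad..];
            }
        }
    }
    events
}
```

A caller that only wants replacement characters could emit one U+FFFD per `Event::Invalid` and ignore the byte values, which matches the requirement that the API need not expose the actual bytes.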

chrisduerr commented 4 years ago

For actually invalid UTF-8, we already print error glyphs (see echo -e "\xc2\xc2"). So as far as I can tell we'd probably just need to make sure that bytes that are ignored right now are somehow propagated (like C1 DCS/CSI/OSC).

For these specific bytes it would be possible to propagate them to the execute function without actually handling them. I'm not sure about other things like 0xfd, though; I'd have to look into that myself.