alacritty / vte

Parser for virtual terminal emulators
https://docs.rs/vte/
Apache License 2.0
242 stars 56 forks source link

Process all non-ASCII bytes with the UTF-8 parser. #58

Closed sunfishcode closed 4 years ago

sunfishcode commented 4 years ago

The UTF-8 parser knows how to handle invalid byte sequences, so don't preprocess the input in the main parser; just hand any non-ASCII byte to the UTF-8 parser to handle.

This includes what were previously interpreted as 8-bit C1 control codes; they are now interpreted as UTF-8 continuation characters.

This is the first in a series of patches which together fix https://github.com/alacritty/vte/issues/38 and a few related issues, in total enabling vte to process non-UTF-8 bytes in a manner consistent with Rust's String::from_utf8_lossy.

Removing support for the 8-bit C1 codes isn't necessary to achieve this, and I can revise this patch to restore support if desired, but removing them does simplify the table and the API. And the 8-bit C1 codes are rarely used and not well supported anyway. vte already doesn't appear to recognize 8-bit DCS, CSI, OSC, PM, or APC, Alacritty doesn't appear to intepret 8-bit NEL, SS2, or SS3, and the remaining 8-bit C1 codes are rarely used.

sunfishcode commented 4 years ago

I've now posted #60 with the full series this is part of.

The motivation for this is that from a UTF-8 perspective, 8-bit C1 control codes are continuation bytes, and if they appear alone, they form invalid UTF-8 sequences. Passing them to execute allows users to emit replacement characters themselves, but it's a little awkward if vte translates some invalid sequences to replacement characters and not others. It also turns out to be tricky to emit the expected number of replacement characters from execute because vte is processing some of the bytes and not others. Since vte is emitting replacement characters for some invalid sequences, and has most of the logic needed to determine how many replacement characters to emit, the simplest approach is just to have vte emit the replacement characters.

chrisduerr commented 4 years ago

Closing in favor of #60, I'll take a look once that is in a reviewable state.