Open sunfishcode opened 4 years ago
Non-utf8 8-bit C1 escapes should be passed to execute
, so you should be able to handle C1 codes if that's your issue?
Here's a more specific testcase:
$ echo -e '\x90' > test.txt
$ target/debug/examples/parselog < test.txt
[execute] 0a
$
The 0x90 byte is silently dropped with no execute or any other action.
\x90 is an escape introducer, which, based on my understanding of the code, is stripped for security.
So escapes like \x85 will emit an execute, but the DCS (\x90)/CSI (\x9b)/OSC (\x9d) 8-bit escapes are ignored.
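To make the byte values above easier to keep apart, here is a minimal sketch that names the C1 controls mentioned in this thread. The function `c1_name` is a hypothetical helper for illustration only and is not part of vte; the byte assignments follow the standard 8-bit C1 range (0x80 to 0x9f).

```rust
/// Hypothetical helper (not part of vte): name the 8-bit C1 controls
/// discussed in this thread. Other bytes in 0x80..=0x9f are reported as
/// "other C1"; anything outside that range is not a C1 control.
fn c1_name(byte: u8) -> Option<&'static str> {
    match byte {
        0x85 => Some("NEL"), // next line
        0x90 => Some("DCS"), // device control string introducer
        0x9b => Some("CSI"), // control sequence introducer
        0x9d => Some("OSC"), // operating system command introducer
        0x80..=0x9f => Some("other C1"),
        _ => None,
    }
}

fn main() {
    for &byte in &[0x85u8, 0x90, 0x9b, 0x9d, 0xfd] {
        println!("{:#04x}: {:?}", byte, c1_name(byte));
    }
}
```

Note that 0xfd falls outside the C1 range, which is why it comes up as a separate case later in the thread.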
I don't actually want to interpret C1 controls in my use case; I want to replace all non-UTF-8 bytes with replacement characters.
Right now, vte doesn't support that, either for bytes like 0x90 which are C1 controls, or bytes like 0xfd which are not. Is this a use case vte is interested in supporting?
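For reference, this is roughly the end result being asked for on plain text, sketched with the standard library only. It does not involve vte or any escape-sequence parsing; `lossy` is a hypothetical wrapper name around `String::from_utf8_lossy`.

```rust
/// Hypothetical wrapper: replace every invalid UTF-8 sequence in `bytes`
/// with U+FFFD REPLACEMENT CHARACTER. No escape sequences are parsed here.
fn lossy(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // The test case from this thread: 0x90 followed by a newline.
    println!("{:?}", lossy(b"\x90\n"));
    // A byte that is neither a C1 control nor ever valid in UTF-8.
    println!("{:?}", lossy(&[0xfd]));
}
```

Both 0x90 and 0xfd come out as a replacement character here, since neither byte is valid UTF-8 on its own.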
> Is this a use case vte is interested in supporting?
I'm not sure if it's possible to support that without removing existing functionality.
Take things like the NEL non-UTF-8 8-bit C1 escape \x85. We trigger the execute function for that with this byte attached. So it's a valid escape that we propagate upstream for handling; it's not actually invalid at all.
You could just handle C1 escapes in your application by printing the missing glyph symbol; would that be reasonable? As far as I can tell, all that would be required then would be to make them all available appropriately.
> You could just handle C1 escapes in your application by printing the missing glyph symbol, would that be reasonable? As far as I can tell, all that would be required then would be to make them all available appropriately.
Yes, that's what I want to do. It's ok if vte reports these bytes through execute or a new invalid hook or some other hook. I just want to know when these bytes happen so that I know when to emit replacement characters.
Specifically, I want to do this for both C1 codes like 0x90, and non-C1 codes like 0xfd. I can cope if these two cases are reported differently, and it's even ok if the API doesn't tell me what the actual bytes are, as long as it provides indications that such bytes were processed.
For actually invalid UTF-8, we already print error glyphs (see echo -e "\xc2\xc2"). So as far as I can tell we'd probably just need to make sure that bytes that are ignored right now are somehow propagated (like C1 DCS/CSI/OSC).
For these specific bytes it would be possible to propagate them to the execute function without actually handling them, though I'm not sure about other things like 0xfd; I'd have to look into that myself.
I'm looking at using vte for a use case where I want to translate invalid UTF-8 bytes into Unicode replacement characters. However, vte seems to silently swallow some invalid UTF-8 bytes. For example, if I feed it input consisting of the byte 0x90, it produces no events.
Would it make sense to add Execute rules to the Ground table for 0x90 and other formerly special C1 codes?
Would it make sense to introduce something like an InvalidUtf8 action, to fill in the Ground table in general?
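A rough sketch of what the two questions above could amount to, written as a standalone byte classifier. This is not vte's actual Ground table: the `Action` enum, its variants, and `proposed_ground_action` are hypothetical names used only to illustrate the idea, and bytes already covered by existing rules are lumped together as `Unchanged`.

```rust
/// Hypothetical actions for illustration; not vte's real action set.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Action {
    /// Report the byte through the execute hook.
    Execute,
    /// Report that an invalid UTF-8 byte was seen (the proposed new action).
    InvalidUtf8,
    /// Left to whatever rule already applies (ASCII, UTF-8 lead bytes, ...).
    Unchanged,
}

/// Assumed mapping for a byte seen in the ground state under the proposal.
fn proposed_ground_action(byte: u8) -> Action {
    match byte {
        // All 8-bit C1 controls, including 0x90, 0x9b and 0x9d.
        0x80..=0x9f => Action::Execute,
        // Stray continuation bytes (0xa0..=0xbf), plus bytes that can never
        // occur in valid UTF-8 (0xc0, 0xc1 and 0xf5..=0xff).
        0xa0..=0xc1 | 0xf5..=0xff => Action::InvalidUtf8,
        _ => Action::Unchanged,
    }
}

fn main() {
    for &byte in &[0x0au8, 0x85, 0x90, 0xc2, 0xfd] {
        println!("{:#04x} -> {:?}", byte, proposed_ground_action(byte));
    }
}
```

Under this sketch an application could emit a replacement character for every `Execute` on a C1 byte and for every `InvalidUtf8`, which would cover both the 0x90 and the 0xfd case.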