Closed whorfin closed 2 years ago
Screenshot:
Text file: busted-extension.txt
Is this an issue that you simply bumped into when pasting things into your editor, or you actually want non-UTF8 support?
FWIW I doubt Thunderbird can actually send it out even if I give it the original contents. I pasted the example text into Thunderbird editor directly and sent it out, and Thunderbird just filtered them out seemingly. So I wonder if it's good enough for me to do something similar?
I have received emails from others which cause the extension to fail when I attempt to reply to these emails. These users are sending non-UTF8 characters, for some reason. It seems to be related to some kind of fancy formatting.
It is not appropriate for me to include their emails in a bug report, so I went to some trouble to find a minimal reproduction case. In fact, this 0x95
character was in one of these emails.
I tried a few quick tests, and it appears Thunderbird might be sanitizing the clipboard to Unicode - when I paste that content into Thunderbird and then read it back, I see the 0x95
was transformed to 0xc295
.
Ideally, using the external editor would not alter the original text - this is particularly relevant when _reply_ing to a message containing such characters.
I have found a work-around of sorts, where by forcing utf-8 encoding in my text editor when it detects an .eml
file, I can then get the extension to work, though correspondents who care will note that my reply has subtly altered their message.
With that in mind, it seems it would be reasonable for the extension to do something similar, perhaps reading the .eml
file to bytes and then employing String::from_utf8_lossy()
?
Cheers
You had a good point regarding cited texts, so I did a bit more testing.
// in debug console
// ExtEditorR sending: Object { configuration: {…}, tab: {…}, composeDetails: {…} }
// saving above as temp0
var composeDetails = temp0.composeDetails
composeDetails.plainTextBody = atob('PiB0aGlzIGNoYXJhY3RlciBjYXVzZXMgYmFkbmVzczoNCj4gICCVDQo=') // base64 of your example text
btoa(composeDetails.plainTextBody)
// "PiB0aGlzIGNoYXJhY3RlciBjYXVzZXMgYmFkbmVzczoNCj4gICCVDQo="
messenger.compose.setComposeDetails(temp0.tab.id, composeDetails)
...then I sent it out.
However in Sent (and Inbox), the email became:
--------------lcR62x9ounxJKfZP9FORRnwb
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: base64
PiB0aGlzIGNoYXJhY3RlciBjYXVzZXMgYmFkbmVzczoNCj4gICDClQ0K
--------------lcR62x9ounxJKfZP9FORRnwb--
$ printf 'PiB0aGlzIGNoYXJhY3RlciBjYXVzZXMgYmFkbmVzczoNCj4gICDClQ0K' | base64 -d | uniname
No LINES variable in environment so unable to determine lines per page.
Using default of 24.
character byte UTF-32 encoded as glyph name
0 0 00003E 3E > GREATER-THAN SIGN
1 1 000020 20 SPACE
2 2 000074 74 t LATIN SMALL LETTER T
3 3 000068 68 h LATIN SMALL LETTER H
4 4 000069 69 i LATIN SMALL LETTER I
5 5 000073 73 s LATIN SMALL LETTER S
6 6 000020 20 SPACE
7 7 000063 63 c LATIN SMALL LETTER C
8 8 000068 68 h LATIN SMALL LETTER H
9 9 000061 61 a LATIN SMALL LETTER A
10 10 000072 72 r LATIN SMALL LETTER R
11 11 000061 61 a LATIN SMALL LETTER A
12 12 000063 63 c LATIN SMALL LETTER C
13 13 000074 74 t LATIN SMALL LETTER T
14 14 000065 65 e LATIN SMALL LETTER E
15 15 000072 72 r LATIN SMALL LETTER R
16 16 000020 20 SPACE
17 17 000063 63 c LATIN SMALL LETTER C
18 18 000061 61 a LATIN SMALL LETTER A
19 19 000075 75 u LATIN SMALL LETTER U
20 20 000073 73 s LATIN SMALL LETTER S
21 21 000065 65 e LATIN SMALL LETTER E
character byte UTF-32 encoded as glyph name
22 22 000073 73 s LATIN SMALL LETTER S
23 23 000020 20 SPACE
24 24 000062 62 b LATIN SMALL LETTER B
25 25 000061 61 a LATIN SMALL LETTER A
26 26 000064 64 d LATIN SMALL LETTER D
27 27 00006E 6E n LATIN SMALL LETTER N
28 28 000065 65 e LATIN SMALL LETTER E
29 29 000073 73 s LATIN SMALL LETTER S
30 30 000073 73 s LATIN SMALL LETTER S
31 31 00003A 3A : COLON
32 32 00000D 0D CARRIAGE RETURN (CR)
33 33 00000A 0A LINE FEED (LF)
34 34 00003E 3E > GREATER-THAN SIGN
35 35 000020 20 SPACE
36 36 000020 20 SPACE
37 37 000020 20 SPACE
38 38 000095 C2 95 MESSAGE WAITING
39 40 00000D 0D CARRIAGE RETURN (CR)
40 41 00000A 0A LINE FEED (LF)
Same thing happened after I disabled PGP (it didn't use base64 in email source but still sanitised the body):
416 416 00003E 3E > GREATER-THAN SIGN
417 417 000020 20 SPACE
character byte UTF-32 encoded as glyph name
418 418 000074 74 t LATIN SMALL LETTER T
419 419 000068 68 h LATIN SMALL LETTER H
420 420 000069 69 i LATIN SMALL LETTER I
421 421 000073 73 s LATIN SMALL LETTER S
422 422 000020 20 SPACE
423 423 000063 63 c LATIN SMALL LETTER C
424 424 000068 68 h LATIN SMALL LETTER H
425 425 000061 61 a LATIN SMALL LETTER A
426 426 000072 72 r LATIN SMALL LETTER R
427 427 000061 61 a LATIN SMALL LETTER A
428 428 000063 63 c LATIN SMALL LETTER C
429 429 000074 74 t LATIN SMALL LETTER T
430 430 000065 65 e LATIN SMALL LETTER E
431 431 000072 72 r LATIN SMALL LETTER R
432 432 000020 20 SPACE
433 433 000063 63 c LATIN SMALL LETTER C
434 434 000061 61 a LATIN SMALL LETTER A
435 435 000075 75 u LATIN SMALL LETTER U
436 436 000073 73 s LATIN SMALL LETTER S
437 437 000065 65 e LATIN SMALL LETTER E
438 438 000073 73 s LATIN SMALL LETTER S
439 439 000020 20 SPACE
character byte UTF-32 encoded as glyph name
440 440 000062 62 b LATIN SMALL LETTER B
441 441 000061 61 a LATIN SMALL LETTER A
442 442 000064 64 d LATIN SMALL LETTER D
443 443 00006E 6E n LATIN SMALL LETTER N
444 444 000065 65 e LATIN SMALL LETTER E
445 445 000073 73 s LATIN SMALL LETTER S
446 446 000073 73 s LATIN SMALL LETTER S
447 447 00003A 3A : COLON
448 448 00000D 0D CARRIAGE RETURN (CR)
449 449 00000A 0A LINE FEED (LF)
450 450 00003E 3E > GREATER-THAN SIGN
451 451 000020 20 SPACE
452 452 000020 20 SPACE
453 453 000020 20 SPACE
454 454 000095 C2 95 MESSAGE WAITING
455 456 00000D 0D CARRIAGE RETURN (CR)
456 457 00000A 0A LINE FEED (LF)
I was thinking that I could base64 the body and send it over to Thunderbird. But unfortunately as you can see, even if I did that, eventually you still wouldn't have the original unsanitised body. (I'm actually a little curious what email client the sender used. I tried git send-email
and even Git sanitises texts.)
I also tested replying to such an email with invalid UTF-8 via EER, just to make sure it doesn't send EER host some JSON that Rust refuses to decode. And it turned out the moment Thunderbird cited the email in its editor, the text had already been sanitised, so EER doesn't need to worry about this (also means it's impossible for EER to get the original email).
So I'll go with from_utf8_lossy()
as you mentioned. If things ever change I'm happy to touch up the implementation.
Thank you very much for your diligence.
To answer your curiosity, I admit my first thought was "I'll bet they used Outlook" because the formatting looked very Office-y
I went back and checked one set of headers and yes that appears to be the case - I see lots of X-MS-Exchange
headers, and references to outlook.com
and outlook.office365.com
domains.
UTF-8
seems like a safe and sane choice
Feel free to give the artefacts of #68 a shot when the build finishes :)
Did a quick test and, while it handles conversion to utf-8 differently than my text editor, "that's just Rust" - it certainly works and avoids the failure. 🎉
Description
Greetings! Thank you for making this fabulous extension, I've missed external editors in Thunderbird and this brings them back nicely, And thanks for fixing the CR+LF issue!
After some bizarre experiences with partially mangled emails I think I've managed to narrow down a bug involving non-ASCII file contents. I have correspondents from fancy organizations who like to use "clever" formatting, and some of these character combinations are recognized as "Non-ISO extended-ASCII text" (according to
file
).Keeping their text in a reply results, when lucky, in ExtEditorR throwing a "failed to process temporary file" message with the infamous Rust "stream did not contain valid UTF-8". I've seen other scenarios where my text is quietly truncated but suspect this is the root cause. I'm attaching a screencap of the message, along with a minimal text file which, if copy-pasted into eg a
vim
window editing a mail body, will throw the error.If I copy/paste that same text directly into the thunderbird message body, everything is fine.
This is likely a result of the file being
read_line()
'd into aString
, which are required to be UTF-8 in Rust. Now I fall off the edge of my Rust knowledge; I think a byte array (Vec
?) might be needed?Environment
external-editor-revived --version
output: External Editor Revived native messaging host for windows (x86_64) v0.4.1Configuration
Using gvim or vim, on Windows or Linux Note that depending on the fileencoding set, eg if utf-8 is forced, vim will "fix" the issue on ingest. Using a fileencoding of
latin1
ornone
such that the bytes are not altered, the error is easily seen.Logs
ExtEditorR sending:
Object { configuration: {…}, tab: {…}, composeDetails: {…} } background.js:115:11 ExtEditorR received:
Object { tab: {…}, title: "ExtEditorR failed to process temporary file", message: "stream did not contain valid UTF-8.\nYou can try recovering data from d:\tmp\external_editor_revived_3.eml" } background.js:125:11 stderr output from native app external_editor_revived: ExtEditorR failed to process temporary file: stream did not contain valid UTF-8. stderr output from native app external_editor_revived: You can try recovering data from d:\tmp\external_editor_revived_3.eml
Screenshots
I can't upload either screenshot or txt file at the moment, getting a "Something went really wrong"; i'll finish this then try to add those as further comments