Frederick888 / external-editor-revived

External Editor Revived is a Thunderbird MailExtension which allows editing emails in programs such as Vim, Neovim, Emacs, etc.
GNU General Public License v3.0
113 stars 6 forks source link

"stream did not contain valid UTF-8" error #65

Closed whorfin closed 2 years ago

whorfin commented 2 years ago

Description

Greetings! Thank you for making this fabulous extension, I've missed external editors in Thunderbird and this brings them back nicely, And thanks for fixing the CR+LF issue!

After some bizarre experiences with partially mangled emails I think I've managed to narrow down a bug involving non-ASCII file contents. I have correspondents from fancy organizations who like to use "clever" formatting, and some of these character combinations are recognized as "Non-ISO extended-ASCII text" (according to file).

Keeping their text in a reply results, when lucky, in ExtEditorR throwing a "failed to process temporary file" message with the infamous Rust "stream did not contain valid UTF-8". I've seen other scenarios where my text is quietly truncated but suspect this is the root cause. I'm attaching a screencap of the message, along with a minimal text file which, if copy-pasted into eg a vim window editing a mail body, will throw the error.

If I copy/paste that same text directly into the thunderbird message body, everything is fine.

This is likely a result of the file being read_line()'d into a String, which are required to be UTF-8 in Rust. Now I fall off the edge of my Rust knowledge; I think a byte array (Vec?) might be needed?

Environment

Configuration

Using gvim or vim, on Windows or Linux Note that depending on the fileencoding set, eg if utf-8 is forced, vim will "fix" the issue on ingest. Using a fileencoding of latin1 or none such that the bytes are not altered, the error is easily seen.

Logs

ExtEditorR sending:
Object { configuration: {…}, tab: {…}, composeDetails: {…} } background.js:115:11 ExtEditorR received:
Object { tab: {…}, title: "ExtEditorR failed to process temporary file", message: "stream did not contain valid UTF-8.\nYou can try recovering data from d:\tmp\external_editor_revived_3.eml" } background.js:125:11 stderr output from native app external_editor_revived: ExtEditorR failed to process temporary file: stream did not contain valid UTF-8. stderr output from native app external_editor_revived: You can try recovering data from d:\tmp\external_editor_revived_3.eml

Screenshots

I can't upload either screenshot or txt file at the moment, getting a "Something went really wrong"; i'll finish this then try to add those as further comments

whorfin commented 2 years ago

Screenshot: extension-troubles

Text file: busted-extension.txt

Frederick888 commented 2 years ago

Is this an issue that you simply bumped into when pasting things into your editor, or you actually want non-UTF8 support?

FWIW I doubt Thunderbird can actually send it out even if I give it the original contents. I pasted the example text into Thunderbird editor directly and sent it out, and Thunderbird just filtered them out seemingly. So I wonder if it's good enough for me to do something similar?

whorfin commented 2 years ago

I have received emails from others which cause the extension to fail when I attempt to reply to these emails. These users are sending non-UTF8 characters, for some reason. It seems to be related to some kind of fancy formatting. It is not appropriate for me to include their emails in a bug report, so I went to some trouble to find a minimal reproduction case. In fact, this 0x95 character was in one of these emails.

I tried a few quick tests, and it appears Thunderbird might be sanitizing the clipboard to Unicode - when I paste that content into Thunderbird and then read it back, I see the 0x95 was transformed to 0xc295.

Ideally, using the external editor would not alter the original text - this is particularly relevant when _reply_ing to a message containing such characters.

I have found a work-around of sorts, where by forcing utf-8 encoding in my text editor when it detects an .eml file, I can then get the extension to work, though correspondents who care will note that my reply has subtly altered their message.

With that in mind, it seems it would be reasonable for the extension to do something similar, perhaps reading the .eml file to bytes and then employing String::from_utf8_lossy()?

Cheers

Frederick888 commented 2 years ago

You had a good point regarding cited texts, so I did a bit more testing.

// in debug console
// ExtEditorR sending:  Object { configuration: {…}, tab: {…}, composeDetails: {…} }
// saving above as temp0
var composeDetails = temp0.composeDetails
composeDetails.plainTextBody = atob('PiB0aGlzIGNoYXJhY3RlciBjYXVzZXMgYmFkbmVzczoNCj4gICCVDQo=')  // base64 of your example text
btoa(composeDetails.plainTextBody)
// "PiB0aGlzIGNoYXJhY3RlciBjYXVzZXMgYmFkbmVzczoNCj4gICCVDQo="
messenger.compose.setComposeDetails(temp0.tab.id, composeDetails)

...then I sent it out.

However in Sent (and Inbox), the email became:

--------------lcR62x9ounxJKfZP9FORRnwb
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: base64

PiB0aGlzIGNoYXJhY3RlciBjYXVzZXMgYmFkbmVzczoNCj4gICDClQ0K

--------------lcR62x9ounxJKfZP9FORRnwb--
$ printf 'PiB0aGlzIGNoYXJhY3RlciBjYXVzZXMgYmFkbmVzczoNCj4gICDClQ0K' | base64 -d | uniname
No LINES variable in environment so unable to determine lines per page.
Using default of 24.
character  byte       UTF-32   encoded as     glyph   name
        0          0  00003E   3E             >      GREATER-THAN SIGN
        1          1  000020   20                     SPACE
        2          2  000074   74             t      LATIN SMALL LETTER T
        3          3  000068   68             h      LATIN SMALL LETTER H
        4          4  000069   69             i      LATIN SMALL LETTER I
        5          5  000073   73             s      LATIN SMALL LETTER S
        6          6  000020   20                     SPACE
        7          7  000063   63             c      LATIN SMALL LETTER C
        8          8  000068   68             h      LATIN SMALL LETTER H
        9          9  000061   61             a      LATIN SMALL LETTER A
       10         10  000072   72             r      LATIN SMALL LETTER R
       11         11  000061   61             a      LATIN SMALL LETTER A
       12         12  000063   63             c      LATIN SMALL LETTER C
       13         13  000074   74             t      LATIN SMALL LETTER T
       14         14  000065   65             e      LATIN SMALL LETTER E
       15         15  000072   72             r      LATIN SMALL LETTER R
       16         16  000020   20                     SPACE
       17         17  000063   63             c      LATIN SMALL LETTER C
       18         18  000061   61             a      LATIN SMALL LETTER A
       19         19  000075   75             u      LATIN SMALL LETTER U
       20         20  000073   73             s      LATIN SMALL LETTER S
       21         21  000065   65             e      LATIN SMALL LETTER E
character  byte       UTF-32   encoded as     glyph   name
       22         22  000073   73             s      LATIN SMALL LETTER S
       23         23  000020   20                     SPACE
       24         24  000062   62             b      LATIN SMALL LETTER B
       25         25  000061   61             a      LATIN SMALL LETTER A
       26         26  000064   64             d      LATIN SMALL LETTER D
       27         27  00006E   6E             n      LATIN SMALL LETTER N
       28         28  000065   65             e      LATIN SMALL LETTER E
       29         29  000073   73             s      LATIN SMALL LETTER S
       30         30  000073   73             s      LATIN SMALL LETTER S
       31         31  00003A   3A             :      COLON
       32         32  00000D   0D                     CARRIAGE RETURN (CR)
       33         33  00000A   0A                     LINE FEED (LF)
       34         34  00003E   3E             >      GREATER-THAN SIGN
       35         35  000020   20                     SPACE
       36         36  000020   20                     SPACE
       37         37  000020   20                     SPACE
       38         38  000095   C2 95                  MESSAGE WAITING
       39         40  00000D   0D                     CARRIAGE RETURN (CR)
       40         41  00000A   0A                     LINE FEED (LF)

Same thing happened after I disabled PGP (it didn't use base64 in email source but still sanitised the body):

      416        416  00003E   3E             >      GREATER-THAN SIGN
      417        417  000020   20                     SPACE
character  byte       UTF-32   encoded as     glyph   name
      418        418  000074   74             t      LATIN SMALL LETTER T
      419        419  000068   68             h      LATIN SMALL LETTER H
      420        420  000069   69             i      LATIN SMALL LETTER I
      421        421  000073   73             s      LATIN SMALL LETTER S
      422        422  000020   20                     SPACE
      423        423  000063   63             c      LATIN SMALL LETTER C
      424        424  000068   68             h      LATIN SMALL LETTER H
      425        425  000061   61             a      LATIN SMALL LETTER A
      426        426  000072   72             r      LATIN SMALL LETTER R
      427        427  000061   61             a      LATIN SMALL LETTER A
      428        428  000063   63             c      LATIN SMALL LETTER C
      429        429  000074   74             t      LATIN SMALL LETTER T
      430        430  000065   65             e      LATIN SMALL LETTER E
      431        431  000072   72             r      LATIN SMALL LETTER R
      432        432  000020   20                     SPACE
      433        433  000063   63             c      LATIN SMALL LETTER C
      434        434  000061   61             a      LATIN SMALL LETTER A
      435        435  000075   75             u      LATIN SMALL LETTER U
      436        436  000073   73             s      LATIN SMALL LETTER S
      437        437  000065   65             e      LATIN SMALL LETTER E
      438        438  000073   73             s      LATIN SMALL LETTER S
      439        439  000020   20                     SPACE
character  byte       UTF-32   encoded as     glyph   name
      440        440  000062   62             b      LATIN SMALL LETTER B
      441        441  000061   61             a      LATIN SMALL LETTER A
      442        442  000064   64             d      LATIN SMALL LETTER D
      443        443  00006E   6E             n      LATIN SMALL LETTER N
      444        444  000065   65             e      LATIN SMALL LETTER E
      445        445  000073   73             s      LATIN SMALL LETTER S
      446        446  000073   73             s      LATIN SMALL LETTER S
      447        447  00003A   3A             :      COLON
      448        448  00000D   0D                     CARRIAGE RETURN (CR)
      449        449  00000A   0A                     LINE FEED (LF)
      450        450  00003E   3E             >      GREATER-THAN SIGN
      451        451  000020   20                     SPACE
      452        452  000020   20                     SPACE
      453        453  000020   20                     SPACE
      454        454  000095   C2 95                  MESSAGE WAITING
      455        456  00000D   0D                     CARRIAGE RETURN (CR)
      456        457  00000A   0A                     LINE FEED (LF)

I was thinking that I could base64 the body and send it over to Thunderbird. But unfortunately as you can see, even if I did that, eventually you still wouldn't have the original unsanitised body. (I'm actually a little curious what email client the sender used. I tried git send-email and even Git sanitises texts.)

I also tested replying to such an email with invalid UTF-8 via EER, just to make sure it doesn't send EER host some JSON that Rust refuses to decode. And it turned out the moment Thunderbird cited the email in its editor, the text had already been sanitised, so EER doesn't need to worry about this (also means it's impossible for EER to get the original email).

So I'll go with from_utf8_lossy() as you mentioned. If things ever change I'm happy to touch up the implementation.

whorfin commented 2 years ago

Thank you very much for your diligence. To answer your curiosity, I admit my first thought was "I'll bet they used Outlook" because the formatting looked very Office-y I went back and checked one set of headers and yes that appears to be the case - I see lots of X-MS-Exchange headers, and references to outlook.com and outlook.office365.com domains.

UTF-8 seems like a safe and sane choice

Frederick888 commented 2 years ago

Feel free to give the artefacts of #68 a shot when the build finishes :)

whorfin commented 2 years ago

Did a quick test and, while it handles conversion to utf-8 differently than my text editor, "that's just Rust" - it certainly works and avoids the failure. 🎉