`UnicodeEncodeError` when saving non-ASCII message

The-Compiler commented 1 month ago

When creating a message with hellö in it and attempting to send it, Dodo fails with:

Traceback (most recent call last):
  File "/home/florian/proj/dodo/dodo/compose.py", line 445, in run
    m = mailbox.MaildirMessage(eml.as_bytes())
                               ^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/email/message.py", line 208, in as_bytes
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python3.12/email/generator.py", line 117, in flatten
    self._write(msg)
  File "/usr/lib/python3.12/email/generator.py", line 182, in _write
    self._dispatch(msg)
  File "/usr/lib/python3.12/email/generator.py", line 219, in _dispatch
    meth(msg)
  File "/usr/lib/python3.12/email/generator.py", line 446, in _handle_text
    super(BytesGenerator,self)._handle_text(msg)
  File "/usr/lib/python3.12/email/generator.py", line 263, in _handle_text
    self._write_lines(payload)
  File "/usr/lib/python3.12/email/generator.py", line 156, in _write_lines
    self.write(line)
  File "/usr/lib/python3.12/email/generator.py", line 420, in write
    self._fp.write(s.encode('ascii', 'surrogateescape'))
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'ascii' codec can't encode characters in position 2-3: ordinal not in range(128)

This is a regression as far as I know, will try to bisect later.

I seem to be able to trivially reproduce this in Python:

import email
email.message_from_string("\n\nhellö").as_bytes()

but it's not quite clear to me why this happens. From what I understand, message_from_string might be the wrong thing to use here entirely, since that already expects a properly encoded .eml-like file?

The-Compiler commented 1 month ago

61f8a1d2865aa9c2fe9889a8b1aee82bc82762a0 is the first bad commit

commit 61f8a1d2865aa9c2fe9889a8b1aee82bc82762a0 (HEAD)
Author: Simon Chopin <chopin.simon@gmail.com>
Date:   Tue Apr 9 15:16:56 2024 +0200

    compose: directly load the draft into the final EmailMessage object

    This simplifies the code as we don't have to copy the data over from the
    temporary Message object. The header copying in particular was failing
    when using multiline headers as allowed by RFC 822.

    I'm guessing we *could* get into trouble with encoding though? Since I'm
    using UTF-8 everywhere it works fine for me.

cc @laarmen

laarmen commented 1 month ago

Ah, I see why I didn't encounter the issue: my sent_dir is None.

You have to give me props for prediction issues with encoding though :grin: I'll have a look.

laarmen commented 1 month ago

My conclusion is that the state of the Python stack is... unfortunate.

If you're using the new shiny APIs of EmailParser such as set_content (which is a wrapper around policy.content_manager.set_content), it will default to UTF-8 encoding and set everything up in the msg internals to have it work properly, including converting the payload itself to UTF-8 (as in "transform the non-ASCII parts to their UTF-8 representation and then just use that as unicode codepoints" which is extremely weird). That's what we want.

However, we're using email.message_from_string, which boils down to email.feedparser.FeedParser. Sadly, even though it takes a policy object, it doesn't use its content manager to set the content of the email, but rather uses the old API set_payload which is much dumber.

Several ways forward: 1/ Go backwards and revert my patch (which would be bad for multiline header handling) 2/ use eml.set_charset('utf-8') which is kinda deprecated and is definitely an old-style API. It will result in a base64-encoded email, but at least you'll be able to store it in your maildir. Small victories, right? 3/ Manually set the encoding via eml.set_param("charset", "utf-8"), set the Content-Transfer-Encoding to '8bit', and do eml.as_string().encode('utf8') when storing it (completely untested) 4/ Patch CPython's FeedParser to use the content manager when available to set the payload, which sounds like fun but is more of a long-term thing.

hbog commented 3 weeks ago

I propose to construct the mail as shown below, which results in a properly utf8 encoded byte representation as well as a clean 7bit version. (which will be base64 encoded if non-ascii characters are present)

I believe it is very similar to how it was before, but with a slightly more elegant treatment of the headers and avoiding deprecated methods.

Basically we use the message_from_string method to parse the headers and separate them from the body. These parts are then used to construct a clean new message.

import email
import email.utils
import email.policy
import mailbox

message = '''From: é mail <email@example.org>
To: ça va: <cava@example.org>
Subject: test ô utf8
Received: dual header 1
Received: dual header 2

Frédérique definiëren
'''

eml = email.message_from_string(message, policy=email.policy.EmailPolicy(utf8=False))

em = email.message.EmailMessage(policy=eml.policy.clone())

# Transfer the headers. This also works with duplicate keys
for h, v in eml.items():
  em.add_header(h,v)

# Delete the headers from the original message
# Aim is the use the as_string() method on eml in order to obtain a properly utf8 decoded
# string from the message content. (get_content() does not produce that)
for k in eml.keys():
  del(eml[k])

# Transfer the content from the original message
# Strip the leading line feed
em.set_content(eml.as_string().lstrip('\n'), cte='8bit')

print("\n- bytes representation, for example for storing on disk in a human readable form:")
print(em.as_bytes())
print("\n- decode it if you want it in as a string:")
print(em.as_bytes().decode())

print("\n- 7bit representation can be obtained using .as_string():")
print(em.as_string())

m = mailbox.MaildirMessage(em.as_bytes())
key = mailbox.Maildir('test/mail').add(m)

This results in

$ python testmail.py 

- bytes representation, for example for storing on disk in a human readable form:
b'From: =?utf-8?q?=C3=A9?= mail <email@example.org>\nTo: =?utf-8?q?=C3=A7a?= va: cava@example.org;\nSubject: test =?utf-8?q?=C3=B4?= utf8\nReceived: dual header 1\nReceived: dual header 2\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: 8bit\nMIME-Version: 1.0\n\nFr\xc3\xa9d\xc3\xa9rique defini\xc3\xabren\n'

- decode it if you want it in as a string:
From: =?utf-8?q?=C3=A9?= mail <email@example.org>
To: =?utf-8?q?=C3=A7a?= va: cava@example.org;
Subject: test =?utf-8?q?=C3=B4?= utf8
Received: dual header 1
Received: dual header 2
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0

Frédérique definiëren

- 7bit representation can be obtained using .as_string():
From: =?utf-8?q?=C3=A9?= mail <email@example.org>
To: =?utf-8?q?=C3=A7a?= va: cava@example.org;
Subject: test =?utf-8?q?=C3=B4?= utf8
Received: dual header 1
Received: dual header 2
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
MIME-Version: 1.0

RnLDqWTDqXJpcXVlIGRlZmluacOrcmVuCg==

akissinger / dodo

`UnicodeEncodeError` when saving non-ASCII message #73