Open The-Compiler opened 1 month ago
61f8a1d2865aa9c2fe9889a8b1aee82bc82762a0 is the first bad commit
commit 61f8a1d2865aa9c2fe9889a8b1aee82bc82762a0 (HEAD)
Author: Simon Chopin <chopin.simon@gmail.com>
Date: Tue Apr 9 15:16:56 2024 +0200
compose: directly load the draft into the final EmailMessage object
This simplifies the code as we don't have to copy the data over from the
temporary Message object. The header copying in particular was failing
when using multiline headers as allowed by RFC 822.
I'm guessing we *could* get into trouble with encoding though? Since I'm
using UTF-8 everywhere it works fine for me.
cc @laarmen
Ah, I see why I didn't encounter the issue: my sent_dir is None.
You have to give me props for prediction issues with encoding though :grin: I'll have a look.
My conclusion is that the state of the Python stack is... unfortunate.
If you're using the new shiny APIs of EmailParser
such as set_content
(which is a wrapper around policy.content_manager.set_content
), it will default to UTF-8 encoding and set everything up in the msg internals to have it work properly, including converting the payload itself to UTF-8 (as in "transform the non-ASCII parts to their UTF-8 representation and then just use that as unicode codepoints" which is extremely weird). That's what we want.
However, we're using email.message_from_string
, which boils down to email.feedparser.FeedParser. Sadly, even though it takes a policy object, it doesn't use its content manager to set the content of the email, but rather uses the old API set_payload
which is much dumber.
Several ways forward:
1/ Go backwards and revert my patch (which would be bad for multiline header handling)
2/ use eml.set_charset('utf-8')
which is kinda deprecated and is definitely an old-style API. It will result in a base64-encoded email, but at least you'll be able to store it in your maildir. Small victories, right?
3/ Manually set the encoding via eml.set_param("charset", "utf-8")
, set the Content-Transfer-Encoding to '8bit', and do eml.as_string().encode('utf8')
when storing it (completely untested)
4/ Patch CPython's FeedParser to use the content manager when available to set the payload, which sounds like fun but is more of a long-term thing.
I propose to construct the mail as shown below, which results in a properly utf8 encoded byte representation as well as a clean 7bit version. (which will be base64 encoded if non-ascii characters are present)
I believe it is very similar to how it was before, but with a slightly more elegant treatment of the headers and avoiding deprecated methods.
Basically we use the message_from_string method to parse the headers and separate them from the body. These parts are then used to construct a clean new message.
import email
import email.utils
import email.policy
import mailbox
message = '''From: é mail <email@example.org>
To: ça va: <cava@example.org>
Subject: test ô utf8
Received: dual header 1
Received: dual header 2
Frédérique definiëren
'''
eml = email.message_from_string(message, policy=email.policy.EmailPolicy(utf8=False))
em = email.message.EmailMessage(policy=eml.policy.clone())
# Transfer the headers. This also works with duplicate keys
for h, v in eml.items():
em.add_header(h,v)
# Delete the headers from the original message
# Aim is the use the as_string() method on eml in order to obtain a properly utf8 decoded
# string from the message content. (get_content() does not produce that)
for k in eml.keys():
del(eml[k])
# Transfer the content from the original message
# Strip the leading line feed
em.set_content(eml.as_string().lstrip('\n'), cte='8bit')
print("\n- bytes representation, for example for storing on disk in a human readable form:")
print(em.as_bytes())
print("\n- decode it if you want it in as a string:")
print(em.as_bytes().decode())
print("\n- 7bit representation can be obtained using .as_string():")
print(em.as_string())
m = mailbox.MaildirMessage(em.as_bytes())
key = mailbox.Maildir('test/mail').add(m)
This results in
$ python testmail.py
- bytes representation, for example for storing on disk in a human readable form:
b'From: =?utf-8?q?=C3=A9?= mail <email@example.org>\nTo: =?utf-8?q?=C3=A7a?= va: cava@example.org;\nSubject: test =?utf-8?q?=C3=B4?= utf8\nReceived: dual header 1\nReceived: dual header 2\nContent-Type: text/plain; charset="utf-8"\nContent-Transfer-Encoding: 8bit\nMIME-Version: 1.0\n\nFr\xc3\xa9d\xc3\xa9rique defini\xc3\xabren\n'
- decode it if you want it in as a string:
From: =?utf-8?q?=C3=A9?= mail <email@example.org>
To: =?utf-8?q?=C3=A7a?= va: cava@example.org;
Subject: test =?utf-8?q?=C3=B4?= utf8
Received: dual header 1
Received: dual header 2
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
Frédérique definiëren
- 7bit representation can be obtained using .as_string():
From: =?utf-8?q?=C3=A9?= mail <email@example.org>
To: =?utf-8?q?=C3=A7a?= va: cava@example.org;
Subject: test =?utf-8?q?=C3=B4?= utf8
Received: dual header 1
Received: dual header 2
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
MIME-Version: 1.0
RnLDqWTDqXJpcXVlIGRlZmluacOrcmVuCg==
When creating a message with
hellö
in it and attempting to send it, Dodo fails with:This is a regression as far as I know, will try to bisect later.
I seem to be able to trivially reproduce this in Python:
but it's not quite clear to me why this happens. From what I understand,
message_from_string
might be the wrong thing to use here entirely, since that already expects a properly encoded .eml-like file?