anymail / django-anymail

Django email backends and webhooks for Amazon SES, Brevo (Sendinblue), MailerSend, Mailgun, Mailjet, Postmark, Postal, Resend, SendGrid, SparkPost, Unisender Go and more
https://anymail.dev
BSD 3-Clause "New" or "Revised" License
1.65k stars 125 forks source link

Bad UTF-8 "To" header encoding? #369

Closed andresmrm closed 2 months ago

andresmrm commented 4 months ago

Hi!

I'm getting errors when trying to send messages when the "to" header has non-ASCII chars. The problem seems to happen when all these conditions are true at the same time:

  1. the readable name part of the recipient is big (e.g. "A very long and big name for this recipient" to@example.com)
  2. there is at least one non-ASCII char
  3. there is at least one "special char" (only seems to happen with comma or parenthesis)
  4. the special char isn't close to or between no-ASCII chars

I will use the related test to better explain:

Description Input Output Error?
Current test "Người nhận" <to@example.com> To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= <to@example.com> No
Long "Người nhận a very very long name" <to@example.com> To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long name <to@example.com> No
With comma "Người nhận a very very long, name" <to@example.com> To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long, name <to@example.com> YES
Comma between non-ASCII "Người nhận a very very long, náme" <to@example.com> To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1uIGEgdmVyeSB2ZXJ5IGxvbmcsIG7DoW1l?=\n <to@example.com> No
Comma near non-ASCII "Người nhận, name" <to@example.com> To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1uLCBuYW1l?= <to@example.com> No

So, if we have a UTF-8 encoded part and a special char, the special char must also be encoded, but currently this is only happening if the special char is close or between non-ASCII chars.

Commenting this line seems to encode the entire name in these cases, solving the problem. But I don't know if this has other unwanted consequences.

Example traceback:

File ~/.local/lib/python3.11/site-packages/anymail/backends/amazon_ses.py:78, in EmailBackend._send(self, message)
     76 def _send(self, message):
     77     if self.client:
---> 78         return super()._send(message)
     79     elif self.fail_silently:
     80         # (Probably missing boto3 credentials in open().)
     81         return False

File ~/.local/lib/python3.11/site-packages/anymail/backends/base.py:147, in AnymailBaseBackend._send(self, message)
    144     return False
    146 payload = self.build_message_payload(message, self.send_defaults)
--> 147 response = self.post_to_esp(payload, message)
    148 message.anymail_status.esp_response = response
    150 recipient_status = self.parse_recipient_status(response, payload, message)

File ~/.local/lib/python3.11/site-packages/anymail/backends/amazon_ses.py:114, in EmailBackend.post_to_esp(self, payload, message)
    110     response = client_send_api(**payload.params)
    111 except BOTO_BASE_ERRORS as err:
    112     # ClientError has a response attr with parsed json error response
    113     # (other errors don't)
--> 114     raise AnymailAPIError(
    115         str(err),
    116         backend=self,
    117         email_message=message,
    118         payload=payload,
    119         response=getattr(err, "response", None),
    120     ) from err
    121 return response

AnymailAPIError: An error occurred (BadRequestException) when calling the SendEmail operation: Local address contains control or whitespace
botocore.errorfactory.BadRequestException: An error occurred (BadRequestException) when calling the SendEmail operation: Local address contains control or whitespace
medmunds commented 4 months ago

Thanks for the report and the detailed analysis. I'm able to reproduce this and am investigating. [I hope you don't mind, I edited your report to format the input and output columns as code, because GitHub was hiding important characters essential to understanding the problem.]

It looks like you've uncovered a bug in Python's email package, related to incorrectly "folding" address headers that are too long, when using Content-Transfer-Encoding (CTE) 7bit, and if the headers need "encoded words" and also contain "special characters." I haven't been able to locate a Python bug report for this exact problem, but a similar issue with shorter address headers that didn't require folding (https://github.com/python/cpython/issues/81663) was fixed in Python 3.8. I'm guessing they missed the folding case.

Anymail's SES backend needs to use CTE 7bit (the line of code you identified), because Amazon SES doesn't officially support 8bit CTE, and using it can result in mojibake depending on what other SES options are enabled. (See the comments above that code and Anymail issue #115.)

I'll look into workarounds…

andresmrm commented 4 months ago

Thanks for the quick and complete response, and for improving the table in my post!

I'll adapt the test case so it can be reproduced without Anymail and report at CPython.

About the workarounds, it seemed hard to fix the issue without monkey patching, something I would like to avoid since sending email is a crucial part of my application. So, for now, I was just going to remove commas and parenthesis from the addresses...

medmunds commented 4 months ago

Hmm, looking into this some more, I think it's actually a Django bug. And there's a reasonable workaround Anymail could implement.

Python's email.message.EmailMessage class handles the header correctly. But I see the bug when using django.core.mail.EmailMessage:

import email.message
import email.policy
import django.core.mail

to = '"Người nhận a very very long, name" <to@example.com>'
policy = email.policy.default.clone(cte_type="7bit")

# Python's EmailMessage class doesn't exhibit bug
msg1 = email.message.EmailMessage()
msg1["To"] = to
print(msg1.as_bytes(policy=policy).decode("ascii"))
# To:
#   =?utf-8?b?TmfGsOG7nWkgbmjhuq1uIGEgdmVyeSB2ZXJ5IGxvbmcs?= name <to@example.com>

# Django's EmailMessage class has bug
msg2 = django.core.mail.EmailMessage(to=[to]).message()
msg2.policy = policy
print(msg2.as_bytes().decode("ascii"))
# [... other headers ...]
# To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long, name <to@example.com>

Django's EmailMessage.message() builds a Python legacy compatibility email.message.Message (wrapped as a Django SafeMIMEText). I believe the problem starts when django.core.mail.message.sanitize_address calls Header().encode() without specifying a maxlinelen for the header. This results in a single, very long encoded word for the entire display name. Python tries to refold it while serializing the message, but bugs in the legacy code introduce the error. (I think.) (So technically, this may be a Python bug, but it's in legacy code that won't be fixed. Also, it's probably related to django#31784.)

I think Anymail should just stop using Django's EmailMessage.message(), and instead build its own modern email.message.EmailMessage object directly from the Django EmailMessage. There are a lot of problems in the old Python Message code that were fixed in Python ~3.3-3.5's email revamp. (I also think Django should update to the newer class, too, but that's a different discussion.)

Before switching, we'll need to investigate whether there's anything important in Django's SafeMIME classes we'd be losing or need to copy. I suspect a lot of SafeMIME is there solely to work around bugs and security concerns in Python's legacy Message code—issues that don't apply to Python's modern EmailMessage. But I haven't really looked through that part of Django's mail package in a while.

andresmrm commented 4 months ago

Last Saturday I couldn't reproduce the bug using only Python's email pkg. So I went back trying to isolate it, and it seems to be in sanitize_address (as you said). But I hadn't the time to finish my analysis and report back here...

I investigated a bit more now and it really seems a conflict between Django and Python behavior, as you said. Django's forbid_multi_line_headers (sanitize_address is called by it) already does the encoding. This:

"A véry long name with non-ASCII char and, comma"

Becomes this:

=?utf-8?q?A_v=C3=A9ry_long_name_with_non-ASCII_char_and=2C_comma?=

But later, generator.BytesGenerator.flatten reencodes it wrongly:

A =?utf-8?q?v=C3=A9ry?= long name with non-ASCII char and, comma

If we comment the forbid_multi_line_headers line, letting the flatten handle the original string, it gives this:

A =?utf-8?q?v=C3=A9ry_long_name_with_non-ASCII_char_and=2C?= comma

What seems valid. Even if I really increase the size of the string, so the non-ASCII and the comma stay in different lines, it still works:

A =?utf-8?q?v=C3=A9ry?= long long long long long long long long long long\n long long long long long long long long long long long long long long long\n long long long long name with non-ASCII char =?utf-8?q?and=2C?= comma

So the problem only happens when both functions try to encode the string. That said, they encode it a bit differently: the first encodes the entire name, the second only the required parts. Not sure if this is relevant.

Should I report it both at Python and Django?

medmunds commented 4 months ago

Should I report it both at Python and Django?

I'd report it to Django only. I'm pretty sure the problem only occurs with Python email's legacy email.message.Message, which uses a different folding algorithm (via email.policy.compat32) than modern email.message.EmailMessage (email.policy.default) uses.

My understanding is the Compat32 legacy layer is there specifically to replicate Python 2's email behavior (including any bugs), so there's not much point in reporting bugs against it.

andresmrm commented 4 months ago

Done! Feel free to add more info there.

medmunds commented 2 months ago

I've taken another look at this, and now come to the conclusion that it is actually an Anymail bug. And the fix you originally identified is correct. (I'm adding some extra tests to make sure.)

Here's what my example above (and Anymail) is doing wrong:

import email.policy
import django.core.mail

to = '"Người nhận a very very long, name" <to@example.com>'
mime_message = django.core.mail.EmailMessage(to=[to]).message()
print(mime_message.policy)
# Compat32()

mime_message.policy = email.policy.default.clone(cte_type="7bit")  # <-- BUG!

Django's .message() returns a legacy Python email.message.Message object, and trying to overwrite its legacy Compat32 policy with a modern one causes an unholy mix of new and old header encoding code that wasn't intended to work together.

That attempt to override policy is unnecessary in Anymail's Amazon SES backend, because Anymail has already reencoded all parts as 7bit, so the only thing left to worry about are the root message headers. And they should already be 7bit clean. So—as you originally noted—removing the line that sets policy will fix this.

Sorry for my confusion. Fix follows.


[More details, and leaving some notes for the future…]

Incidentally, I also tried using a generator-time policy override, which seems to be a documented approach. Unfortunately, it still exhibits the bug—probably because it's still mixing legacy and modern email header code under the hood:

import email.generator
import email.policy
import io
import django.core.mail

to = '"Người nhận a very very long, name" <to@example.com>'
mime_message = django.core.mail.EmailMessage(to=[to]).message()

fp = io.BytesIO()
g = email.generator.BytesGenerator(
    fp, 
    policy=email.policy.default.clone(cte_type="7bit"),
)
g.flatten(mime_message)
print(fp.getvalue().decode("ascii"))
# [... other headers ...]
# To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1u?= a very very long, name <to@example.com>

This probably should be considered a bug in the Python email package, though I suspect the appropriate fix might be to raise an error about trying to mix incompatible header registries (or something like that).

Finally, if you change that last example to use a 7bit Compat32 policy, it does solve this problem—but runs into other issues if you have (e.g.,) non-ASCII content in the message body:

...
g = email.generator.BytesGenerator(
    fp, 
    policy=mime_message.policy.clone(cte_type="7bit"),  # Compat32
)
...
# To: =?utf-8?b?TmfGsOG7nWkgbmjhuq1uIGEgdmVyeSB2ZXJ5IGxvbmcsIG5hbWU=?=
#  <to@example.com>
andresmrm commented 2 months ago

Thanks for looking further into this issue. =)