System.Net.Mail does not encode headers on code point boundaries

atheken commented 7 years ago

For Unicode headers to be transmitted over SMTP, they must be encoded as described in RFC 2047. The encoded form is known as an "encoded-word"; its payload uses either "B" encoding (base64) or "Q" encoding (similar to quoted-printable).

Header lines are also limited in length: RFC 2047 limits each encoded-word to 75 characters and each header line containing one to 76 characters. Therefore, when encoding Unicode headers, it is typically necessary to fold them onto multiple lines.

A common example might be something like the following:

Subject: =?ISO-8859-1?B?SWYgeW91IGNhbiByZWFkIHRoaXMgeW8=?=
 =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=

Note that the two encoded words in the above example use different character sets. This is not particularly common, but it is definitely legal. It also shows why the binary data for an individual code point should never be split between two encoded words: RFC 2047 requires each encoded-word to represent an integral number of characters, so that each one can be decoded on its own.

A more practical example is emoji. In UTF-8, an individual code point occupies 1 to 4 bytes, and many emoji occupy 4. Splitting an emoji's bytes across two encoded words can result in the Unicode replacement character (�) appearing in some mail clients, because each fragment is decoded separately as an incomplete multi-byte sequence. In other cases, spurious spaces are included.
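To illustrate the failure mode, here is a minimal sketch (the class and method names are hypothetical, not part of System.Net.Mail). It splits the UTF-8 bytes of a string at an arbitrary offset and decodes each part on its own, roughly as a mail client decoding two encoded words separately would. It relies on `Encoding.UTF8` substituting U+FFFD for bytes it cannot decode.

```csharp
using System;
using System.Text;

public static class SplitCodePointDemo
{
    // Hypothetical helper: encodes text as UTF-8, cuts the byte array at
    // splitAt, decodes both halves independently, and concatenates the results.
    public static string DecodeHalvesSeparately(string text, int splitAt)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);

        // Each half is decoded with no knowledge of the other, so a code point
        // whose bytes straddle the cut cannot be reassembled.
        string first = Encoding.UTF8.GetString(bytes, 0, splitAt);
        string second = Encoding.UTF8.GetString(bytes, splitAt, bytes.Length - splitAt);

        return first + second;
    }
}
```

Splitting "😍" at byte offset 2 yields replacement characters instead of the emoji, while splitting "😍😍" at offset 4 (a code point boundary) round-trips intact.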

The current Base64Stream (and, it would seem, QuotedPrintableStream) does not account for the byte boundaries of the encoded code points:

https://github.com/dotnet/corefx/blob/master/src/System.Net.Mail/src/System/Net/Base64Stream.cs#L230-L246

At this level, the streams are (mostly) unaware of any text encoding semantics, and just write as many bytes as possible on each line.

Instead, these encoding streams need to account for code-point byte boundaries, and fold the line preemptively, if only part of a code point can be included in the line before the next line fold would occur.
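One way to do this can be sketched outside the stream classes: compute how many payload bytes fit in one encoded word, then start a new word whenever the next code point would not fit whole. This is an illustration only, not the corefx implementation or a proposed patch. `EncodedWordFolder` and `Encode` are hypothetical names, and the sketch assumes UTF-8 with "B" encoding and the 75-character encoded-word limit. It also ignores the space that the header name takes up on the first line.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class EncodedWordFolder
{
    private const string Prefix = "=?utf-8?B?";
    private const string Suffix = "?=";

    // Hypothetical helper: returns one encoded word per folded line, never
    // splitting the UTF-8 bytes of a single code point across two words.
    public static List<string> Encode(string value, int maxWordLength = 75)
    {
        // Base64 emits 4 characters for every 3 input bytes, so round the
        // available space down to a whole number of 4-character groups.
        int maxBase64Chars = (maxWordLength - Prefix.Length - Suffix.Length) / 4 * 4;
        int maxBytes = maxBase64Chars / 4 * 3;

        var words = new List<string>();
        var buffer = new List<byte>();

        for (int i = 0; i < value.Length; )
        {
            // A code point outside the BMP occupies two UTF-16 chars.
            int charCount = char.IsSurrogatePair(value, i) ? 2 : 1;
            byte[] codePoint = Encoding.UTF8.GetBytes(value.Substring(i, charCount));

            // Fold preemptively if this code point would not fit whole.
            if (buffer.Count > 0 && buffer.Count + codePoint.Length > maxBytes)
            {
                words.Add(Prefix + Convert.ToBase64String(buffer.ToArray()) + Suffix);
                buffer.Clear();
            }

            buffer.AddRange(codePoint);
            i += charCount;
        }

        if (buffer.Count > 0)
        {
            words.Add(Prefix + Convert.ToBase64String(buffer.ToArray()) + Suffix);
        }

        return words;
    }
}
```

The resulting words would then be joined with a CRLF followed by a space (`string.Join("\r\n ", words)`) to form the folded header value. With the default limit, each word carries at most 45 bytes of UTF-8: 75 minus the 12 characters of `=?utf-8?B?` and `?=` leaves 63, which rounds down to 60 base64 characters, or 45 bytes.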

I'm a little bit concerned that the stream is the wrong level of abstraction for this type of handling. Perhaps including some sort of "look-ahead" on the WriteState is a better option for determining the next smallest block of bytes that can be written.

I have a fork of Corefx and will be fixing it on my own. Please let me know if this is something the team is interested in fixing, and I can provide a patch.

davidsh commented 7 years ago

Do we know if this works in .NET Framework? In general, the code was ported from .NET Framework to .NET Core.

atheken commented 7 years ago

It is also broken in .NET Framework 4.6.2 (I didn't test earlier versions):

using System.Net;
using System.Net.Mail;

namespace MinimalSendingReproduction
{
    class Program
    {
        static void Main(string[] args)
        {
            // Placeholders: substitute a real account before running.
            var password = "<PASSWORD>";
            var sender = "<SENDER>";

            // MailMessage is IDisposable, so dispose it along with the client.
            using (var message = new MailMessage(sender, sender))
            using (var s = new SmtpClient("smtp.gmail.com", 587))
            {
                // A non-ASCII subject long enough to be folded into two encoded words.
                message.Subject = "An example  : 😍🍕📩😍🍕📩😍🍕📩😍🍕📩";
                message.Body = "Hello, this is an example body";

                s.EnableSsl = true;
                s.Credentials = new NetworkCredential(sender, password);
                s.Send(message);
            }
        }
    }
}

Yields this Subject:

Subject: =?utf-8?B?QW4gZXhhbXBsZSAgOiDwn5iN8J+NlfCfk6nwn5iN8J+NlfCf?=
 =?utf-8?B?k6nwn5iN8J+NlfCfk6nwn5iN8J+NlfCfk6k=?=

Which Gmail "fixes", but other email clients don't handle this gracefully.

You can see an example of how this breaks in this tool: http://dogmamix.com/MimeHeadersDecoder/

Then, compare that to this header, which decodes correctly (same content, but with the fold that split a character removed):

Subject: =?utf-8?B?QW4gZXhhbXBsZSAgOiDwn5iN8J+NlfCfk6nwn5iN8J+NlfCfk6nwn5iN8J+NlfCfk6nwn5iN8J+NlfCfk6k=?=

davidsh commented 6 years ago

I tried using the emoji example above for the subject. I sent a message from Outlook and looked at the MIME encoding.

It seems to follow the same rules as .NET Framework.

Subject: =?utf-8?B?QW4gZXhhbXBsZSAgOiDwn5iN8J+NlfCfk6nwn5iN8J+NlfCfk6nwn5iN?=
 =?utf-8?B?8J+NlfCfk6nwn5iN8J+NlfCfk6k=?=

So, perhaps Outlook is also encoding this incorrectly. The decoder tool at http://dogmamix.com/MimeHeadersDecoder/ shows illegal characters. But the visual display in the Outlook email client looks correct, so it must be fixing things up.

atheken commented 6 years ago

@davidsh The issue is the order in which the encoded-word content gets decoded. If the mail reader concatenates the base64 content before converting it to UTF-8, the text will appear correct, but I think that is a broken implementation, or at a minimum a fairly brittle one that misses some common edge cases. Each encoded word should be decoded separately. (It's entirely reasonable to mix encoded words with unencoded ASCII atoms in a header, or to have encoded words that use different character sets, so combining the binary data first is incorrect.)
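A strict check of the kind described here can be sketched as follows. It decodes every "B"-encoded UTF-8 word in a header on its own and reports whether each payload is valid UTF-8 by itself. The names `EncodedWordChecker` and `EachWordIsValidUtf8` are hypothetical, and the sketch deliberately ignores other charsets, "Q" encoding, and malformed base64.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

public static class EncodedWordChecker
{
    // Matches only UTF-8 "B" encoded words: =?utf-8?B?<base64>?=
    private static readonly Regex Utf8BWord =
        new Regex(@"=\?utf-8\?B\?([^?]*)\?=", RegexOptions.IgnoreCase);

    // Second constructor argument (throwOnInvalidBytes) makes the decoder
    // throw instead of silently substituting U+FFFD.
    private static readonly Encoding StrictUtf8 = new UTF8Encoding(false, true);

    // Hypothetical helper: true only if every encoded word decodes to valid
    // UTF-8 without borrowing bytes from its neighbours.
    public static bool EachWordIsValidUtf8(string header)
    {
        foreach (Match match in Utf8BWord.Matches(header))
        {
            byte[] payload = Convert.FromBase64String(match.Groups[1].Value);
            try
            {
                StrictUtf8.GetString(payload);
            }
            catch (DecoderFallbackException)
            {
                // This word starts or ends partway through a code point.
                return false;
            }
        }
        return true;
    }
}
```

Applied to the folded Subject that .NET Framework produced earlier in this thread, the check fails, because the first word ends partway through an emoji. Applied to the unfolded version of the same header, it passes.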

logiclrd commented 4 years ago

I just ran into this same bug in the course of my work. An e-mail with subject line:

🚨 Service Down on IQ-RGINTW025 🚨

...ends up with the line split in the middle of the second 🚨 character. Some mail implementations happen to decode this correctly, some definitely don't.

Postel's robustness principle states:

...: be conservative in what you do, be liberal in what you accept from others

Based on this principle, I think it is entirely reasonable that there exist clients that can reassemble characters split across separately-encoded byte sequences -- but .NET's implementation should not be making them do so in the first place.

dotnet/runtime issue #1485