Import-Export tool (and Proton-Bridge) badly parses Content-Type and Content-Disposition filename is encode by RFC2047

exander77 commented 3 years ago

So I finally tracked down the invalid media parameter issue. At least one instance is caused by headers like these:

Content-Type: text/plain;
 name==?UTF-8?B?dGVzdC50eHQ=?=
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename==?UTF-8?B?dGVzdC50eHQ=?=

This is completely against RFC2047:

   + An 'encoded-word' MUST NOT be used in parameter of a MIME
     Content-Type or Content-Disposition field, or in any structured
     field body except within a 'comment' or 'phrase'.

But this affects, for example, every message sent from Tutanota that includes an attachment. So currently people can't migrate from Tutanota to ProtonMail at all, which is not really good from a business standpoint, as Tutanota is one of ProtonMail's biggest competitors.

@jameshoulahan Any thoughts?

bartbutler commented 3 years ago

This is the kind of thing that we typically just try to make work anyway if we can, despite it technically not being RFC compliant: the whole "be relaxed in what you accept and strict in what you send" mentality.

jameshoulahan commented 3 years ago

Yep, I'm able to reproduce with a message exported from tutanota.

The error happens inside the go-message library's function here. The method is private to the library so any fixes would have to be upstreamed or made in a forked version of the library.

We might be able to do some preprocessing on the parsed header values, decoding any encoded words in things like attachment filenames before making calls to the library methods.

jameshoulahan commented 3 years ago

Preprocessing is also difficult. The Go function mime.ParseMediaType(...) is unable to handle these encoded media type parameters at all and gives up when it sees them. It still returns the media type itself (in this case, application/pdf and attachment) despite the error, but doesn't return any of the media type parameters (in this case, name and filename), meaning we can't go and decode the encoded filename ourselves.
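For readers who want to see the behaviour described above, here is a minimal sketch using only the Go standard library. The helper name parseHeaderValue is made up for illustration, and the input is the unfolded Content-Disposition header from the original report; this is not code from Bridge or go-message.

```go
package main

import (
	"errors"
	"fmt"
	"mime"
)

// parseHeaderValue is a hypothetical wrapper around mime.ParseMediaType.
// It reports whether parsing failed specifically because of an invalid
// media parameter, which is the failure discussed in this thread.
func parseHeaderValue(v string) (string, map[string]string, bool) {
	mediaType, params, err := mime.ParseMediaType(v)
	return mediaType, params, errors.Is(err, mime.ErrInvalidMediaParameter)
}

func main() {
	// Unfolded form of the Content-Disposition header from the report.
	mt, params, invalid := parseHeaderValue("attachment; filename==?UTF-8?B?dGVzdC50eHQ=?=")

	// The media type survives, but the parameters (and so the filename) do not.
	fmt.Printf("media type: %q\n", mt)
	fmt.Printf("number of params: %d\n", len(params))
	fmt.Printf("invalid media parameter: %v\n", invalid)
}
```

Because the media type is still returned alongside the error, a caller can at least tell that the part is an attachment, even though the name has to come from somewhere else.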

Options:

hacky regex based detection of the encoded words to preprocess them before giving them to mime.ParseMediaType(...)
use a fallback filename (since mime.ParseMediaType(...) returns the media type despite the error)
write our own media type parser and only use that, never calling go-message's methods to get the content type/disposition

exander77 commented 3 years ago

@bartbutler @jameshoulahan

I am processing around 150k more emails and I have found more of this. So this is definitely out there. Probably some services and some email clients do this. The server probably handles it already as I think I have received emails from Tutanota without any problems.

Yes, it is deep in the libs and it is not easily fixable there either, as you probably want to use the WordDecoder you already have.

This is related to:

Unquoted boundary: https://github.com/ProtonMail/proton-bridge/issues/121,
Duplicate charset in the Content-Type of an attachment: there is a workaround for this in the message header, but it is not applied to attachments, so charset=binary; charset=UTF-8; causes a duplicate parameter name error when it appears in an attachment header,
The changeEncodingAndKeepLastParamDefinition hack, which exists because ParseMediaType from the mime package doesn't support RFC2231 for non-ASCII / UTF-8 encodings, so we have to pre-parse it.

So maybe writing a parser for Content-Type and Content-Disposition would solve all of these.
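To make the idea concrete, here is a very rough sketch of what a forgiving parser for these two headers might look like. Everything in it (parseLenient, splitParams, the sample inputs) is hypothetical and written for this comment, not taken from Bridge. It keeps the last definition of a duplicated parameter, accepts unquoted values that contain "=", and decodes RFC 2047 encoded words in values. It does not handle RFC2231 continuations or filename*= values, and it does not unescape backslashes inside quoted strings.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// splitParams splits a header value on ";" while ignoring semicolons
// that appear inside double quotes.
func splitParams(s string) []string {
	var parts []string
	start, inQuotes := 0, false
	for i := 0; i < len(s); i++ {
		switch {
		case s[i] == '\\' && inQuotes:
			i++ // skip the escaped character
		case s[i] == '"':
			inQuotes = !inQuotes
		case s[i] == ';' && !inQuotes:
			parts = append(parts, s[start:i])
			start = i + 1
		}
	}
	return append(parts, s[start:])
}

// parseLenient is a hypothetical, forgiving replacement for
// mime.ParseMediaType. It never returns an error; malformed segments
// are skipped instead.
func parseLenient(value string) (string, map[string]string) {
	parts := splitParams(value)
	mediaType := strings.ToLower(strings.TrimSpace(parts[0]))
	params := make(map[string]string)
	dec := new(mime.WordDecoder)
	for _, part := range parts[1:] {
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			continue // empty segment (e.g. trailing ";") or no value
		}
		key := strings.ToLower(strings.TrimSpace(kv[0]))
		val := strings.TrimSpace(kv[1])
		if len(val) >= 2 && val[0] == '"' && val[len(val)-1] == '"' {
			val = val[1 : len(val)-1] // strip surrounding quotes
		}
		if decoded, err := dec.DecodeHeader(val); err == nil {
			val = decoded // decode any RFC 2047 encoded words
		}
		if key != "" {
			params[key] = val // a repeated key overwrites the earlier value
		}
	}
	return mediaType, params
}

func main() {
	for _, h := range []string{
		"text/plain; name==?UTF-8?B?dGVzdC50eHQ=?=",
		"text/plain; charset=binary; charset=UTF-8;",
		"multipart/mixed; boundary=----=_Part_1", // made-up unquoted boundary
	} {
		mediaType, params := parseLenient(h)
		fmt.Printf("%q %q\n", mediaType, params)
	}
}
```

The trade-off is that a parser this permissive will also accept input that is plainly broken, so it is probably better suited as a fallback when the strict parser fails than as a full replacement.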

andrzejsza commented 3 years ago

With the new version of Bridge (1.8.5), the majority of message parsing is delegated to the backend, which now handles this properly. As for the I-E app, we won't be rewriting the parser at this stage; Import Assistant will do the job.

exander77 commented 3 years ago

@andrzejsza Should I try to re-migrate my e-mails with Import Assistant?

andrzejsza commented 3 years ago

Yes, I'd suggest you do. Bridge now uses, for the most part, the same parser as the IA (for imports).

jameshoulahan commented 3 years ago

Just to clarify what @andrzejsza means: when importing messages, Bridge now simply iterates through the MIME structure, encrypting each body in place, before handing it over to the same server-side parser that is used by the Import Assistant and to parse incoming mail. So the promise of E2EE is still kept, as all bodies are encrypted locally. The server-side parser is in general much more forgiving when it comes to weird edge cases (like the one we have here, with RFC2047-encoded attachment filenames).

ProtonMail / proton-bridge

Import-Export tool (and Proton-Bridge) badly parses Content-Type and Content-Disposition filename is encode by RFC2047 #126