Import-Export tool (and Proton-Bridge) badly parses Content-Type boundary with special characters like ? or = and fails on such multiplart messages

exander77 commented 3 years ago

So I finally get to debugging the biggest problem in multipart parsing.

Consider this example produced by NODEMAILER:

Content-Type: multipart/alternative; boundary=----NODEMAILER-?=_1-1327405538300

The boundary satisfies RFC1521:

   boundary := 0*69<bchars> bcharsnospace

   bchars := bcharsnospace / " "

   bcharsnospace :=    DIGIT / ALPHA / "'" / "(" / ")" / "+" /"_"
                 / "," / "-" / "." / "/" / ":" / "=" / "?"

   Since the hyphen character ("-") is represented as itself in the
   Quoted-Printable encoding, care must be taken, when encapsulating a
   quoted-printable encoded body in a multipart entity, to ensure that
   the encapsulation boundary does not appear anywhere in the encoded
   body.  (A good strategy is to choose a boundary that includes a
   character sequence such as "=_" which can never appear in a quoted-
   printable body.  See the definition of multipart messages later in
   this document.)

Yet it is not properly parsed. It fails in mime.ParseMediaType.

But:

   WARNING TO IMPLEMENTORS: The grammar for parameters on the Content-
   type field is such that it is often necessary to enclose the
   boundaries in quotes on the Content-type line.  This is not always
   necessary, but never hurts.  Implementors should be sure to study the
   grammar carefully in order to avoid producing illegal Content-type
   fields. Thus, a typical multipart Content-Type header field might
   look like this:

                 Content-Type: multipart/mixed;
                      boundary=gc0p4Jq0M2Yt08jU534c0p

   But the following is illegal:

                 Content-Type: multipart/mixed;
                      boundary=gc0p4Jq0M:2Yt08jU534c0p

   (because of the colon) and must instead be represented as

                 Content-Type: multipart/mixed;
                      boundary="gc0p4Jq0M:2Yt08jU534c0p"

RFC1521 states this for Content-Type parameters:

     parameter := attribute "=" value

     attribute := token   ; case-insensitive

     value := token / quoted-string

     token  :=  1*<any (ASCII) CHAR except SPACE, CTLs,
                   or tspecials>

     tspecials :=  "(" / ")" / "<" / ">" / "@"
                /  "," / ";" / ":" / "\" / <">
                /  "/" / "[" / "]" / "?" / "="
               ; Must be in quoted-string,
               ; to use within parameter values

   Note that the definition of "tspecials" is the same as the RFC 822
   definition of "specials" with the addition of the three characters
   "/", "?", and "=", and the removal of ".".

RFC2045 states this for Content-Type parameters:

     parameter := attribute "=" value

     attribute := token
                  ; Matching of attributes
                  ; is ALWAYS case-insensitive.

     value := token / quoted-string

     token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
                 or tspecials>

     tspecials :=  "(" / ")" / "<" / ">" / "@" /
                   "," / ";" / ":" / "\" / <">
                   "/" / "[" / "]" / "?" / "="
                   ; Must be in quoted-string,
                   ; to use within parameter values

   Note that the definition of "tspecials" is the same as the RFC 822
   definition of "specials" with the addition of the three characters
   "/", "?", and "=", and the removal of ".".

I see some confusion there with the comment "Must be in quoted-string, to use within parameter values". As tspecials are not mentioned as part of token and has to be quoted.

exander77 commented 3 years ago

@jameshoulahan What do you think is the best way to solve this? This takes 2/3 of my failed messages, all produced by NODEMAILER. It may be already changed for years: https://github.com/nodemailer/nodemailer/pull/29/commits/10e5ec59dd3960df1479bbd808bb93ee485d1506 But with the way RFC is written it may easily resurface.

exander77 commented 3 years ago

I think that this could use a custom parser as well. You are already hacking it a bit before passing it to golang mime library. And after an address and date parsing, it is the next important thing.

func (p *Part) ContentType() (string, map[string]string, error) {
  t, params, err := p.Header.ContentType()
  if err != nil {
    // go-message's implementation of ContentType() doesn't handle duplicate parameters
    // e.g. Content-Type: text/plain; charset=utf-8; charset=UTF-8
    // so if it fails, we try again with pmmime's implementation, which does.
    t, params, err = pmmime.ParseMediaType(p.Header.Get("Content-Type"))
  }

  return t, params, err
}

You are already wrapping it here. And I would note that it goes directly into golang mime in this issue.

jameshoulahan commented 3 years ago

Hi. These errors are known to us and are already on our backlog to investigate. I'm glad you went ahead and started to look into them already.

Currently, we use the github.com/emersion/go-message library to determine the message's mime structure. If that library can't read the message properly (one example is due to unreadable mime boundaries), bridge's hands are kind of tied.

With the example you gave in your first comment,

Content-Type: multipart/alternative; boundary=----NODEMAILER-?=_1-1327405538300

how far exactly did bridge get through the message.Parse(...) function (pkg/message/parser.go:38), do you know? I expect the error occured in the call to parser.New(r), which would return an error wrapped like failed to create new parser: .... That would mean that the go-message library itself couldn't read the message. If this happens, we can't continue to parse the rest of the message (headers, attachments, bodies, ...) because we don't know the mime structure.

Could you let me know exactly how far through message.Parse(...) the execution gets? If the error comes from within go-message, we will likely need to upstream changes to that repository.

exander77 commented 3 years ago

@jameshoulahan Yes it fails in the library when creating new parser:

failed to create new parser: multipart: boundary is empty:

Problem is in Content-Type parsing through ParseMediaType, the boundary is not correctly detected and all parsing fails because of that.

exander77 commented 3 years ago

I think it could be easily fixed with something like this, but it is happening in a library:

var reFixContentType = regexp.MustCompile("(boundary=)([^\"][^ ]*)") // nolint[gochecknoglobals] 
reFixContentType.ReplaceAllString(contentType, "$1\"$2\"")

I blame RFC1521 for this. It defines boundary separately and recommends =_ to be present in a boundary and then defines parameters of Content-Type. And if you use =_ without quoting described in parameters of Content-Type you have problems. So basically the definition of a boundary is dependant on a quoting mentioned in parameters. I have like 100 emails affected by this. The is the most prevalent issue I currently have.

elight commented 1 month ago

Is the Import-Export App even supported any more? The docs for the app are still up on the Proton site yet logging in fails on all platforms with a message of needing the latest version. Support, meanwhile, indicates that alternate import methods like "use Thunderbird". This... is bad.

ProtonMail / proton-bridge

Import-Export tool (and Proton-Bridge) badly parses Content-Type boundary with special characters like ? or = and fails on such multiplart messages #121