emersion / go-message

✉️ A streaming Go library for the Internet Message Format and mail messages
MIT License
382 stars 111 forks source link

Inline vs Attachment part logic is insufficient for emails in the wild #138

Open vkryukov opened 3 years ago

vkryukov commented 3 years ago

Hello,

with the current implementation, interface PartHeader is implemented as either InlineHeader or AttachmentHeader that both embed message.Header. And although it's not documented by the package, the wiki example implies that PartHeader will always be one of the two.

The logic to decide whether its AttachmentHeader or PartHeader,

if disp == "inline" || (disp != "attachment" && strings.HasPrefix(t, "text/")) {
    mp.Header = &InlineHeader{p.Header}
} else {
    mp.Header = &AttachmentHeader{p.Header}
}

unfortunately, doesn't always work for emails in the wild. E.g., I have a message with the following part header:

Content-Type: text/plain; name="emailreceipt_20121015R2315576090.pdf"
Content-Disposition: inline; filename=emailreceipt_20121015R2315576090.pdf

%PDF-1.4..%...

It's clearly a PDF attachment, but is currently classified as InlineHeader, so we cannot use AttachmentHeader.Filename to extract the filename.

To solve this problem, the user code can of course cast PartHeader as Header (which will succeed with the current implementation), and then effectively re-implement AttachmentHeader.Filename logic by perusing Header.ContentDisposition and Header.ContentType methods.

The alternative would be, at a minimum, to extend PartHeader interface to directly expose ContentDisposition and ContentType, which will help the user code that works with malformed parts. (If this indeed a desired direction, I'm happy to create a pull request).

Another alternative would be to improve the Inline vs Attachment logic, but it will probably be brittle as you cannot enumerate badness in the wild.

Thoughts?

emersion commented 3 years ago

Hm, this seems to be allowed by the RFC and indicate that the PDF attachment should be displayed with the text of the message.

The RFCs allow very surprising things, such as multiple inline parts mixed together with a different MIME type each.

vkryukov commented 3 years ago

Thanks for your quick response!

I think this particular set of MIME headers is a violation of the standard, as e.g. RFC 2045 section 5 (Content-Type Header field) prescribes that

For this reason, registered subtypes of text, image, audio, and video should not contain embedded information that is really of a different type.

and RFC 2046 Section 3 says

text -- textual information. The subtype "plain" in particular indicates plain text containing no formatting commands or directives of any sort. Plain text is intended to be displayed "as-is". No special software is required to get the full meaning of the text, aside from support for the indicated character set.

Of course, we will always see a badly formatted email in the wild, and no library can expect to handle all of these cases - "you can't enumerate badness". I'm wondering what is the best way for a library like go-message to support users who have to deal with all of this weirdness.

Here is one example: we can extend PartHeader to include ContentType() and ContentDisposition() methods, as they can give additional useful information about the nature of the part if Content-Type in the header is lying like in the example above. This is not necessary, of course - in the current implementation, the following works perfectly fine:

type header interface {
    ContentType() (t string, params map[string]string, err error)
    ContentDisposition() (t string, params map[string]string, err error)
}
h, ok := p.Header.(header)
if !ok {
    // Impossible with the current implementation
}

but it would make writing the user code easier. Another good candidate would be Header.Fields() to enumerate all the fields in the part header.

If you feel that this is unnecessary, please feel free to close this issue.

emersion commented 3 years ago

Oh. text/plain with a PDF file, and all in a inline part. Completely missed that, I thought the MIME type was correct.

FWIW, it's also possible for have non-textual parts with an inline disposition. For instance Apple Mail will do this to embed images in a plain-text message: the message will contain interleaved text/plain and image/* parts, all of which are inline and wrapped in a multipart/mixed part.

Regarding your suggestion, maybe we can just add these to the existing PartHeader interface? I wonder how far we should go, there are a few more methods implemented by both types of headers.

emersion commented 3 years ago

Ah, this is exactly what you suggest. Yes, this sounds reasonable.

vkryukov commented 3 years ago

Cool - why don't I work on a pull request then? I will add ContentType and ContentDisposition for now.