emersion / go-message

✉️ A streaming Go library for the Internet Message Format and mail messages
MIT License
373 stars 108 forks source link

Unhandled charset "ISO-2022-JP" on WriteTo, despite it being present #169

Closed dvcrn closed 1 year ago

dvcrn commented 1 year ago

Hi,

first of all thanks for maintaining this! It takes away a huge amount of headache with parsing messages. I keep getting logs messages about unhandled charsets:

    out := bytes.Buffer{}
    if err := entity.WriteTo(&out); err != nil {
        fmt.Println("error writing entity", err.Error())
        return nil, err
    }

// -> error writing entity unhandled charset "ISO-2022-JP"

Importing

    "github.com/emersion/go-message"
    charset "github.com/emersion/go-message/charset"

at the beginning of the file. Also tried manually specifying the ISO-2022-JP charset (though that's already done with the same encoding package in the packages encoding uses under the hood):

func init() {
    charset.RegisterEncoding("iso-2022-jp", japanese.ISO2022JP)
    charset.RegisterEncoding("ISO-2022-JP", japanese.ISO2022JP)
}

ISO2022JP is not a small cryptic charset and should be supported out of the box in the charset package, eg : https://cs.opensource.google/go/x/text/+/refs/tags/v0.10.0:encoding/ianaindex/ianaindex.go;l=206-209

Is this supposed to work as is, or does writeTo not consider the charset the same way Read() does?

emersion commented 1 year ago

Only UTF-8 is supported for writing. Creating new non-UTF-8 messages is doing a disservice to the ecosystem.

dvcrn commented 1 year ago

While I understand that, it's sadly often not up to us to decide that, especially when integrating with existing services

Any quick recommendation to re-encode an entire message into UTF8? Was hoping WriteTo would just do that because reading is already decoding all the charsets

    t, _, err := entity.Header.ContentType()
    if err != nil {
        fmt.Println("error reading content type", err.Error())
        return nil
    }
    entity.Header.SetContentType(t, map[string]string{"charset": "utf-8"})

In my quick test that resulted in a lot of chaos when those emails were rendered

emersion commented 1 year ago

Hm, yeah, go-message should encode everything as UTF-8. What kind of chaos were you running into?

dvcrn commented 1 year ago

I'm using the library in combination with go-imap. Adding SetContentType is getting rid of the "Unhandled charset" error, but causing emails to get rendered like this:

IMG_0360

Full code:

func RewriteHeader(emailBytes []byte, targetHeader string, targetValue string) ([]byte, error) {
    targetHeader = strings.TrimSpace(targetHeader)
    targetValue = strings.TrimSpace(targetValue)

    entity, err := message.Read(bytes.NewReader(emailBytes))
    if err != nil {
        fmt.Println("error reading message into entity", err.Error())
        return nil, err
    }

    entity.Header.Set(targetHeader, targetValue)

    // reconstruct email with rewritten headers
    out := bytes.Buffer{}

    //t, _, err := entity.Header.ContentType()
    //if err != nil {
    //  fmt.Println("error reading content type", err.Error())
    //  return nil, err
    //}
    //entity.Header.SetContentType(t, map[string]string{"charset": "utf-8"})
    if err := entity.WriteTo(&out); err != nil {
        fmt.Println("error writing message into entity", err.Error())
        return nil, err
    }

    return out.Bytes(), nil
}

I didn't have time to properly debug this yet, may be something obvious as well

emersion commented 1 year ago

I don't think it's correct to set the Content-Type on the whole entity: it's a multipart message.

Although it's a bit surprising to me, WriteTo will not decode charsets for nested parts... So I wonder what's going on here. Maybe you're hitting the "invalid charset" issue only for single-part messages?

A quick fix would be to check strings.HasPrefix(t, "multipart/") before setting the utf-8 charset. Also I'd recommend retaining the params returned by ContentType(), only overwriting the charset but keeping the rest of the params as-is.

But if you just want to set a header value, you don't need any of go-message's charset stuff: you can just use the lower-level go-message/textproto package instead.

dvcrn commented 1 year ago

Thanks, let me give that a try and report back!

But if you just want to set a header value, you don't need any of go-message's charset stuff: you can just use the lower-level go-message/textproto package instead.

Yes, but the charset stuff is what makes it so nice to extract some headers, act on them and conditionally overwrite, otherwise I would need to decode the headers manually based on the charset that they're in. go-message does all of that for me 😄

emersion commented 1 year ago

If you are only interested in decoding header fields, and can treat the body as an opaque blob, then you can use go-message/textproto to read the header, then wrap it with message.Header to get the charset-enabled functions.

dvcrn commented 1 year ago

Forgot to update - using textproto directly, then wrapping with message.Header works perfectly for extracting stuff, thanks for the tip. Skipping content-type headers that start with multipart/ stopped causing rendering weirdness in email clients, but go-message refuses to write messages with those into a non-utf8 charset as described earlier

eg: unhandled charset "windows-1252" header: multipart/alternative map[boundary:----=_NextPart_033_6113F784.6113F784 charset:windows-1252], where the charset is part of the multipart

My final solution was more or less what you suggested - only touch the headers with lowlevel textproto and don't do anything with the body. Something like this:

    hdr, err := textproto.ReadHeader(bufio.NewReader(bytes.NewBuffer(emailBytes)))
    if err != nil {
         panic(err)
    }

    wrappedHdr := message.Header{Header: hdr}
    wrappedHdr.Set(targetHeader, targetValue)

    mail, err := mail.ReadMessage(bytes.NewReader(emailBytes))
    if err != nil {
         panic(err)
    }

    out := bytes.Buffer{}

    if err := textproto.WriteHeader(&out, wrappedHdr.Header); err != nil {
         panic(err)
    }
    if _, err := io.Copy(&out, mail.Body); err != nil {
         panic(err)
    }
emersion commented 1 year ago

Note, charset is meaningless for multipart messages.