emersion / hydroxide

A third-party, open-source ProtonMail CardDAV, IMAP and SMTP bridge
MIT License

text/plain being lost #112

Open samhh opened 4 years ago

samhh commented 4 years ago

I'm using Hydroxide with aerc and noticed that none of the emails I read are coming through as text/plain, even if the header in the ProtonMail web UI says Content-Type: text/plain. I've run Hydroxide with -debug and am seeing the following:

NteoSA UID STORE 2078 +FLAGS.SILENT (\Seen)
2020/07/30 13:53:07 >> PUT /api/messages/read
2020/07/30 13:53:07 {"IDs":["REDACTED"]}
2020/07/30 13:53:07 << PUT /api/messages/read
2020/07/30 13:53:07 &protonmail.resp{Code:1001, RawAPIError:(*protonmail.RawAPIError)(nil)}
2020/07/30 13:53:07 >> GET /api/events/REDACTED
2020/07/30 13:53:07 << GET /api/events/REDACTED
2020/07/30 13:53:07 &struct { protonmail.resp; *protonmail.Event }{resp:protonmail.resp{Code:1000, RawAPIError:(*protonmail.RawAPIError)(nil)}, Event:(*protonmail.Event)(0xc000182d80)}
NteoSA OK UID STORE completed
HTCrwQ UID FETCH 2078 (ENVELOPE UID BODYSTRUCTURE FLAGS BODY.PEEK[1.MIME] BODY[1])
2020/07/30 13:53:07 >> GET /api/messages/REDACTED
2020/07/30 13:53:07 << GET /api/messages/REDACTED
2020/07/30 13:53:07 &struct { protonmail.resp; Message *protonmail.Message }{resp:protonmail.resp{Code:1000, RawAPIError:(*protonmail.RawAPIError)(nil)}, Message:(*protonmail.Message)(0xc000682d80)}
2020/07/30 13:53:07 >> GET /api/messages/REDACTED
2020/07/30 13:53:07 << GET /api/messages/REDACTED
2020/07/30 13:53:07 &struct { protonmail.resp; Message *protonmail.Message }{resp:protonmail.resp{Code:1000, RawAPIError:(*protonmail.RawAPIError)(nil)}, Message:(*protonmail.Message)(0xc000703980)}
* 2078 FETCH (ENVELOPE ("Wed, 29 Jul 2020 21:49:25 +0100" "REDACTED" (("REDACTED" NIL "forwarding-noreply" "REDACTED")) () () ((NIL NIL "REDACTED" "REDACTED")) () () "" "REDACTED") UID 2078 BODYSTRUCTURE (("text" "html" () NIL NIL "quoted-printable" 0 0 NIL ("inline" ()) NIL NIL) "mixed" NIL NIL NIL NIL) FLAGS (\Seen) BODY[1.MIME] {73}
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain

 BODY[1] {1686}
REDACTED
)
HTCrwQ OK UID FETCH completed
2lf9uQ UID FETCH 2078 (BODYSTRUCTURE ENVELOPE INTERNALDATE FLAGS UID BODY.PEEK[HEADER])
* 2078 FETCH (BODYSTRUCTURE (("text" "html" () NIL NIL "quoted-printable" 0 0 NIL ("inline" ()) NIL NIL) "mixed" NIL NIL NIL NIL) ENVELOPE ("Wed, 29 Jul 2020 21:49:25 +0100" "REDACTED" (("REDACTED" NIL "forwarding-noreply" "REDACTED")) () () ((NIL NIL "REDACTED" "REDACTED")) () () "" "REDACTED") INTERNALDATE "29-Jul-2020 21:49:25 +0100" FLAGS (\Seen) UID 2078 BODY[HEADER] {406}
Content-Type: multipart/mixed;
 boundary=REDACTED
Message-Id: REDACTED
To: <REDACTED>
From: "REDACTED" <REDACTED>
Subject: REDACTED
Date: Wed, 29 Jul 2020 21:49:25 +0100

)
2lf9uQ OK UID FETCH completed

Is this a bug or am I missing something obvious? :slightly_smiling_face: Cheers.

emersion commented 4 years ago

Might be a bug. But ProtonMail only keeps the text/html part when an email contains both text/plain and text/html, so maybe that's related too.

sbinet commented 3 years ago

ah, that's my issue as well, the one from the aerc-list

changing the default from "text/html" to "text/plain" "fixed" it for me: https://github.com/emersion/hydroxide/blob/86db01792a9ca03fc06c67a973c5ba35be878736/imap/message.go#L110

of course, that's probably not a completely satisfactory solution, as Proton's default seems to be text/html.

perhaps we could leave the "list message" code as is (so, leaving msg.MIMEType=="" and a body structure with "text/html") but then, when protonmail.Client.GetMessage(id) comes along, update the MIMEType field with the correct value in the db?

or perhaps there's a way to tell the Proton server to give more information?

sbinet commented 3 years ago

circling back to this. I've been using the patch from #129 since Jan 2021. it's working ok for some messages but not for all.

perhaps a better course of action would be to escalate this to Proton?

emersion commented 3 years ago

What happens when it doesn't work? Is it showing HTML content with the text/plain MIME type, or the other way around? Or maybe it's just showing HTML content with text/html?

samhh commented 3 years ago

In my case, in aerc, both text/plain and text/html (as described in the web UI headers) are detected/rendered as text/html.

sbinet commented 3 years ago

ditto. kind of. for example, "even with my patch" (for whatever value one may want to attach to this :P), mails from the lists.sr.ht lists still get labeled as text/html.

samhh commented 3 years ago

Worth noting that I don't see this issue with a Migadu account so this is probably a Hydroxide/PM issue rather than an aerc issue.

WyntrHeart commented 4 months ago

I still have this problem: text/plain emails are being treated as text/html, which is screwing with the formatting. Any update on a fix?

BlankEclair commented 2 months ago

I finally figured out a workaround, so I figure I should document it here (and some other workarounds I've thought of, but dismissed).

Treating messages with unknown types as text/plain

Currently, when hydroxide does not know a part's MIME type, it guesses text/html: https://github.com/emersion/hydroxide/blob/c964219ad4996d90b34d730b98a8c736b9bc9921/imap/message.go#L116-L120

One could replace inlineSubType := "html" with inlineSubType := "plain" to make it treat unknown types as text/plain, but this would cause HTML emails to be treated as plain text.

If you then use a filter that reformats text/html emails as plain text, this would cause HTML messages to be displayed raw. Perhaps good enough for those who don't use such filters or who rarely receive HTML mail, but HTML emails are unfortunately too common for me to accept that.
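To make the trade-off concrete, here is a minimal sketch of the fallback decision. It is not hydroxide's code: `inlineSubtype` and its `fallback` parameter are hypothetical names, and the only assumption taken from the thread is that the MIME type is empty when a message has merely been listed.

```go
package main

import (
	"fmt"
	"mime"
	"strings"
)

// inlineSubtype returns the text subtype to advertise for a message body.
// mimeType is whatever the API reported (assumed empty for listed-only
// messages); fallback is the guess used when the type is unknown.
// Hypothetical helper for illustration, not part of hydroxide.
func inlineSubtype(mimeType, fallback string) string {
	mediaType, _, err := mime.ParseMediaType(mimeType)
	if err != nil || !strings.HasPrefix(mediaType, "text/") {
		// Empty, malformed, or non-text type: we have to guess.
		return fallback
	}
	return strings.TrimPrefix(mediaType, "text/")
}

func main() {
	fmt.Println(inlineSubtype("text/plain", "html")) // known type wins
	fmt.Println(inlineSubtype("", "html"))           // unknown, guess "html"
	fmt.Println(inlineSubtype("", "plain"))          // unknown, guess "plain"
}
```

Whichever fallback is chosen, one class of mail is mislabeled until the real type is known, which is why changing the default alone only moves the problem.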

Loading all messages when listing

text/plain is lost because ProtonMail doesn't send the MIME type when emails are being listed; it is only sent when a specific message is fetched. Hydroxide could be set up to fetch every message in full while listing, before they are handled, and Hydroxide does in fact do this for emails with attachments: https://github.com/emersion/hydroxide/blob/c964219ad4996d90b34d730b98a8c736b9bc9921/imap/message.go#L108-L114

Someone could simply remove the if condition to make it fetch all emails. However, this would, well, cause every message to be fetched individually when listing all emails, which can be slow. Perhaps there could be a local cache for this so it becomes less slow in future runs, but I'm too lazy to figure out how to implement that.
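A rough sketch of what such a cache might look like, assuming the MIME type only becomes known after a per-message fetch. Everything here is hypothetical (`mimeCache`, `lookup`, and the `fetch` callback standing in for the per-message API request); it is an in-memory illustration, so it would not survive a restart without being persisted somewhere.

```go
package main

import (
	"fmt"
	"sync"
)

// mimeCache remembers each message's MIME type once a full fetch has
// revealed it. Hypothetical type for illustration, not part of hydroxide.
type mimeCache struct {
	mu    sync.Mutex
	types map[string]string // message ID -> MIME type
}

func newMIMECache() *mimeCache {
	return &mimeCache{types: make(map[string]string)}
}

// lookup returns the cached type for id, calling fetch (a stand-in for
// the slow per-message request) only on a cache miss.
func (c *mimeCache) lookup(id string, fetch func(id string) (string, error)) (string, error) {
	c.mu.Lock()
	t, ok := c.types[id]
	c.mu.Unlock()
	if ok {
		return t, nil
	}
	t, err := fetch(id)
	if err != nil {
		return "", err // errors are not cached, so a later call retries
	}
	c.mu.Lock()
	c.types[id] = t
	c.mu.Unlock()
	return t, nil
}

func main() {
	calls := 0
	fetch := func(id string) (string, error) {
		calls++
		return "text/plain", nil
	}
	c := newMIMECache()
	for i := 0; i < 3; i++ {
		t, _ := c.lookup("msg-1", fetch)
		fmt.Println(t)
	}
	fmt.Println("fetches:", calls) // only the first lookup reaches fetch
}
```

The first listing would still pay for one request per message; the saving only shows up on later listings.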

Additionally, at least when using aerc, this causes Non-multipart body part doesn't have 7 fields errors on certain multipart/mixed emails. (I've looked at the traffic via Wireshark, and apparently the MIME type sent to the client in the BodyStructure is "mixed" NIL? I'm not familiar with IMAP so I don't know if I use the proper terminology, and I am, again, too lazy to figure out a proper fix).

Having your text/html filter only activate on emails with <html

Personally, the symptom I face with this bug is plain text emails being fed through an HTML renderer, which collapses whitespace (including newlines!), causing emails to be unreadable.

I eventually decided to make my text/html filter use file for the first 1024 bytes of the email to check that it is an HTML email. If so, it runs it through what I would use to process HTML emails. If not, it runs it through what would've been done for plaintext emails:

#!/bin/sh
# Workaround Hydroxide returning text/html for plain text emails

TEMPFILE=$(mktemp)
cat > "$TEMPFILE"

if [ "$(head -c 1024 "$TEMPFILE" | file --brief --mime-type -)" = text/html ]; then
    # Replace this block with code that is run for text/html emails

    # aerc filter which runs w3m using socksify (from the dante package) to prevent
    # any phoning home by rendered emails
    export SOCKS_SERVER="127.0.0.1:1"
    socksify w3m \
        -I UTF-8 \
        -T text/html \
        -cols "$(tput cols)" \
        -dump \
        -o display_image=false \
        -o display_link_number=true \
        "$TEMPFILE"
    EXIT=$?
else
    # Replace this block with code that is run for text/* emails

    awk -f /usr/lib/aerc/filters/plaintext < "$TEMPFILE"
    EXIT=$?
fi

rm "$TEMPFILE"
exit $EXIT

I saved the above script to ~/.config/aerc/html-filter, and changed my text/html filter from /usr/lib/aerc/filters/html to ~/.config/aerc/html-filter. So far, works flawlessly well enough.

Edit: Of course, just as I said that scanning the first 1024 bytes for <html "works flawlessly", I find out that GitHub emails do not contain that substring. At least they contain <script>, which is handled by file(1). I've updated this section to now use file.
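For comparison, the same content-sniffing idea can be sketched in Go. This swaps in a different technique from the script above: instead of file(1), it uses the standard library's `http.DetectContentType`, which applies the WHATWG MIME sniffing rules to at most the first 512 bytes. `looksLikeHTML` is a hypothetical helper and nothing here is taken from hydroxide.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// looksLikeHTML reports whether the start of body sniffs as HTML.
// http.DetectContentType considers at most the first 512 bytes and
// returns values such as "text/html; charset=utf-8".
func looksLikeHTML(body []byte) bool {
	return strings.HasPrefix(http.DetectContentType(body), "text/html")
}

func main() {
	samples := []string{
		"<html><body><p>hi</p></body></html>",
		"Hello,\n\nthis is plain text.\n",
		"<script>var x = 1;</script>",
	}
	for _, s := range samples {
		fmt.Printf("%q -> html=%v\n", s, looksLikeHTML([]byte(s)))
	}
}
```

Like the file(1) check, this is a heuristic: an HTML fragment that does not start with a recognized tag would be classified as plain text.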