deltachat / deltachat-core

Delta.Chat C-Library with e2e chat-over-email functionality & Python bindings
https://c.delta.chat

When sending attachments, use proper encoding for filenames #98

Closed hpk42 closed 6 years ago

hpk42 commented 6 years ago

while working on parsing/decrypting messages sent from Delta with muacrypt i noticed that DC puts attachment filenames into the headers as raw UTF-8. As @dkg pointed out on #autocrypt, this needs to use a special header encoding instead (header values need to be 7-bit clean). For example, mutt creates this for an umlaut-containing filename:

Content-Disposition: attachment; filename*=iso-8859-1''%FCbersicht%2Etxt

enigmail rather uses encodings as found in RFC2047 and particularly the "8. example" section.

Mutt seems to be using RFC5987 which some claim is the right solution at least in http headers. @dkg says the mutt filename parameter encoding is rather what RFC2231 specifies ...

r10s commented 6 years ago

@hpk42 i sent you a test message - if this works, feel free to merge PR #169 (that does exactly this change) or let me know.

r10s commented 6 years ago

okay, it does not work, would have been too simple :)

i try to summarize things a bit to see things clearer myself:

I. This is what is sent out by Delta with the encode-filename-fix from above ...

Content-Disposition: attachment;
 filename="=?utf-8?Q?test=C3=A4=C3=B6=C3=BC.txt?="

... understood at least by Thunderbird and K-9 but not by mutt. (without the encode-filename-fix from above, Delta would send out filename="testäöü.txt", which is definitely wrong for non-ascii characters)

II. This is what is sent out by Thunderbird ...

Content-Type: text/plain; charset=UTF-8;
 name="=?UTF-8?B?dGVzdMOkw7bDvC50eHQ=?="
Content-Disposition: attachment;
 filename*=utf-8''%74%65%73%74%C3%A4%C3%B6%C3%BC%2E%74%78%74

... if the attachment name is plain ascii, filename instead of filename* is used. The filename is also added to Content-Type->name, where it is picked up by Delta (Delta tries to read the name from Content-Disposition->filename first and then from Content-Type->name. Currently Delta does not regard Content-Disposition->filename*).
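
The lookup order described here (Content-Disposition filename first, then Content-Type name) can be sketched with Python's standard email package. This is only an illustration and not Delta's libEtPan-based code; the helper name attachment_name is hypothetical. Note that the stdlib parser also decodes the filename* form.

```python
from email import message_from_string
from email.utils import collapse_rfc2231_value


def attachment_name(raw_headers):
    """Return the attachment name, preferring Content-Disposition over Content-Type."""
    msg = message_from_string(raw_headers + "\n\n")
    lookups = (("filename", "content-disposition"), ("name", "content-type"))
    for param, header in lookups:
        value = msg.get_param(param, header=header)
        if value is not None:
            # RFC 2231 values come back as (charset, language, text) tuples;
            # plain values come back as strings and pass through unchanged.
            return collapse_rfc2231_value(value)
    return None


thunderbird = (
    "Content-Type: text/plain; charset=UTF-8;\n"
    ' name="=?UTF-8?B?dGVzdMOkw7bDvC50eHQ=?="\n'
    "Content-Disposition: attachment;\n"
    " filename*=utf-8''%74%65%73%74%C3%A4%C3%B6%C3%BC%2E%74%78%74"
)
print(attachment_name(thunderbird))
```

If only Content-Type->name is present and it carries an RFC 2047 encoded-word, this sketch returns the encoded-word undecoded; that would need an extra pass through email.header.decode_header.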

III. This is what is sent out by K-9 ...

Content-Type: text/plain;
 name="=?ISO-8859-1?B?dGVzdOT2/C50eHQ=?="
Content-Disposition: =?ISO-8859-1?Q?attachment=3B=0D=0A_filename=3D?= =?ISO-8859-1?Q?=22test=E4=F6=FC=2Etxt=22=3B=0D=0A_size=3D39?=

... K-9 encodes the whole Content-Disposition field (@Valodim is this correct? or even the only correct way? at least Delta resp. libEtPan does not understand this, but it still works by reading Content-Type->name). The field decodes to attachment;\r\n filename="testäöü.txt";\r\n size=39, much the same as in Delta. However, I assume that mutt can read this through the additional Content-Type->name parameter - @hpk42 can you confirm this?
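
To check what the K-9 header decodes to, here is a small sketch using Python's email.header.decode_header. The helper name decode_encoded_words is made up for this example; it only decodes the encoded-words and does not parse the resulting parameters.

```python
from email.header import decode_header

# Content-Disposition value exactly as sent by K-9 in the example above
K9_VALUE = (
    "=?ISO-8859-1?Q?attachment=3B=0D=0A_filename=3D?= "
    "=?ISO-8859-1?Q?=22test=E4=F6=FC=2Etxt=22=3B=0D=0A_size=3D39?="
)


def decode_encoded_words(value):
    """Decode RFC 2047 encoded-words and join the pieces into one string."""
    pieces = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        pieces.append(chunk)
    return "".join(pieces)


decoded = decode_encoded_words(K9_VALUE)
print(repr(decoded))
```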

UPDATE: IV. This is what is sent out by mutt ...

Content-Disposition: attachment; filename="umlaut2.png"

if the filename contains non-ascii-characters, mutt switches to filename*:

Content-Disposition: attachment; filename*=iso-8859-1''h%E4%E4llo%2Etxt

Content-Type.name seems not to be set by mutt.


The question is what to send out. Some options:

  1. add the filename in the =?utf-8?-encoding also to Content-Type->name; i would assume mutt would pick it up from there. this would be the easiest solution as no additional encoder is required. Content-Type->name might have different semantics than Content-Disposition->filename, however, in practice, i've not seen differences. Also, this is done by K-9 and Thunderbird as well, so it cannot be that wrong :)

  2. support filename*, however, this would be some work as one would probably also have to support reading this format. Moreover, this would not make 1. superfluous.

r10s commented 6 years ago

after some testing: mutt does not seem to pick up the name from Content-Type.name, so 1. would not help in this case - can anyone confirm this? Also, do attachment names from K-9 -> mutt work?

seems as if 2. is the way to go (this is also the Thunderbird approach)

csb0730 commented 6 years ago

By the way

filename*=utf-8''%74%65%73%74%C3%A4%C3%B6%C3%BC%2E%74%78%74

IMHO this is RFC2231 encoding which mutt uses.

This means in detail:

filename*= means rfc 2231 encoded text
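
To make the structure concrete: the value after filename*= has the form charset'language'percent-encoded-bytes. A minimal Python sketch of decoding it, assuming the value has already been split off from the header (the helper name decode_extended_value is hypothetical):

```python
from email.utils import decode_rfc2231
from urllib.parse import unquote_to_bytes


def decode_extended_value(ext_value):
    """Split charset'language'text and decode the percent-encoded text."""
    charset, language, encoded = decode_rfc2231(ext_value)
    if charset is None:
        # No charset'language' prefix found: return the value unchanged.
        return ext_value
    # Percent-decoding yields raw bytes, which are then read in the charset.
    return unquote_to_bytes(encoded).decode(charset)


print(decode_extended_value("utf-8''%74%65%73%74%C3%A4%C3%B6%C3%BC%2E%74%78%74"))
print(decode_extended_value("iso-8859-1''h%E4%E4llo%2Etxt"))
```

The language field between the two apostrophes is empty in all examples in this thread and is ignored here.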

csb0730 commented 6 years ago

One remark to the example above:

Content-Type: text/plain; name="=?ISO-8859-1?B?dGVzdOT2/C50eHQ=?=" ==> this is ok, parameter name itself is not encoded (parameter name is "name" ;-) )

Content-Disposition: =?ISO-8859-1?Q?attachment=3B=0D=0A_filename=3D?= =?ISO-8859-1?Q?=22test=E4=F6=FC=2Etxt=22=3B=0D=0A_size=3D39?= ==> encoding of "filename" here is not ok, parameter name itself is encoded !

rfc 2047 says:

+ An 'encoded-word' MUST NOT be used in parameter of a MIME Content-Type or Content-Disposition field, or in any structured field body except within a 'comment' or 'phrase'.

r10s commented 6 years ago

"IMHO this is RFC2231 encoding"

yes, it is. I think we should check if a filename only contains plain-ascii characters; if so, we use filename, otherwise we use filename* with RFC2231 encoding. In addition to that, we add the file name to Content-Type->name

This is what Thunderbird does currently.
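
A rough sketch of that sender-side decision, written in Python for brevity rather than in the C of the core library. The function name disposition_header is hypothetical, UTF-8 is assumed as the charset, and only the Content-Disposition part is covered (the additional Content-Type->name parameter is left out).

```python
from email.utils import encode_rfc2231


def disposition_header(filename):
    """Build a Content-Disposition value: filename for ASCII, filename* otherwise."""
    if filename.isascii():
        # Plain ASCII: an ordinary quoted parameter is sufficient.
        escaped = filename.replace("\\", "\\\\").replace('"', '\\"')
        return f'attachment; filename="{escaped}"'
    # Non-ASCII: RFC 2231 extended parameter, charset'language'percent-encoded.
    return "attachment; filename*=" + encode_rfc2231(filename, charset="utf-8")


print(disposition_header("umlaut2.png"))
print(disposition_header("testäöü.txt"))
```

Unlike the Thunderbird example earlier in the thread, encode_rfc2231 percent-encodes only the bytes that need it, so the ASCII letters stay readable; both forms decode to the same filename.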

r10s commented 6 years ago

closed by accident ...