deltachat / deltachat-core

Delta.Chat C-Library with e2e chat-over-email functionality & Python bindings
https://c.delta.chat

Attachment not accessible due to no decoding of filename #162

Closed csb0730 closed 6 years ago

csb0730 commented 6 years ago

Some attachments use a special encoding for the filename (see picture). This encoding is described in an RFC, but it is not decoded properly here, so bad filenames are generated when the attachment is stored. As a result, these attachments are not accessible later.

[Screenshot: wrong decoding of filename]

r10s commented 6 years ago

might be related to https://github.com/deltachat/deltachat-core/issues/98

csb0730 commented 6 years ago

Yes, in some way. #98 describes encoding; this issue describes decoding of filenames (text). RFC 2047 seems to be the relevant spec.

csb0730 commented 6 years ago

Hi @r10s, if you tell me the best place in the code to start, I'll investigate. I think we simply need decoding for RFC 2047 encodings here and that's it.

Are we again in mrmimeparser.c ? do_add_single_file_part() ?

r10s commented 6 years ago

yes, around there. i would start around https://github.com/deltachat/deltachat-core/blob/master/src/mrmimeparser.c#L1137 and check the different sources of desired_filename - i've just added some comments about which headers are parsed.

the filename itself comes from libEtPan, probably with different encodings for the first and the second source.

regarding the encoding, Wikipedia says RFC 2231 - https://en.wikipedia.org/wiki/MIME#Content-Disposition - so, this should be double-checked :)

csb0730 commented 6 years ago

As far as I can see, RFC 2231 is not related to this issue. An examination of the .eml file shows that the attachment's filename is simply not decoded. It's built as an encoded-word construct as defined in RFC 2047.

Here an excerpt from the example email source:

Subject: =?iso-8859-1?Q?xxxxxx xxxx =D6sterreich xxx =22xxxxxxxxxx=22 xxx?=

==> is displayed correctly as

xxxxxx xxxx Österreich xxx "xxxxxxxxxx" xxx

Now the attachment:

------=_NextPart_000_0044_01D3EBBB.EBD4A630

Content-Type: application/pdf;
    name="=?iso-8859-1?Q?xxxxxxxxxxxxxxxxxx.xxx-xxxxxx_xxxx_=D6sterreich_xxxxxxxxx-?=
    =?iso-8859-1?Q?xxxx_xxx.pdf?="
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
    filename="=?iso-8859-1?Q?xxxxxxxxxxxxxxxxxx.xxx-xxxxxx_xxxx_=D6sterreich_xxxxxxxxx-?=
    =?iso-8859-1?Q?xxxx_xxx.pdf?="

==> here no decoding of filename, full encoded line is used as filename =?iso-8859-1?Q?xxxxxx ...

The correct filename would be:

xxxxxxxxxxxxxxxxxx.xxx-xxxxxx xxxx Österreich xxx xxxxx-xxxx xxx.pdf
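For readers following along, the "Q" decoding that is missing here can be sketched roughly as below. This is only an illustration under stated assumptions, not the code used in deltachat-core: the function name `q_decode_latin1_to_utf8` is hypothetical, it handles only the payload between `?Q?` and `?=`, and it assumes the charset is iso-8859-1 as in the example mail. In "Q" encoding, `_` stands for a space and `=XX` for one raw byte.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper (not part of deltachat-core): decode the payload of an
 * RFC 2047 "Q" encoded-word whose charset is iso-8859-1 and write it as
 * NUL-terminated UTF-8 into out. Returns 0 on success, -1 if out is too
 * small. */
int q_decode_latin1_to_utf8(const char *in, char *out, size_t out_size)
{
    size_t o = 0;
    assert(in != NULL && out != NULL);

    while (*in) {
        unsigned char byte;
        int hi = -1, lo = -1;

        if (*in == '_') {                       /* "_" stands for a space */
            byte = ' ';
            in += 1;
        } else if (*in == '=' && (hi = hex_value(in[1])) >= 0
                              && (lo = hex_value(in[2])) >= 0) {
            byte = (unsigned char)(hi * 16 + lo);   /* "=XX" is one raw byte */
            in += 3;
        } else {
            byte = (unsigned char)*in;          /* anything else is literal */
            in += 1;
        }

        /* ISO-8859-1 bytes map 1:1 to U+0000..U+00FF, so bytes >= 0x80
         * become a two-byte UTF-8 sequence. */
        if (byte < 0x80) {
            if (o + 1 >= out_size) return -1;
            out[o++] = (char)byte;
        } else {
            if (o + 2 >= out_size) return -1;
            out[o++] = (char)(0xC0 | (byte >> 6));
            out[o++] = (char)(0x80 | (byte & 0x3F));
        }
    }
    if (o >= out_size) return -1;
    out[o] = '\0';
    return 0;
}
```

Applied to the payload `xxxx_=D6sterreich`, this yields `xxxx Österreich` in UTF-8. A folded filename such as the one above consists of two adjacent encoded-words; per RFC 2047 the whitespace between adjacent encoded-words is dropped and the decoded parts are concatenated.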

csb0730 commented 6 years ago

I think before or after https://github.com/deltachat/deltachat-core/blob/master/src/mrmimeparser.c#L1181 is the correct position to decode filename if required:

1181: mr_replace_bad_utf8_chars(desired_filename);

1183: do_add_single_file_part(ths, msg_type, mime_type, decoded_data, decoded_data_bytes, desired_filename);

But since the Subject: is decoded properly, there should already be a function somewhere that can be used to decode the filename. Simply use it here? :)

csb0730 commented 6 years ago

What about _mr_decode_headerstring() in mrtools.c ? It seems to do that :-)

csb0730 commented 6 years ago

@r10s: Did You see last comments?

r10s commented 6 years ago

@csb0730 yes, mr_decode_header_string() does this decoding.

playing around a bit: attaching a file with the name testäöü.txt in thunderbird gets encoded as

Content-Type: text/plain; charset=UTF-8;
 name="=?UTF-8?B?dGVzdMOkw7bDvC50eHQ=?="

and the name is decoded correctly in Delta Chat. decoding is done here: https://github.com/deltachat/deltachat-core/blob/master/src/mrmimeparser.c#L1167 and is fine.

wondering which app you have used at https://github.com/deltachat/deltachat-core/issues/162#issuecomment-388990448
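As an aside, the "B" variant in the Thunderbird example is plain base64 over the UTF-8 bytes of the filename. A minimal sketch of that step follows; the helper name `b_decode` is hypothetical, it is not the routine deltachat-core or libEtPan actually uses, and it assumes the caller has already stripped the `=?UTF-8?B?` prefix and `?=` suffix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Map one base64 character to its 6-bit value, or -1 if it is not part of
 * the standard alphabet. */
static int b64_value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/* Hypothetical helper: decode the base64 payload of a "B" encoded-word into
 * out and NUL-terminate it. Decoding stops at the first '=' padding
 * character. Returns the number of decoded bytes, or -1 on an invalid
 * character or if out is too small. */
int b_decode(const char *in, char *out, size_t out_size)
{
    unsigned int acc = 0;   /* bit accumulator */
    int bits = 0;           /* number of not yet emitted bits in acc */
    size_t o = 0;
    assert(in != NULL && out != NULL);

    for (; *in && *in != '='; in++) {
        int v = b64_value(*in);
        if (v < 0) return -1;
        acc = (acc << 6) | (unsigned int)v;
        bits += 6;
        if (bits >= 8) {            /* a full byte is available */
            bits -= 8;
            if (o + 1 >= out_size) return -1;
            out[o++] = (char)((acc >> bits) & 0xFF);
        }
    }
    if (o >= out_size) return -1;
    out[o] = '\0';
    return (int)o;
}
```

Feeding it the payload `dGVzdMOkw7bDvC50eHQ=` gives the 14 UTF-8 bytes of `testäöü.txt`. Because the charset is already UTF-8, no further conversion is needed in this case.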

r10s commented 6 years ago

okay, as you mentioned, when the filename came from Content-Disposition ... filename= the name was not decoded. added this.

csb0730 commented 6 years ago

Hi @r10s,

  1. These mails come from the Outlook 14 MUA.
  2. Note that, from an RFC 2047 perspective, there are two possible ways to encode the filename: "B" and "Q".

Above, you always referred to the "B" encoding. This seems to work. But the "Q" encoding is obviously not working!

So I recommend reopening this issue.
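To make the "B" versus "Q" distinction concrete: an encoded-word has the shape `=?charset?encoding?payload?=`, and the single letter after the charset selects the decoder. The sketch below only splits one encoded-word into its parts; `struct encoded_word` and `parse_encoded_word` are hypothetical names for illustration and are not taken from deltachat-core.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical view of one RFC 2047 encoded-word. The pointers refer into
 * the string that was passed to parse_encoded_word(). */
struct encoded_word {
    const char *charset;
    size_t      charset_len;
    char        encoding;      /* 'B' or 'Q', always upper-case */
    const char *payload;
    size_t      payload_len;
};

/* Returns 1 and fills *ew if s is a single well-formed encoded-word,
 * otherwise returns 0. */
int parse_encoded_word(const char *s, struct encoded_word *ew)
{
    const char *q1, *end;
    size_t len;
    char enc;
    assert(s != NULL && ew != NULL);

    len = strlen(s);
    if (len < 8 || strncmp(s, "=?", 2) != 0 || strcmp(s + len - 2, "?=") != 0)
        return 0;
    end = s + len - 2;                 /* start of the closing "?=" */

    q1 = strchr(s + 2, '?');           /* '?' that terminates the charset */
    if (q1 == NULL || q1 == s + 2 || q1 + 3 > end || q1[2] != '?')
        return 0;

    enc = q1[1];
    if (enc == 'b') enc = 'B';         /* the encoding letter is */
    if (enc == 'q') enc = 'Q';         /* case-insensitive       */
    if (enc != 'B' && enc != 'Q')
        return 0;

    ew->charset     = s + 2;
    ew->charset_len = (size_t)(q1 - (s + 2));
    ew->encoding    = enc;
    ew->payload     = q1 + 3;
    ew->payload_len = (size_t)(end - (q1 + 3));
    return 1;
}
```

A caller would then dispatch on `ew.encoding`: base64 for `'B'`, the `_` and `=XX` rules for `'Q'`, followed by a conversion from `ew.charset` to UTF-8.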

csb0730 commented 6 years ago

See my comment from 14 days ago referring to mr_decode_header_string(). I think this could be the right way, couldn't it?

csb0730 commented 6 years ago

I obviously missed the essential part in dd1b4fc! See my last comments as an additional explanation, but I think this issue is really closed now

;-)

r10s commented 6 years ago

I think this issue is really closed now

Great :)