Special characters in attachment filenames don't work

seisvelas commented 3 years ago

Steps to reproduce:

Create a file whose name contains a character not allowed in URL's, like what's_up?.txt
Send the file via Secure Compose
Download the file from the recipient inbox
Viola, the filename on your machine will be what's_up_.txt

My guess would be that this filename is decoded from a URL, but a cursory glance at the relevant code didn't immediately reveal anything, so I'm still unsure.

I'll look deeper into this but in the meantime any ideas/suggestions are appreciated :)

tomholub commented 3 years ago

We do remove some naughty characters when sending, but what's_up_.txt is not the best.

seisvelas commented 3 years ago

The easy solution is to just URL decode the filename, so that what's_up_.txt becomes what's_up_.txt. But this is naive because what if the original filename actually was intended to be what's_up_.txt. There's no way to distinguish.

Unless we surround special characters with some kind of signifier to know that whatever we see there ought to be decoded (kinda like how Common Lisp & Clojure use GENSYM macros to achieve good-enough hygiene).

But I think that's overkill and we should just URL decode them. Not many people intend a file name to have ' in it. And if users do complain about it, we can always implement hygiene later.

So basically, my solution is to URL decode filenames. Which should be really easy, so just let me know if you want me to go ahead and make a PR :)

tomholub commented 3 years ago

Let's have @rrrooommmaaa have a look later - he was working on filenames recently, and may have a better understanding then us two about the proper encoding and decoding. It seems to me that we may want to adjust both ends.

tomholub commented 3 years ago

@seisvelas Or :) if you want to take a look - see what other email clients are doing. What does Gmail or Thunderbird do, for example.

IvanPizhenko commented 3 years ago

Here's how Thunderbird on Linux sends such file:

--------------2BD8AF7FAFE1FD10FF4173DB
Content-Type: text/plain; charset=UTF-8;
 name="what's_up?.txt"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="what's_up?.txt"
...
--------------2BD8AF7FAFE1FD10FF4173DB--

And here is what Secure Compose does:

------sinikael-?=_1-16320706293100.10230587302860561
Content-Type: application/octet-stream; name="what's_up?.txt.pgp"
Content-Disposition: attachment; filename*0*=utf-8''what's_up%3F.txt.pgp
X-Attachment-Id: f_AMLYAEhORLSKqUdMxfiKFQKCXOTaUO@flowcrypt
Content-Id: <f_AMLYAEhORLSKqUdMxfiKFQKCXOTaUO@flowcrypt>
Content-Transfer-Encoding: base64
...

IvanPizhenko commented 3 years ago

In the my test, after secure compose, file name deviated even more than @seisvelas has indicated above. Here is what I have got: "what"'s_up%3F.txt

IvanPizhenko commented 3 years ago

I think it is produced by this function from email.js library: https://github.com/FlowCrypt/flowcrypt-browser/blob/2147b43d04eac7132c7b559905254cd042808b30/extension/lib/emailjs/emailjs-mime-codec.js#L647

It is used by MimeNode.prototype._buildHeaderValue(), which is described as "Joins parsed header value together as 'value; param1=value1; param2=value2'", which is used by MimeNode.prototype.build(), which "Builds the rfc2822 message from the current node.". All this is in the email.js library.

IvanPizhenko commented 3 years ago

continuationEncode has following following:

            var continuationEncodeChr = function(chr) {
                if (chr === '(') {
                    return '%28';
                } else if (chr === ')') {
                    return '%29';
                } else {
                    return encodeURIComponent(chr);
                }
            };

Call to encodeURIComponent() was here originally, but above was introduced by @rrrooommmaaa to fix issue #3352. Test case introduced in the #3357 reveals another interesting file name XX J 1 IT E (P 4) p_c.pdf. I will try to create file with similar name XX J 1 IT E (P 4) p_c.txt and send it from Thunderbird, and see how it will encode it.

IvanPizhenko commented 3 years ago

And here is content from Thunderbird:

--------------F41182E32C6B2C4A6F50A6E7
Content-Type: text/plain; charset=UTF-8;
 name="XX J 1 IT E (P 4) p_c.txt"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
 filename="XX J 1 IT E (P 4) p_c.txt"

SGVsbG8sIHdvcmxkIQo=
--------------F41182E32C6B2C4A6F50A6E7--

As you can see, there no special encoding at all, even for those ( and ).

@rrrooommmaaa @tomholub Please explain why is it needed to involve URL encoding here? (Thunderbird seems to do something different). From that there is another question - maybe #3357 has fixed wrong thing? Maybe whole idea of using encodeURIComponent() is wrong and it should be encoded somehow differently? Maybe we should look into source code of the Thunderbird or whatever third party library it uses and port to JS/TS from C or C++ exactly what it does? (or find some ready JS library which does it more correctly).

UPDATE I have read a bit RFC822, and it says: "Each header field can be viewed as a single, logical line of ASCII characters, comprising a field-name and a field-body." Since both (, ), ', ? are valid ASCII characters, they should not need any special encoding in the email header. Then why do we apply URL encoding?

UPDATE Read RFC 2231, seems like this URL encoding comes from there, but it seems to be something wrong with it. Maybe it is Gmail bug? Seems like no. In Thunderbird it was decoded as what 's_up%3F.txt.pgp.

IvanPizhenko commented 3 years ago

Here's what latest version of email.js does:

                'content-disposition': attachment.inline
                    ? 'inline'
                    : `attachment; filename="${mimeWordEncode(
                            attachment.name as string
                      )}"`,

where mimeWordEncode can be found here: https://github.com/eleith/emailjs/blob/99cf10fea8da904e71e8c6ceb740edf727e8716c/smtp/mime.ts#L186

IvanPizhenko commented 3 years ago

RFC 2231 says:

attribute-char := <any (US-ASCII) CHAR except SPACE, CTLs,
                     "*", "'", "%", or tspecials>

Then RFC 2045 says:

 tspecials :=  "(" / ")" / "<" / ">" / "@" /
                   "," / ";" / ":" / "\" / <">
                   "/" / "[" / "]" / "?" / "="

So seems like implementation is correct. But Thunderbird and Gmail don't understand it.

Maybe it is better to switch to RFC 2047 method?

encoded-word = "=?" charset "?" encoding "?" encoded-text "?="

   charset = token    ; see section 3

   encoding = token   ; see section 4

   token = 1*<Any CHAR except SPACE, CTLs, and especials>

   especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "
               <"> / "/" / "[" / "]" / "?" / "." / "="

   encoded-text = 1*<Any printable ASCII character other than "?"
                     or SPACE>
                  ; (but see "Use of encoded-words in message
                  ; headers", section 5)

tomholub commented 3 years ago

Thank you for all of these comments - I'll let you and Roman settle this together.

IvanPizhenko commented 3 years ago

@rrrooommmaaa So what do you think? How should we solve this? Please read about some investigation in the previous comments.

rrrooommmaaa commented 3 years ago

Please explain why is it needed to involve URL encoding here? (Thunderbird seems to do something different). ... As you can see, there no special encoding at all, even for those ( and ).

That depends on the receiver's client implementation (which may be buggy because continuation encoding isn't very simple and obvious) -- this is why I encoded ( and ) -- to allow it to be decoded properly on the receiver's subsystem. Moreover, I saw a sender's implementation (was it Gmail?) where EACH character was encoded as %XX -- this made the message header considerably bigger, but it is still better than guessing which particular character is not supported (probably ' in this issue). So you can try to send a message where each character in continuation-encoded filename is %-escaped and see whether this helps.

RFC2231 continuations https://datatracker.ietf.org/doc/html/rfc2231#section-3 is mainly used for long filenames etc. It's ok to use it, gmail uses it (though encoding every character as %XX if I remember correctly)

IvanPizhenko commented 3 years ago

Here we in fact have RFC2231, but it is not recognized by Gmail properly. I suggest to use "=XX" encoding, each character.

rrrooommmaaa commented 3 years ago

Here we in fact have RFC2231, but it is not recognized by Gmail properly. I suggest to use "=XX" encoding, each character.

Is this behaviour when downloading from gmail page?

IvanPizhenko commented 3 years ago

Here we in fact have RFC2231, but it is not recognized by Gmail properly. I suggest to use "=XX" encoding, each character.

Is this behaviour when downloading from gmail page?

@rrrooommmaaa In Thunderbird: File name shows up mostly correctly, but has extra space after ?: original file name was what's_up?.txt and Thunderbird gives me what's_up? .txt.pgp when I try to download it.

In Gmail web interface: Here is what I can see: When attempting to download it, the file name in the Save dialog appears to be "what"'s_up%3F.txt.

IvanPizhenko commented 2 years ago

@rrrooommmaaa Any further comment on this?

rrrooommmaaa commented 2 years ago

%XX for each character is acceptable. Some clients (Gmail?) do this, perhaps for a good reason.

FlowCrypt / flowcrypt-browser

Special characters in attachment filenames don't work #3505

Steps to reproduce: