We do remove some naughty characters when sending, but what's_up_.txt is not the best.
The easy solution is to just URL decode the filename, so that what's_up_.txt becomes what's_up_.txt. But this is naive: what if the original filename actually was intended to be what's_up_.txt? There's no way to distinguish.
Unless we surround special characters with some kind of signifier to know that whatever we see there ought to be decoded (kinda like how Common Lisp & Clojure use GENSYM macros to achieve good-enough hygiene).
But I think that's overkill and we should just URL decode them. Not many people intend a file name to have ' in it. And if users do complain about it, we can always implement hygiene later.
So basically, my solution is to URL decode filenames. Which should be really easy, so just let me know if you want me to go ahead and make a PR :)
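The ambiguity mentioned above can be made concrete. This is only a minimal sketch: it assumes the mangled name carries a percent-escape such as %27, which is an illustrative guess, since the exact escaped form is not visible in this thread. The variable names and the decode helper are hypothetical.

```javascript
// Hypothetical names for illustration only.
const mangledByUs = 'what%27s_up_.txt';     // assumed: an escaped form of "what's_up_.txt"
const intendedLiteral = 'what%27s_up_.txt'; // a user who really wanted "%27" in the name

// Blind decoding maps both inputs to the same output, so a literal "%27" is lost.
const decode = (name) => {
  try {
    return decodeURIComponent(name);
  } catch (e) {
    // Malformed escapes (for example "100%.txt") throw URIError; keep the name unchanged.
    return name;
  }
};

console.log(decode(mangledByUs));     // what's_up_.txt
console.log(decode(intendedLiteral)); // what's_up_.txt (the literal name cannot be recovered)
console.log(decode('100%.txt'));      // 100%.txt
```

The try/catch matters in practice: a filename with a stray % would otherwise make decoding throw.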
Let's have @rrrooommmaaa have a look later - he was working on filenames recently, and may have a better understanding than us two about the proper encoding and decoding. It seems to me that we may want to adjust both ends.
@seisvelas Or :) if you want to take a look - see what other email clients are doing. What does Gmail or Thunderbird do, for example.
Here's how Thunderbird on Linux sends such a file:
--------------2BD8AF7FAFE1FD10FF4173DB
Content-Type: text/plain; charset=UTF-8;
name="what's_up?.txt"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="what's_up?.txt"
...
--------------2BD8AF7FAFE1FD10FF4173DB--
And here is what Secure Compose does:
------sinikael-?=_1-16320706293100.10230587302860561
Content-Type: application/octet-stream; name="what's_up?.txt.pgp"
Content-Disposition: attachment; filename*0*=utf-8''what's_up%3F.txt.pgp
X-Attachment-Id: f_AMLYAEhORLSKqUdMxfiKFQKCXOTaUO@flowcrypt
Content-Id: <f_AMLYAEhORLSKqUdMxfiKFQKCXOTaUO@flowcrypt>
Content-Transfer-Encoding: base64
...
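For readers unfamiliar with the filename*0*= form in the Secure Compose header: in RFC 2231, a *N suffix marks a continuation section and a trailing * marks an extended value of the shape charset'language'percent-encoded-text. The sketch below shows how a lenient receiver might take such a parameter apart. parseExtendedParam is a hypothetical helper written for this comment, not a function from any library mentioned here, and it handles a single section only.

```javascript
// Split one RFC 2231 parameter, e.g.  filename*0*=utf-8''what's_up%3F.txt.pgp
// Hypothetical helper; handles one section and assumes the charset is UTF-8.
function parseExtendedParam(param) {
  const eq = param.indexOf('=');
  const rawName = param.slice(0, eq);
  const rawValue = param.slice(eq + 1);

  // name, optional "*<section>" (continuation), optional trailing "*" (extended value)
  const m = /^([^*]+)(?:\*(\d+))?(\*)?$/.exec(rawName);
  if (!m) throw new Error('not a parameter name: ' + rawName);
  const [, name, section, extended] = m;
  const sectionNo = section === undefined ? null : Number(section);

  if (!extended) return { name, section: sectionNo, value: rawValue };

  // Only the first two apostrophes are delimiters; later ones are treated as data.
  const first = rawValue.indexOf("'");
  const second = rawValue.indexOf("'", first + 1);
  if (first < 0 || second < 0) throw new Error('missing charset/language delimiters');

  return {
    name,
    section: sectionNo,
    charset: rawValue.slice(0, first),
    language: rawValue.slice(first + 1, second),
    value: decodeURIComponent(rawValue.slice(second + 1)),
  };
}

console.log(parseExtendedParam("filename*0*=utf-8''what's_up%3F.txt.pgp"));
```

A parser that is stricter about the unescaped apostrophe inside the value could behave differently, which may be relevant to the differences between clients seen in this thread.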
In my test, after Secure Compose, the file name deviated even more than @seisvelas indicated above. Here is what I got: "what"'s_up%3F.txt
I think it is produced by this function from the email.js library: https://github.com/FlowCrypt/flowcrypt-browser/blob/2147b43d04eac7132c7b559905254cd042808b30/extension/lib/emailjs/emailjs-mime-codec.js#L647
It is used by MimeNode.prototype._buildHeaderValue(), which is described as "Joins parsed header value together as 'value; param1=value1; param2=value2'" and which is in turn used by MimeNode.prototype.build(), which "Builds the rfc2822 message from the current node." All this is in the email.js library.
continuationEncode has the following:
var continuationEncodeChr = function(chr) {
  if (chr === '(') {
    return '%28';
  } else if (chr === ')') {
    return '%29';
  } else {
    return encodeURIComponent(chr);
  }
};
The call to encodeURIComponent() was there originally; the special handling of ( and ) above was introduced by @rrrooommmaaa to fix issue #3352.
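To see what this per-character function produces for the file names discussed here, the snippet below re-creates the same logic and applies it to a whole name. encodeByChar is a hypothetical wrapper for illustration; the library may drive the function differently. Keep in mind that encodeURIComponent leaves A-Z a-z 0-9 - _ . ! ~ * ' ( ) unescaped, so the apostrophe passes through untouched.

```javascript
// Same logic as the continuationEncodeChr snippet quoted above.
const continuationEncodeChr = (chr) => {
  if (chr === '(') return '%28';
  if (chr === ')') return '%29';
  return encodeURIComponent(chr);
};

// Hypothetical wrapper: encode a name one code point at a time.
const encodeByChar = (name) => Array.from(name).map(continuationEncodeChr).join('');

console.log(encodeByChar("what's_up?.txt"));            // what's_up%3F.txt
console.log(encodeByChar('XX J 1 IT E (P 4) p_c.pdf')); // XX%20J%201%20IT%20E%20%28P%204%29%20p_c.pdf
```

The first result matches the value seen in the Secure Compose header above: ? is escaped while ' is not.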
The test case introduced in #3357 reveals another interesting file name: XX J 1 IT E (P 4) p_c.pdf.
I will try to create a file with a similar name, XX J 1 IT E (P 4) p_c.txt, send it from Thunderbird, and see how it gets encoded.
And here is the content from Thunderbird:
--------------F41182E32C6B2C4A6F50A6E7
Content-Type: text/plain; charset=UTF-8;
name="XX J 1 IT E (P 4) p_c.txt"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="XX J 1 IT E (P 4) p_c.txt"
SGVsbG8sIHdvcmxkIQo=
--------------F41182E32C6B2C4A6F50A6E7--
As you can see, there is no special encoding at all, even for ( and ).
@rrrooommmaaa @tomholub Please explain why URL encoding is needed here. (Thunderbird seems to do something different.)
That raises another question: maybe #3357 fixed the wrong thing?
Maybe the whole idea of using encodeURIComponent() is wrong and the name should be encoded differently?
Maybe we should look into the source code of Thunderbird, or whatever third-party library it uses, and port exactly what it does from C or C++ to JS/TS? (Or find a ready-made JS library that does it more correctly.)
UPDATE
I have read a bit of RFC 822, and it says: "Each header field can be viewed as a single, logical line of ASCII characters, comprising a field-name and a field-body." Since (, ), ', and ? are all valid ASCII characters, they should not need any special encoding in the email header. Then why do we apply URL encoding?
UPDATE
I read RFC 2231, and it seems this URL encoding comes from there, but something seems to be wrong with it.
Maybe it is a Gmail bug? It seems not: in Thunderbird it was decoded as what 's_up%3F.txt.pgp.
Here's what the latest version of email.js does:
'content-disposition': attachment.inline
? 'inline'
: `attachment; filename="${mimeWordEncode(
attachment.name as string
)}"`,
where mimeWordEncode can be found here: https://github.com/eleith/emailjs/blob/99cf10fea8da904e71e8c6ceb740edf727e8716c/smtp/mime.ts#L186
RFC 2231 says:
attribute-char := <any (US-ASCII) CHAR except SPACE, CTLs,
"*", "'", "%", or tspecials>
Then RFC 2045 says:
tspecials := "(" / ")" / "<" / ">" / "@" /
"," / ";" / ":" / "\" / <">
"/" / "[" / "]" / "?" / "="
So it seems the implementation is correct, but Thunderbird and Gmail don't understand it.
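One detail may be worth double-checking against the grammar just quoted: attribute-char excludes the apostrophe, while encodeURIComponent leaves ' (and *) unescaped, so the emitted value what's_up%3F.txt.pgp contains a character that the grammar does not allow unescaped. Below is a sketch of an encoder that follows the quoted attribute-char rule literally. rfc2231EncodeValue and rfc2231Param are hypothetical names, UTF-8 is assumed, and long values are not split into continuation sections.

```javascript
// Characters allowed unescaped: US-ASCII except SPACE, CTLs, "*", "'", "%" and tspecials.
const ATTRIBUTE_CHAR = /^[A-Za-z0-9!#$&+\-.^_`|{}~]$/;

// Hypothetical helper: percent-encode every UTF-8 byte that is not an attribute-char.
function rfc2231EncodeValue(value) {
  let out = '';
  for (const b of new TextEncoder().encode(value)) {
    const ch = String.fromCharCode(b);
    if (b < 0x80 && ATTRIBUTE_CHAR.test(ch)) {
      out += ch;
    } else {
      out += '%' + b.toString(16).toUpperCase().padStart(2, '0');
    }
  }
  return out;
}

// Hypothetical helper: build a single (non-continued) extended parameter.
function rfc2231Param(name, value) {
  return `${name}*=utf-8''${rfc2231EncodeValue(value)}`;
}

console.log(rfc2231Param('filename', "what's_up?.txt"));
// filename*=utf-8''what%27s_up%3F.txt
```

Whether Gmail and Thunderbird decode this form correctly is not something this sketch can show; it would need to be tested against both clients.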
Maybe it is better to switch to RFC 2047 method?
encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
charset = token ; see section 3
encoding = token ; see section 4
token = 1*<Any CHAR except SPACE, CTLs, and especials>
especials = "(" / ")" / "<" / ">" / "@" / "," / ";" / ":" / "
<"> / "/" / "[" / "]" / "?" / "." / "="
encoded-text = 1*<Any printable ASCII character other than "?"
or SPACE>
; (but see "Use of encoded-words in message
; headers", section 5)
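For comparison, here is what an RFC 2047 encoded-word could look like for the same file name, using the B (base64) encoding. This is a sketch under assumptions: encodedWordB and decodeEncodedWordB are hypothetical helpers, Buffer is assumed to be available (Node.js), the charset is fixed to UTF-8, and the 75-character limit per encoded-word is ignored. Note also that RFC 2047 section 5 says an encoded-word must not be used in a parameter of a Content-Type or Content-Disposition field, so whether clients accept it in a filename parameter would need testing.

```javascript
// Hypothetical helper: wrap text as a single B-encoded word (no length splitting).
function encodedWordB(text) {
  const b64 = Buffer.from(text, 'utf8').toString('base64');
  return `=?UTF-8?B?${b64}?=`;
}

// Hypothetical helper: decode a single B-encoded word; the charset is assumed to be UTF-8.
function decodeEncodedWordB(word) {
  const m = /^=\?([^?]+)\?[Bb]\?([^?]*)\?=$/.exec(word);
  if (!m) throw new Error('not a B-encoded word: ' + word);
  return Buffer.from(m[2], 'base64').toString('utf8');
}

const word = encodedWordB("what's_up?.txt");
console.log(word);
console.log(decodeEncodedWordB(word)); // what's_up?.txt
```

Because base64 output never contains ?, the name's own question mark cannot clash with the encoded-word delimiters.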
Thank you for all of these comments - I'll let you and Roman settle this together.
@rrrooommmaaa So what do you think? How should we solve this? Please read about some investigation in the previous comments.
Please explain why is it needed to involve URL encoding here? (Thunderbird seems to do something different). ... As you can see, there no special encoding at all, even for those ( and ).
That depends on the receiver's client implementation (which may be buggy, because continuation encoding isn't very simple or obvious). This is why I encoded ( and ): to allow the name to be decoded properly by the receiver's subsystem.
Moreover, I saw a sender's implementation (was it Gmail?) where EACH character was encoded as %XX. This made the message header considerably bigger, but it is still better than guessing which particular character is not supported (probably ' in this issue).
So you can try to send a message where each character in the continuation-encoded filename is %-escaped and see whether this helps.
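The experiment suggested here could be tried with something like the following. It is a minimal sketch: percentEncodeAll is a hypothetical helper, UTF-8 is assumed, and the result would still need to be placed in the filename*= parameter by the MIME builder.

```javascript
// Hypothetical helper: escape every UTF-8 byte as %XX, with no exceptions.
function percentEncodeAll(value) {
  return Array.from(
    new TextEncoder().encode(value),
    (b) => '%' + b.toString(16).toUpperCase().padStart(2, '0')
  ).join('');
}

console.log(percentEncodeAll('up?')); // %75%70%3F
console.log(`filename*=utf-8''${percentEncodeAll("what's_up?.txt")}`);
```

Every byte becomes three characters, which is the header growth mentioned above; in exchange, no character depends on how a particular client treats it.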
RFC2231 continuations https://datatracker.ietf.org/doc/html/rfc2231#section-3 is mainly used for long filenames etc. It's ok to use it, gmail uses it (though encoding every character as %XX if I remember correctly)
Here we in fact have RFC2231, but it is not recognized by Gmail properly. I suggest to use "=XX" encoding, each character.
Is this behaviour when downloading from gmail page?
@rrrooommmaaa
In Thunderbird:
The file name shows up mostly correctly, but has an extra space after the ?: the original file name was what's_up?.txt and Thunderbird gives me what's_up? .txt.pgp when I try to download it.
In the Gmail web interface:
When attempting to download it, the file name in the Save dialog appears as "what"'s_up%3F.txt.
@rrrooommmaaa Any further comment on this?
%XX for each character is acceptable. Some clients (Gmail?) do this, perhaps for a good reason.
Steps to reproduce:
what's_up?.txt
what's_up_.txt
My guess would be that this filename is decoded from a URL, but a cursory glance at the relevant code didn't immediately reveal anything, so I'm still unsure.
I'll look deeper into this but in the meantime any ideas/suggestions are appreciated :)