Actual behaviour:
When Akka HTTP parses a Content-Disposition header, it follows RFC6266 and applies RFC5987 encoding to non-ASCII characters in its filename field.
I.e., when it receives a Content-Disposition header whose filename field contains non-ASCII characters, it generates a filename* field that carries the filename content encoded in UTF-8, as proposed in RFC5987.
In addition, it converts all non-ASCII characters in the original filename field to ?, following this RFC6266 recommendation:
When a "filename*" parameter is sent, to also generate a "filename" parameter as a fallback for user agents that do not support the "filename*" form, if possible. This can be done by substituting characters with US-ASCII sequences (e.g., Unicode character point U+00E4 (LATIN SMALL LETTER A WITH DIARESIS) by "ae"). Note that this may not be possible in some locales.
As a result, when I send a multipart request via curl with non-ASCII characters in the filename field, Akka HTTP modifies the filename while parsing the request: all non-ASCII characters are turned into ?. If my filename contained only non-ASCII characters, the resulting filename would be just ????.txt, regardless of whether the original was файл.txt or лайф.txt.
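To make the collision concrete, here is a minimal Python sketch of the two transformations described above. It is not Akka HTTP's code: the function names are hypothetical, and the parameter order and the choice to emit filename* only for non-ASCII names are assumptions made for illustration.

```python
from urllib.parse import quote

# Characters besides letters and digits that RFC5987 allows unescaped (attr-char).
RFC5987_SAFE = "!#$&+-.^_`|~"


def ascii_fallback(filename: str) -> str:
    """Replace every non-ASCII character with '?' (the lossy fallback)."""
    return "".join(ch if ord(ch) < 128 else "?" for ch in filename)


def rfc5987_ext_value(filename: str) -> str:
    """Build a filename* value: charset, empty language tag, percent-encoded UTF-8."""
    return "UTF-8''" + quote(filename, safe=RFC5987_SAFE, encoding="utf-8")


def render_params(filename: str) -> str:
    """Hypothetical rendering of both parameters (order is an assumption)."""
    params = f'filename="{ascii_fallback(filename)}"'
    if any(ord(ch) > 127 for ch in filename):
        params += f"; filename*={rfc5987_ext_value(filename)}"
    return params


# Two different names collapse to the same fallback value.
print(render_params("файл.txt"))
print(render_params("лайф.txt"))
```

A client that only reads filename sees ????.txt in both cases; only the filename* value still distinguishes the two files.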
The issue:
The latest HTML5 standard refers to RFC7578 for multipart/form-data, and RFC7578 strictly forbids the use of RFC5987 for the filename field of the Content-Disposition header in the form-data case:
NOTE: The encoding method described in [RFC5987], which would add a "filename*" parameter to the Content-Disposition header field, MUST NOT be used.
Instead, it proposes percent-encoding:
In most multipart types, the MIME header fields in each part are restricted to US-ASCII; for compatibility with those systems, file names normally visible to users MAY be encoded using the percent-encoding method in Section 2, following how a "file:" URI [URI-SCHEME] might be encoded.
And this percent-encoding is described this way:
Within this specification, "percent-encoding" (as defined in [RFC3986]) is offered as a possible way of encoding characters in file names that are otherwise disallowed, including non-ASCII characters, spaces, control characters, and so forth. The encoding is created replacing each non-ASCII or disallowed character with a sequence, where each byte of the UTF-8 encoding of the character is represented by a percent-sign (%) followed by the (case-insensitive) hexadecimal of that byte.
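The percent-encoding quoted above can be sketched in a few lines of Python using the standard library. The helper names are hypothetical, and escaping every character outside the unreserved set (including spaces and slashes) is an assumption about how strict an implementation would want to be.

```python
from urllib.parse import quote, unquote


def percent_encode_filename(filename: str) -> str:
    """Encode each byte of the UTF-8 form of a disallowed character as %XX.

    With safe="" only ASCII letters, digits and "_.-~" are left untouched.
    """
    return quote(filename, safe="", encoding="utf-8")


def percent_decode_filename(value: str) -> str:
    """Reverse the encoding on the receiving side."""
    return unquote(value, encoding="utf-8")


print(percent_encode_filename("файл.txt"))  # %D1%84%D0%B0%D0%B9%D0%BB.txt
print(percent_encode_filename("лайф.txt"))  # %D0%BB%D0%B0%D0%B9%D1%84.txt
```

Unlike the ? placeholder, the encoded values stay distinct and can be decoded back to the original names.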
Some clients follow this standard, so they no longer expect a filename* field, since it is strictly forbidden, and they expect the filename field to be percent-encoded. If non-ASCII characters in filename are simply replaced with a generic placeholder, this can cause problems: any file whose name consists of 4 non-ASCII characters appears as just ???? to these clients.
Proposals:
Unfortunately, there is no standard approach to solving this issue. Other libraries, such as http4s and Play Framework, use slightly different approaches. I think there are generally two ways to improve the situation:
1. Use the RFC7578 approach
This would probably be the "right" thing to do, but it is fairly dangerous, because it would break backwards compatibility for legacy clients that rely on the filename* field, i.e. the RFC5987 approach. That is clearly not the desired outcome.
2. Keep filename*, but apply percent-encoding to filename
This would still violate RFC7578, which says MUST NOT about using the RFC5987 encoding method. However, it would at least unblock clients that expect the filename field to be percent-encoded.
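Keeping filename* while percent-encoding filename could look roughly like the following Python sketch. The function name, the parameter order, and the assumption that the form field name is plain ASCII without quotes are all hypothetical; this is an illustration of the idea, not a proposed Akka HTTP implementation.

```python
from urllib.parse import quote


def render_form_data_disposition(field_name: str, filename: str) -> str:
    """Render a form-data Content-Disposition value with both parameters.

    filename carries the percent-encoded name (RFC7578 style), and
    filename* is still emitted for non-ASCII names so that legacy clients
    relying on RFC5987 keep working.
    """
    encoded = quote(filename, safe="", encoding="utf-8")
    value = f'form-data; name="{field_name}"; filename="{encoded}"'
    if any(ord(ch) > 127 for ch in filename):
        value += f"; filename*=UTF-8''{encoded}"
    return value


print(render_form_data_disposition("file", "файл.txt"))
```

Both parameters end up carrying the same percent-encoded bytes, so a client reading either one can recover the original name.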
I am not sure which approach is best. This probably deserves some community discussion to figure out the best way to move forward and resolve the issue.
The Akka HTTP code that applies this RFC6266 fallback: https://github.com/akka/akka-http/blob/7638ab4ea515904c2edb9444eee7549aea982f51/akka-http-core/src/main/scala/akka/http/scaladsl/model/headers/headers.scala#L491-L499