akka / akka-http

The Streaming-first HTTP server/module of Akka
https://doc.akka.io/docs/akka-http

`Content-Disposition` header parsing for `form-data` doesn't follow RFC7578 #4410

Open akatashev opened 3 months ago

akatashev commented 3 months ago

Actual behaviour:

When Akka HTTP parses a Content-Disposition header, it follows RFC6266 and applies RFC5987 encoding to non-ASCII characters in its filename field.

I.e., when it gets a Content-Disposition header whose filename field contains non-ASCII characters, it generates a filename* field that carries the filename percent-encoded as UTF-8, as proposed in RFC5987.

In addition to this, it converts all non-ASCII characters to ? in the original filename field, following this RFC6266 recommendation:

When a "filename*" parameter is sent, to also generate a "filename" parameter as a fallback for user agents that do not support the "filename*" form, if possible. This can be done by substituting characters with US-ASCII sequences (e.g., Unicode character point U+00E4 (LATIN SMALL LETTER A WITH DIARESIS) by "ae"). Note that this may not be possible in some locales.

https://github.com/akka/akka-http/blob/7638ab4ea515904c2edb9444eee7549aea982f51/akka-http-core/src/main/scala/akka/http/scaladsl/model/headers/headers.scala#L491-L499

As a result, if I send a multipart request with non-ASCII characters in the filename field via curl, curl itself sends something like this:

Content-Disposition: form-data; name="test0"; filename="my_файл_123!.txt"\r\n

And when Akka HTTP parses it, it modifies it this way:

Content-Disposition: form-data; filename="my_????_123!.txt"; filename*=UTF-8''my_%D1%84%D0%B0%D0%B9%D0%BB_123!.txt; name="test0"

Notice that all non-ASCII characters were turned into ?. If my filename contained only non-ASCII characters, the resulting filename would be just ????.txt, regardless of whether it was файл.txt or лайф.txt.
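The rewriting described above can be reproduced in a few lines. This is a minimal Python sketch of the observed behaviour only, not Akka HTTP's actual implementation (which is the Scala code linked above). The helper names `ascii_fallback`, `rfc5987_value` and `render_like_akka` are hypothetical and exist purely for illustration.

```python
from urllib.parse import quote

# attr-char extras from RFC 5987 that stay literal; quote() already leaves
# ASCII letters, digits and "_.-~" unescaped.
ATTR_CHAR_EXTRAS = "!#$&+-.^_`|~"


def ascii_fallback(filename: str) -> str:
    """Replace every non-ASCII character with '?', as seen in the parsed header."""
    return "".join(ch if ord(ch) < 128 else "?" for ch in filename)


def rfc5987_value(filename: str) -> str:
    """Build the ext-value carried by the filename* parameter."""
    return "UTF-8''" + quote(filename, safe=ATTR_CHAR_EXTRAS, encoding="utf-8")


def render_like_akka(name: str, filename: str) -> str:
    """Hypothetical helper that mimics the rewritten header shown above."""
    return (
        f'Content-Disposition: form-data; filename="{ascii_fallback(filename)}"; '
        f'filename*={rfc5987_value(filename)}; name="{name}"'
    )


print(render_like_akka("test0", "my_файл_123!.txt"))
# Both names collapse to the same fallback, so this prints True.
print(ascii_fallback("файл.txt") == ascii_fallback("лайф.txt"))
```

The first line printed matches the rewritten header quoted above, and the second shows why a client that only reads filename cannot tell the two files apart.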

The issue:

The latest HTML5 standard says:

For details on how to interpret multipart/form-data payloads, see RFC 7578. [RFC7578]

And RFC7578 strictly forbids the use of RFC5987 for the filename field of the Content-Disposition header in the form-data case:

NOTE: The encoding method described in [RFC5987], which would add a "filename*" parameter to the Content-Disposition header field, MUST NOT be used.

Instead it proposes to use percent-encoding:

In most multipart types, the MIME header fields in each part are restricted to US-ASCII; for compatibility with those systems, file names normally visible to users MAY be encoded using the percent-encoding method in Section 2, following how a "file:" URI [URI-SCHEME] might be encoded.

And this percent-encoding is described this way:

Within this specification, "percent-encoding" (as defined in [RFC3986]) is offered as a possible way of encoding characters in file names that are otherwise disallowed, including non-ASCII characters, spaces, control characters, and so forth. The encoding is created replacing each non-ASCII or disallowed character with a sequence, where each byte of the UTF-8 encoding of the character is represented by a percent-sign (%) followed by the (case-insensitive) hexadecimal of that byte.
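As a rough illustration of that description, here is a minimal Python sketch that percent-encodes a filename byte by byte. The function name `percent_encode_filename` is hypothetical, and the choice of which ASCII characters count as "disallowed" (here: space, control characters, the double quote, the backslash and the percent sign itself) is an assumption, since RFC7578 does not enumerate them.

```python
from urllib.parse import unquote

# Assumption: besides non-ASCII, spaces and control bytes, also escape the
# double quote, backslash and percent sign so the quoted value stays unambiguous.
_DISALLOWED = frozenset(b'"\\%')


def percent_encode_filename(filename: str) -> str:
    """Encode each disallowed UTF-8 byte as %XX and keep the rest literal."""
    out = []
    for byte in filename.encode("utf-8"):
        if 0x21 <= byte <= 0x7E and byte not in _DISALLOWED:
            out.append(chr(byte))
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


encoded = percent_encode_filename("my_файл_123!.txt")
print(encoded)           # my_%D1%84%D0%B0%D0%B9%D0%BB_123!.txt
print(unquote(encoded))  # my_файл_123!.txt
# Unlike the '?' placeholder, distinct names stay distinct, so this prints True.
print(percent_encode_filename("файл.txt") != percent_encode_filename("лайф.txt"))
```

A receiver that follows RFC7578 can recover the original name with an ordinary percent-decoder, which is what `unquote` does here.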

Some clients follow this standard, so they no longer expect a filename* field, since it is strictly forbidden, and they expect to see percent-encoding in the filename field. If non-ASCII characters in filename are just replaced with a generic placeholder, it can cause issues, because any file whose name consists of 4 non-ASCII characters would be just ???? for these clients.

Proposals:

Unfortunately, there is no standard approach to solving this issue. Other libraries, like http4s, playframework, etc., use slightly different approaches. I think that generally there are two ways to improve the situation:

Use RFC7578 approach

It would probably be the "right" thing to do, but it is fairly dangerous, because it would break backwards compatibility for legacy clients that rely on the filename* field, i.e. the RFC5987 approach. That is clearly not a desired outcome.

Keep filename*, but apply percent-encoding to filename

This would still violate RFC7578, which says MUST NOT about using the RFC5987 encoding method. It would, however, at least unblock clients that expect the filename field to be percent-encoded.

But I am not sure which approach would be the best. Probably it deserves some community discussion to figure out the best way to move forward and resolve the issue.
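To make the difference between the two proposals concrete, the following Python sketch renders the header both ways. It is only an illustration under stated assumptions: the function `content_disposition` and its `keep_filename_star` flag are hypothetical, the parameter order is arbitrary, and the same RFC5987 attr-char safe set is reused for both parameters for simplicity.

```python
from urllib.parse import quote

# Assumption: reuse the RFC 5987 attr-char extras as the literal set for both
# parameters; quote() already keeps ASCII letters, digits and "_.-~".
SAFE = "!#$&+-.^_`|~"


def content_disposition(name: str, filename: str, keep_filename_star: bool) -> str:
    """Render a form-data header with a percent-encoded filename.

    keep_filename_star=False corresponds to the pure RFC7578 proposal,
    keep_filename_star=True to the proposal that also keeps filename*.
    """
    encoded = quote(filename, safe=SAFE, encoding="utf-8")
    params = [f'name="{name}"', f'filename="{encoded}"']
    if keep_filename_star:
        params.append(f"filename*=UTF-8''{encoded}")
    return "Content-Disposition: form-data; " + "; ".join(params)


print(content_disposition("test0", "my_файл_123!.txt", keep_filename_star=False))
print(content_disposition("test0", "my_файл_123!.txt", keep_filename_star=True))
```

In the second variant both parameters carry the same percent-encoded bytes, and filename* only adds the `UTF-8''` charset prefix.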

johanandren commented 3 months ago

Thanks for the detailed report.

I think we should go with percent-encoding, just like playframework decided to do, if I read that PR right.