Closed davechallis closed 5 years ago
This is related to https://github.com/actix/actix-web/blob/master/src/uri.rs#L35:
lazy_static! {
static ref DEFAULT_QUOTER: Quoter = { Quoter::new(b"@:", b"/+") };
}
+
is a protected character when decoding %
-encoded strings.
It was introduced in #182
Was this an intentional change or a bug introduced? If deliberate, it's definitely one area where actix-web differs from almost all other web frameworks I've used, where all percent encoded strings in paths are decoded.
Currently it means that there's no way to differentiate between a %
-encoded +
, and a %
-encoded %
followed by the characters 2B
, e.g. using the server example above, both of these queries give an identical response from path decoding:
$ curl '127.0.0.1:8080/%2B'
%2B
$ curl '127.0.0.1:8080/%252B'
%2B
So manually adding an additional decoding step would be needed to get the correct path from the 1st example, but then it would give the wrong result for the 2nd example.
@mockersf Do you think there are risks if we would stop retaining encoded +
?
Looking at the code linked to, I guess the same issue applies to quoting of /
too, e.g. there's no way to differentiate between:
$ curl 'localhost:8080/%2F'
%2F
$ curl 'localhost:8080/%252F'
%2F
Relevant RFC : https://tools.ietf.org/html/rfc3986
There is an "unreserved" character set: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
(https://tools.ietf.org/html/rfc3986#section-2.3)
From https://tools.ietf.org/html/rfc3986#section-2.4:
When a URI is dereferenced, the components and subcomponents
significant to the scheme-specific dereferencing process (if any)
must be parsed and separated before the percent-encoded octets within
those components can be safely decoded, as otherwise the data may be
mistaken for component delimiters. The only exception is for
percent-encoded octets corresponding to characters in the unreserved
set, which can be decoded at any time.
So this would mean we should:
Path
extractor, decode remaining percent encoded charactersDo you agree with my reading of the RFC?
@mockersf Thanks for RFC links, I'll read it later on.
But with your suggestion it would imply that we would need to re-allocate path again, after we routed it Which may not be always desirable performance wise.
@mockersf yup, that looks right. There's also the alternative option of decoding the unreserved set characters alongside the rest of the characters, but that really depends on how routing is done (e.g. if done by looking at split URI parts, then decoding can be done later, if done using regex, then unreserved chars need to be decoded earlier).
@DoumanAsh aside from performance, I think it should be considered a bug in the current implementation, as there's no way to reliably obtain path components as provided in all cases.
This occurred in a real life use case for me recently, when trying to port a simple cache service from python to actix-web.
The service encodes full URLs as a path component, so to get a cached version of e.g.:
http://localhost:80/foo
the cache service (which I'm porting to actix-web) is queried with:
http://cache-service/http%3A%2F%2Flocalhost%3A80%2Ffoo
Actix-web currently decodes this path component as:
http:%2F%2Flocalhost:80%2Ffoo
In this simple case, doing additional manual decoding worked fine. The problem occurs when the original URL also contains URL encoded characters, e.g. when the original URL is something like:
http://localhost:80/file/%2Fvar%2Flog%2Fsyslog
This is then URL encoded to use as a path component, so the actix-web cache service is queried with:
http://cache-service/http%3A%2F%2Flocalhost%3A80%2Ffile%2F%252Fvar%252Flog%252Fsyslog%0A
(i.e. the URL encoded chars from the original URL are encoded again)
Actix-web decodes that path component as:
http:%2F%2Flocalhost:80%2Ffile%2F%2Fvar%2Flog%2Fsyslog
which is not the original URL I provided. If I manually decode again, then I still end up with an incorrect URL:
http://localhost:80/file//var/log/syslog
So it's currently impossible to give any valid encoded path to actix-web and reliably get it back out again, it's only possible if actix-web itself always fully decodes all characters as part of its path parsing.
@mockersf Sorry, I re-read you suggestion again, and the thing about Path
extractor makes sense I guess as it would be optional. Just user would need to be aware of this implication.
I still haven't rad RFC as I'm at work, but I'll go to it after work. My opinion is that we should follow RFC overall. So your proposal makes ense
User opinion on ' '
in routes would be welcome here https://github.com/actix/actix-web/pull/577#issuecomment-437551810
Path extractor now follows RFC #577
When decoding using the
Path
extractor, if the path contains%2B
, then it's extracted as is, instead of decoding to a+
character.E.g. using a minimal example server:
Then querying with e.g. the following URL encoded characters
@£$%^&+=
(%40%C2%A3%24%25%5E%26%2B%3D
) results in all of the except for%2B
being decoded by actix:Expected:
@£$%^&+=