URL encoded '+' character (%2B) not decoded by path extractor

davechallis commented 6 years ago

When decoding using the Path extractor, if the path contains %2B, then it's extracted as is, instead of decoding to a + character.

E.g. using a minimal example server:

extern crate actix;
extern crate actix_web;
use actix_web::{Path, server, App, FromRequest, HttpRequest};

fn index(req: &HttpRequest) -> String { Path::<String>::extract(req).unwrap().into_inner() }

fn main() {
    let sys = actix::System::new("foo");
    server::new(|| App::new() .resource("/{path}", |r| r.f(index)))
        .bind("127.0.0.1:8080").unwrap().start();
    let _ = sys.run();
}

Then querying with e.g. the following URL encoded characters @£$%^&+= (%40%C2%A3%24%25%5E%26%2B%3D) results in all of the except for %2B being decoded by actix:

$ curl '127.0.0.1:8080/%40%C2%A3%24%25%5E%26%2B%3D'
@£$%^&%2B=

Expected: @£$%^&+=

mockersf commented 6 years ago

lazy_static! {
    static ref DEFAULT_QUOTER: Quoter = { Quoter::new(b"@:", b"/+") };
}

+ is a protected character when decoding %-encoded strings.

It was introduced in #182

davechallis commented 6 years ago

Was this an intentional change or a bug introduced? If deliberate, it's definitely one area where actix-web differs from almost all other web frameworks I've used, where all percent encoded strings in paths are decoded.

Currently it means that there's no way to differentiate between a %-encoded +, and a %-encoded % followed by the characters 2B, e.g. using the server example above, both of these queries give an identical response from path decoding:

$ curl '127.0.0.1:8080/%2B'
%2B

$ curl '127.0.0.1:8080/%252B'
%2B

So manually adding an additional decoding step would be needed to get the correct path from the 1st example, but then it would give the wrong result for the 2nd example.

DoumanAsh commented 6 years ago

@mockersf Do you think there are risks if we would stop retaining encoded +?

davechallis commented 6 years ago

Looking at the code linked to, I guess the same issue applies to quoting of / too, e.g. there's no way to differentiate between:

$ curl 'localhost:8080/%2F'
%2F

$ curl 'localhost:8080/%252F'
%2F

mockersf commented 6 years ago

Relevant RFC : https://tools.ietf.org/html/rfc3986

There is an "unreserved" character set: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" (https://tools.ietf.org/html/rfc3986#section-2.3) From https://tools.ietf.org/html/rfc3986#section-2.4:

   When a URI is dereferenced, the components and subcomponents
   significant to the scheme-specific dereferencing process (if any)
   must be parsed and separated before the percent-encoded octets within
   those components can be safely decoded, as otherwise the data may be
   mistaken for component delimiters.  The only exception is for
   percent-encoded octets corresponding to characters in the unreserved
   set, which can be decoded at any time.

So this would mean we should:

when receiving a URI, decode all percent-encoded characters from the unreserved set, and only those
use this decoded URI to match routes and extract path components
in the Path extractor, decode remaining percent encoded characters

Do you agree with my reading of the RFC?

DoumanAsh commented 6 years ago

@mockersf Thanks for RFC links, I'll read it later on.

But with your suggestion it would imply that we would need to re-allocate path again, after we routed it Which may not be always desirable performance wise.

davechallis commented 6 years ago

@mockersf yup, that looks right. There's also the alternative option of decoding the unreserved set characters alongside the rest of the characters, but that really depends on how routing is done (e.g. if done by looking at split URI parts, then decoding can be done later, if done using regex, then unreserved chars need to be decoded earlier).

@DoumanAsh aside from performance, I think it should be considered a bug in the current implementation, as there's no way to reliably obtain path components as provided in all cases.

This occurred in a real life use case for me recently, when trying to port a simple cache service from python to actix-web.

The service encodes full URLs as a path component, so to get a cached version of e.g.:

http://localhost:80/foo

the cache service (which I'm porting to actix-web) is queried with:

http://cache-service/http%3A%2F%2Flocalhost%3A80%2Ffoo

Actix-web currently decodes this path component as:

http:%2F%2Flocalhost:80%2Ffoo

In this simple case, doing additional manual decoding worked fine. The problem occurs when the original URL also contains URL encoded characters, e.g. when the original URL is something like:

http://localhost:80/file/%2Fvar%2Flog%2Fsyslog

This is then URL encoded to use as a path component, so the actix-web cache service is queried with:

http://cache-service/http%3A%2F%2Flocalhost%3A80%2Ffile%2F%252Fvar%252Flog%252Fsyslog%0A

(i.e. the URL encoded chars from the original URL are encoded again)

Actix-web decodes that path component as:

http:%2F%2Flocalhost:80%2Ffile%2F%2Fvar%2Flog%2Fsyslog

which is not the original URL I provided. If I manually decode again, then I still end up with an incorrect URL:

   http://localhost:80/file//var/log/syslog

So it's currently impossible to give any valid encoded path to actix-web and reliably get it back out again, it's only possible if actix-web itself always fully decodes all characters as part of its path parsing.

DoumanAsh commented 6 years ago

@mockersf Sorry, I re-read you suggestion again, and the thing about Path extractor makes sense I guess as it would be optional. Just user would need to be aware of this implication.

I still haven't rad RFC as I'm at work, but I'll go to it after work. My opinion is that we should follow RFC overall. So your proposal makes ense

DoumanAsh commented 6 years ago

User opinion on ' ' in routes would be welcome here https://github.com/actix/actix-web/pull/577#issuecomment-437551810

DoumanAsh commented 5 years ago

Path extractor now follows RFC #577

actix / actix-web

URL encoded '+' character (%2B) not decoded by path extractor #556