ERL-637: `uri_string:parse/1` rejects an URL which is accepted in Erlang 20 #3599

Closed by OTP-Maintainer 3 years ago

OTP-Maintainer commented 6 years ago

Original reporter: dumbbell
Affected version: OTP-21.0
Component: stdlib
Migrated from: https://bugs.erlang.org/browse/ERL-637


The URL {{http://localhost/?param=^(?=^reg)}} used to be accepted by {{httpc:request/4}}:

{code}
1> httpc:request(get, {"http://localhost/?param=^(?=^reg)", []}, [], []).
{error,{failed_connect,[{to_address,{"localhost",80}},
                        {inet,[inet],econnrefused}]}}
{code}

However, it is now rejected by {{uri_string:parse/1}}:

{code}
1> httpc:request(get, {"http://localhost/?param=^(?=^reg)", []}, [], []).
** exception error: no function clause matching uri_string:parse({error,invalid_uri,":"}) (uri_string.erl, line 337)
     in function  httpc:request/5 (httpc.erl, line 179)
{code}

Removing the {{^}} characters "solves" the issue:

{code}
2> httpc:request(get, {"http://localhost/?param=(?=reg)", []}, [], []). 
{error,{failed_connect,[{to_address,{"localhost",80}},
                        {inet,[inet],econnrefused}]}}
{code}

I will look into this tomorrow (CEST timezone) if no one else beats me to it.
OTP-Maintainer commented 6 years ago

dumbbell said:

So {{^}} is an _unsafe_ character because it is neither a _reserved_ nor an _unreserved_ character, and thus should be percent-encoded by the URL producer. I couldn't find anything in RFC 3986, but an answer on Stack Exchange (https://meta.stackexchange.com/a/69371) indicates that this is explained in RFC 1738, an RFC which is updated by RFC 3986.
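
As an illustration of encoding on the producer side, here is a minimal sketch that builds the query with {{uri_string:compose_query/1}} (available since OTP 21) instead of writing the raw value into the URL. The module name {{query_sketch}} and the function {{build_url/1}} are hypothetical, and the host and parameter name are simply taken from the report. Note that {{compose_query/1}} uses form-urlencoding, so it also encodes characters such as {{(}} and {{=}} inside the value and turns spaces into {{+}}.

```erlang
-module(query_sketch).
-export([build_url/1]).

%% Build the URL from the report with a percent-encoded query.
%% Value is the raw, unencoded parameter value, e.g. "^(?=^reg)".
%% Assumption: Value is a plain string, so compose_query/1 returns a
%% string rather than an error tuple.
build_url(Value) ->
    %% compose_query/1 encodes key and value as
    %% application/x-www-form-urlencoded, so "^" never appears literally.
    Query = uri_string:compose_query([{"param", Value}]),
    "http://localhost/?" ++ Query.
```

The resulting string can then be passed to {{httpc:request/4}} or {{uri_string:parse/1}} like any other URI.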

This new parser in Erlang 21 is stricter and that's fine with us :-) I will update our testsuite.

The question is: should this change of behavior be documented?
OTP-Maintainer commented 6 years ago

peterdmv said:

The problem here is that the character "^" is outside of the US-ASCII character set and the parser considers it invalid. In my interpretation this is aligned with the standard:

*RFC3986, 1.2.1:*
_In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification.  Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced._

*RFC3986, 2.1*
_A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component._

To apply proper percent-encoding you can use the recompose function:

{code}
> uri_string:recompose(#{scheme => "http", host => "localhost", path => "/", query => "param=^(?=^reg)"}).
"http://localhost/?param=%5E(?=%5Ereg)"
{code}

{{uri_string:parse/1}} can split this URI into components:

{code}
> uri_string:parse("http://localhost/?param=%5E(?=%5Ereg)").
#{host => "localhost",path => "/",
  query => "param=%5E(?=%5Ereg)",scheme => "http"}
{code}

{{uri_string:normalize/2}} can parse it and transform it back into its original Unicode representation in one go:

{code}
> uri_string:normalize("http://localhost/?param=%5E(?=%5Ereg)", [return_map]).
#{host => "localhost",path => "/",query => "param=^(?=^reg)",
  scheme => "http"}
{code}
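
Since {{uri_string:parse/1}} returns an {{{error, Reason, Detail}}} tuple for input it considers invalid, a caller can check the URI before handing it to {{httpc}} and get an error value instead of the {{function_clause}} crash shown in the report. The following is only a sketch: the module {{uri_check_sketch}} and the function {{safe_get/1}} are hypothetical names, and it assumes the {{inets}} application has already been started.

```erlang
-module(uri_check_sketch).
-export([safe_get/1]).

%% Validate URI with uri_string:parse/1 before calling httpc, so an
%% invalid URI comes back as an error tuple instead of crashing.
%% Assumption: inets is already running (e.g. via inets:start()).
safe_get(URI) ->
    case uri_string:parse(URI) of
        {error, Reason, Detail} ->
            %% parse/1 reports invalid input as a three-element tuple
            {error, {invalid_uri, Reason, Detail}};
        Parsed when is_map(Parsed) ->
            httpc:request(get, {URI, []}, [], [])
    end.
```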
OTP-Maintainer commented 6 years ago

dumbbell said:

Yes, I agree with your comment. Sorry I erroneously posted my previous comment before it was complete...
OTP-Maintainer commented 6 years ago

dumbbell said:

As an additional data point, I tried this kind of URL with Firefox, Chromium and elinks: none of them percent-encoded {{^}}. I understand it's a bit different: they are web browsers, thus quite specific to HTTP (and FTP perhaps), whereas {{uri_string}} is a generic URI toolbox.
OTP-Maintainer commented 6 years ago

essen said:

The "^" character *is* part of US-ASCII. But it's a character that should be percent encoded when in path segments (but not in userinfo components, according to https://url.spec.whatwg.org/#percent-encoded-bytes).

That being said I'm not sure it's a good idea for httpc to refuse to perform requests that have "^" in a path segment. That's the server's job to reject these requests if necessary. Users may very well want to send a specific non-standard path that is expected by a broken server. Perhaps there's a non-strict option?
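
Until such an option exists, one possible workaround on the caller's side is a lenient pre-pass that percent-encodes only the bytes RFC 3986 never allows in literal form, and leaves delimiters and existing percent-escapes untouched. This is a hand-written sketch, not part of {{uri_string}} or {{httpc}}; the module {{lenient_uri_sketch}} and the function {{escape_unsafe/1}} are hypothetical names.

```erlang
-module(lenient_uri_sketch).
-export([escape_unsafe/1]).

%% Percent-encode every byte that is neither unreserved, nor reserved
%% (gen-delims / sub-delims), nor "%". The URI structure is left as is.
%% The input is converted to UTF-8 first so non-ASCII text is covered.
escape_unsafe(URI) ->
    Bin = unicode:characters_to_binary(URI),
    lists:flatten([escape_byte(B) || <<B>> <= Bin]).

%% ALPHA and DIGIT pass through unchanged.
escape_byte(B) when B >= $a, B =< $z;
                    B >= $A, B =< $Z;
                    B >= $0, B =< $9 ->
    B;
escape_byte(B) ->
    %% Remaining unreserved characters, reserved delimiters, and "%".
    case lists:member(B, "-._~:/?#[]@!$&'()*+,;=%") of
        true  -> B;
        false -> io_lib:format("%~2.16.0B", [B])
    end.
```

For the URL from the report, {{lenient_uri_sketch:escape_unsafe("http://localhost/?param=^(?=^reg)")}} yields {{"http://localhost/?param=%5E(?=%5Ereg)"}}. The pass cannot tell a stray {{%}} from a valid escape, so it is only a stopgap for known inputs.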
OTP-Maintainer commented 6 years ago

peterdmv said:

There is no such option as far as I know, but I agree that it might be necessary to have one in the near future. I am not sure whether such a change should be implemented in the uri_string module or in a new module that strictly follows the WHATWG URL standard. The big question here is whether we should replace httpc with a more capable HTTP client.