Closed OTP-Maintainer closed 3 years ago
dumbbell
said:
So {{^}} is an _unsafe_ character because it is neither a _reserved_ nor an _unreserved_ character, and thus should be percent-encoded by the URL producer. I couldn't find anything in RFC 3986, but an answer on Stack Exchange (https://meta.stackexchange.com/a/69371) indicates that this is explained in RFC 1738, an RFC which is updated by RFC 3986.
This new parser in Erlang 21 is stricter and that's fine with us :-) I will update our testsuite.
The question is: should this change of behavior be documented?
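For reference, the reserved/unreserved distinction can be sketched as a small standalone module. This is only an illustration under assumptions: the module name {{uri_chars}} and its functions are hypothetical and not part of {{uri_string}} or any OTP application. The character sets are copied from RFC 3986, sections 2.2 and 2.3.

```erlang
-module(uri_chars).
-export([classify/1, needs_encoding/1, percent_encode/1]).

%% Hypothetical helper module, for illustration only; not an OTP API.

%% RFC 3986, section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
is_unreserved(C) when C >= $a, C =< $z -> true;
is_unreserved(C) when C >= $A, C =< $Z -> true;
is_unreserved(C) when C >= $0, C =< $9 -> true;
is_unreserved(C) -> lists:member(C, "-._~").

%% RFC 3986, section 2.2: reserved = gen-delims / sub-delims
is_reserved(C) -> lists:member(C, ":/?#[]@!$&'()*+,;=").

%% Classify a single character as unreserved, reserved, or other
%% (neither set).
classify(C) ->
    case {is_unreserved(C), is_reserved(C)} of
        {true, _} -> unreserved;
        {_, true} -> reserved;
        _         -> other
    end.

%% A character in neither set has to be percent-encoded by the producer.
needs_encoding(C) -> classify(C) =:= other.

%% Percent-encode one octet as "%" followed by two uppercase hex digits.
percent_encode(C) when C >= 0, C =< 255 ->
    lists:flatten(io_lib:format("%~2.16.0B", [C])).
```

With this sketch, {{uri_chars:classify($^)}} falls into the {{other}} branch and {{uri_chars:percent_encode($^)}} gives the {{%5E}} form. It classifies single characters only and does not take into account which URI component the character appears in.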
peterdmv
said:
The problem here is that the character "^" is outside of the US-ASCII character set and the parser considers it invalid. In my interpretation this is aligned with the standard:
*RFC3986, 1.2.1:*
_In local or regional contexts and with improving technology, users might benefit from being able to use a wider range of characters; such use is not defined by this specification. Percent-encoded octets (Section 2.1) may be used within a URI to represent characters outside the range of the US-ASCII coded character set if this representation is allowed by the scheme or by the protocol element in which the URI is referenced._
*RFC3986, 2.1*
_A percent-encoding mechanism is used to represent a data octet in a component when that octet's corresponding character is outside the allowed set or is being used as a delimiter of, or within, the component._
To apply proper percent-encoding you can use the recompose function:
> uri_string:recompose(#{scheme => "http", host => "localhost", path => "/", query => "param=^(?=^reg)"}).
"http://localhost/?param=%5E(?=%5Ereg)"
_uri_string:parse_ can split this URI into components:
> uri_string:parse("http://localhost/?param=%5E(?=%5Ereg)").
#{host => "localhost",path => "/",
query => "param=%5E(?=%5Ereg)",scheme => "http"}
_uri_string:normalize_ can parse it and transform it into its original Unicode representation in one go:
> uri_string:normalize("http://localhost/?param=%5E(?=%5Ereg)", [return_map]).
#{host => "localhost",path => "/",query => "param=^(?=^reg)",
scheme => "http"}
dumbbell
said:
Yes, I agree with your comment. Sorry I erroneously posted my previous comment before it was complete...
dumbbell
said:
As an additional data point, I tried this kind of URL with Firefox, Chromium and elinks: none of them percent-encoded {{^}}. I understand it's a bit different: they are web browsers, thus quite specific to HTTP (and FTP perhaps), whereas {{uri_string}} is a generic URI toolbox.
essen
said:
The "^" character *is* part of US-ASCII. But it's a character that should be percent-encoded when in path segments (but not in userinfo components, according to https://url.spec.whatwg.org/#percent-encoded-bytes).
That being said I'm not sure it's a good idea for httpc to refuse to perform requests that have "^" in a path segment. That's the server's job to reject these requests if necessary. Users may very well want to send a specific non-standard path that is expected by a broken server. Perhaps there's a non-strict option?
peterdmv
said:
There is no such option as far as I know, but I agree that it might be necessary to have one in the near future. I'm not sure whether such a change should be implemented in the uri_string module or in a new module that strictly follows the WHATWG URL standard. The big question here is whether we should replace httpc with a more capable HTTP client.
Original reporter:
dumbbell
Affected version: OTP-21.0
Component: stdlib
Migrated from: https://bugs.erlang.org/browse/ERL-637