Open twifkak opened 6 years ago
https://tools.ietf.org/html/rfc3986#section-3.4 says query = *( pchar / "/" / "?" )
. Encoding a query string as a path component would extraneously escape /
and ?
. This seems fine.
In addition, per https://tools.ietf.org/html/rfc3986#section-3.3, pchar
may or may not contain :
, depending on context. Golang acts conservatively and always escapes it. This seems fine, too.
AFAICT, nothing remains unescaped that should be escaped.
https://play.golang.org/p/aVV6I6q-mPV demonstrates that >
in the host goes unescaped. This is definitely in violation of the ABNF for host at https://tools.ietf.org/html/rfc3986#section-3.2.2, and will likely break Link header parsers.
Go parser explicitly allows "<", and ">".
I wonder if a simple post processor that globally replaces '<', '>' on the result from url.String() will suffice? That would escape it throughout (host, path, query, fragment).
We can't escape the host value directly on the Url struct after Parse and before String because the pct-encoding itself gets encoded yet again. e.g. https://play.golang.org/p/hCNEKynIvXl
Wow, there's no way to represent a pct-encoded host in Go's url package? That's frustrating. Yeah, I guess a post-processor would work. Are there any other magic characters we need to worry about?
URLs get sanitized in a few different places: https://github.com/ampproject/amppackager/blob/73dec5d4648395b1ce47deb9d0065c2c719c8986/packager/signer/validation.go#L16 https://github.com/ampproject/amppackager/blob/73dec5d4648395b1ce47deb9d0065c2c719c8986/packager/signer/signer.go#L384 https://github.com/ampproject/amppackager/blob/73dec5d4648395b1ce47deb9d0065c2c719c8986/transformer/internal/amphtml/urls.go#L54
I need to do an audit of all of these, and hopefully unify some of the code where requirements are aligned.
Unfortunately, I'm really far away from being able to prioritize this. My WIP is at https://github.com/twifkak/amppackager/tree/url_escaping. I think the way forward is to implement our own URL parser (for the subset of the spec that we care about, e.g. http/https).
Some notes from my attempt:
?
, so we would need to keep some side state.Just a random drive-by comment without understanding the full scope. It might be possible to port https://github.com/ampproject/amphtml/blob/master/validator/js/engine/parse-url.js fairly easily (it's fairly minimal and doesn't have any javascript-specific bits or dependencies).
Thanks for the tip. I'm not familiar with parse-url.js but its test set looks pretty light compared to the branching factor of RFC3986, so I fear it may have some edge case bugs, too. I'd prefer to start from RFC3986 (or a library that hews closely to it) as a base. Note that the SXG inner = outer requirement drives, I think, stricter requirements than is needed by the AMP JS validator. (Edit: also the content sniffing concern.)
I think I would avoid the whatwg URL spec because that tries to map to what browsers do (including all sorts of semantic-breaking transformations that better meet expectations of users typing non-compliant URLs).
My recollection is that libcurl is somewhere in-between those two specs.
Fair enough, drive-by comment retracted. :)
Neither url.Parse() nor url.String() sanitizes the query component. See e.g. https://play.golang.org/p/8hsD2WeMVYD
190 and #192 fix some cases resulting from this, but there may be other edge cases that fall out from this quick fix. Investigate https://golang.org/src/net/url/url.go?s=7976:8008#L96 and see how it compares to what we should be doing per https://tools.ietf.org/html/rfc3986.