ampproject / amppackager

Tool to improve AMP URLs via Signed Exchanges
https://amp.dev/documentation/guides-and-tutorials/optimize-and-measure/signed-exchange/
Apache License 2.0
140 stars 55 forks source link

Figure out proper URL-escaping once and for all #191

Open twifkak opened 6 years ago

twifkak commented 6 years ago

Neither url.Parse() nor url.String() sanitizes the query component. See e.g. https://play.golang.org/p/8hsD2WeMVYD

190 and #192 fix some cases resulting from this, but there may be other edge cases that fall out from this quick fix. Investigate https://golang.org/src/net/url/url.go?s=7976:8008#L96 and see how it compares to what we should be doing per https://tools.ietf.org/html/rfc3986.

twifkak commented 5 years ago

https://tools.ietf.org/html/rfc3986#section-3.4 says query = *( pchar / "/" / "?" ). Encoding a query string as a path component would extraneously escape / and ?. This seems fine.

In addition, per https://tools.ietf.org/html/rfc3986#section-3.3, pchar may or may not contain :, depending on context. Golang acts conservatively and always escapes it. This seems fine, too.

AFAICT, nothing remains unescaped that should be escaped.

twifkak commented 5 years ago

https://play.golang.org/p/aVV6I6q-mPV demonstrates that > in the host goes unescaped. This is definitely in violation of the ABNF for host at https://tools.ietf.org/html/rfc3986#section-3.2.2, and will likely break Link header parsers.

alin04 commented 5 years ago

Go parser explicitly allows "<", and ">".

https://golang.org/src/net/url/url.go?s=2590:2650#L100

alin04 commented 5 years ago

I wonder if a simple post processor that globally replaces '<', '>' on the result from url.String() will suffice? That would escape it throughout (host, path, query, fragment).

We can't escape the host value directly on the Url struct after Parse and before String because the pct-encoding itself gets encoded yet again. e.g. https://play.golang.org/p/hCNEKynIvXl

twifkak commented 5 years ago

Wow, there's no way to represent a pct-encoded host in Go's url package? That's frustrating. Yeah, I guess a post-processor would work. Are there any other magic characters we need to worry about?

twifkak commented 5 years ago

URLs get sanitized in a few different places: https://github.com/ampproject/amppackager/blob/73dec5d4648395b1ce47deb9d0065c2c719c8986/packager/signer/validation.go#L16 https://github.com/ampproject/amppackager/blob/73dec5d4648395b1ce47deb9d0065c2c719c8986/packager/signer/signer.go#L384 https://github.com/ampproject/amppackager/blob/73dec5d4648395b1ce47deb9d0065c2c719c8986/transformer/internal/amphtml/urls.go#L54

I need to do an audit of all of these, and hopefully unify some of the code where requirements are aligned.

twifkak commented 4 years ago

Unfortunately, I'm really far away from being able to prioritize this. My WIP is at https://github.com/twifkak/amppackager/tree/url_escaping. I think the way forward is to implement our own URL parser (for the subset of the spec that we care about, e.g. http/https).

Some notes from my attempt:

Gregable commented 4 years ago

Just a random drive-by comment without understanding the full scope. It might be possible to port https://github.com/ampproject/amphtml/blob/master/validator/js/engine/parse-url.js fairly easily (it's fairly minimal and doesn't have any javascript-specific bits or dependencies).

twifkak commented 4 years ago

Thanks for the tip. I'm not familiar with parse-url.js but its test set looks pretty light compared to the branching factor of RFC3986, so I fear it may have some edge case bugs, too. I'd prefer to start from RFC3986 (or a library that hews closely to it) as a base. Note that the SXG inner = outer requirement drives, I think, stricter requirements than is needed by the AMP JS validator. (Edit: also the content sniffing concern.)

I think I would avoid the whatwg URL spec because that tries to map to what browsers do (including all sorts of semantic-breaking transformations that better meet expectations of users typing non-compliant URLs).

My recollection is that libcurl is somewhere in-between those two specs.

Gregable commented 4 years ago

Fair enough, drive-by comment retracted. :)