Open annevk opened 3 years ago
Looking at this again, and in particular at https://html.spec.whatwg.org/#fetch-a-classic-script, I think the simplest option here is to pass the encoding along with the request. We would then need to abstract or duplicate these steps (and maybe improve them while we're at it, especially getting the charset parameter from the Content-Type header):
- If response's Content Type metadata, if any, specifies a character encoding, and the user agent supports that encoding, then set character encoding to that encoding (ignoring the passed-in value).
- Let source text be the result of decoding response's body to Unicode, using character encoding as the fallback encoding.
- Let script be the result of creating a classic script given source text, settings object, response's url, options, and muted errors.
And then, if script's record is null, parsing failed.
@domenic does that seem right to you?
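For illustration, the first two steps quoted above can be sketched roughly as follows. This is only a sketch under assumptions, not the spec algorithm: the function names (charsetFromContentType, isSupportedEncoding, sniffBom, decodeScriptSource) are hypothetical, the charset extraction is a simplified regular expression rather than a full MIME type parser, and "the user agent supports that encoding" is approximated by whether TextDecoder accepts the label. A byte order mark, if present, takes precedence, which is why the chosen encoding is only a fallback.

```javascript
// Simplified extraction of the charset parameter from a Content-Type value.
// Assumption: a real implementation would use a proper MIME type parser.
function charsetFromContentType(contentType) {
  if (!contentType) return null;
  const match = /;\s*charset=("?)([^";\s]+)\1/i.exec(contentType);
  return match ? match[2] : null;
}

// Approximates "the user agent supports that encoding":
// TextDecoder throws a RangeError for labels it does not know.
function isSupportedEncoding(label) {
  try {
    new TextDecoder(label);
    return true;
  } catch {
    return false;
  }
}

// Returns the encoding indicated by a leading byte order mark, or null.
function sniffBom(bytes) {
  if (bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf) {
    return "utf-8";
  }
  if (bytes.length >= 2 && bytes[0] === 0xfe && bytes[1] === 0xff) return "utf-16be";
  if (bytes.length >= 2 && bytes[0] === 0xff && bytes[1] === 0xfe) return "utf-16le";
  return null;
}

// bytes: Uint8Array holding the response body.
// contentType: the response's Content-Type header value, or null.
// passedInEncoding: the encoding passed along with the request.
function decodeScriptSource(bytes, contentType, passedInEncoding) {
  let encoding = passedInEncoding;

  // Step 1: a supported charset from Content-Type overrides the passed-in value.
  const charset = charsetFromContentType(contentType);
  if (charset !== null && isSupportedEncoding(charset)) {
    encoding = charset;
  }

  // Step 2: decode, using `encoding` only as the fallback when no BOM is present.
  // TextDecoder strips a BOM that matches its own encoding by default.
  const bomEncoding = sniffBom(bytes);
  return new TextDecoder(bomEncoding ?? encoding).decode(bytes);
}

// Usage: the header's charset wins over the passed-in "utf-8".
const body = new Uint8Array([0x61, 0x00]); // "a" in UTF-16LE
console.log(decodeScriptSource(body, "text/javascript; charset=utf-16le", "utf-8")); // prints "a"
```

The third step (creating the classic script and checking whether its record is null) is left out, since it depends on the host's script parser.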
I don't have the full context on what security guarantees we're trying to preserve here (is it bad to leak information about the Content-Type header?) but in terms of a spec refactoring, that seems reasonable.
> and maybe improve them while we're at it, especially getting the charset parameter from the Content-Type header
Basically every usage of "Content-Type metadata" in HTML could be improved by using the new MIME type getter, I think.
One risk here is that the attacker has control over the encoding, so this technically gives them more opportunity to find a way to get something parsed as JavaScript. In practice it still seems hard to parse as JavaScript as the majority of significant bytes are in the ASCII range.
I included a fix for this in https://github.com/whatwg/fetch/pull/1442 which I think works. The HTML side will need to set it on requests, but that's a very straightforward change.
And while it is unfortunate that the fallback encoding is in the hands of the attacker, this is no different from the status quo.
I forgot that the response itself also carries encoding-related information. https://github.com/whatwg/fetch/pull/1447 tackles the first part of that. Once that lands it should be easy to call from Fetch's ORB PR.
We might not always have an encoding, e.g., fetch(..., { mode: "no-cors" }). Is it reasonable to always use UTF-8 for this check?