apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance
MIT License
3.79k stars 353 forks source link

Cannot use protocol-relative URL in script src attribute #531

Closed ronosm closed 2 years ago

ronosm commented 2 years ago

It's not possible to add a protocol-relative URL in the src attribute from script tag.

At naughtyHref function it is handled, but after that this code is executed:

const parsed = new URL(value);

This code generates an exception when a protocol-relative URL is used, generating an exception making allowed value as false:

} catch (e) { allowed = false; }

Finally, it makes the src attribute is removed from script tag.

This happens between lines 325 and 355.

gmarinov commented 2 years ago

+1

here:

https://github.com/apostrophecms/sanitize-html/blob/3cdc262/index.js#L333-L338

while new URL('//my.url') breaks without protocol, i don't think it should be up to sanitize-html to enforce the scheme of URLs when no allowedScriptHostnames is defined?

i.e. const parsed = new URL(value); should move inside the if block?

https://github.com/apostrophecms/sanitize-html/blob/3cdc262/index.js#L340-L348

edit: that probably applies to iframe src too

tag @yorickgirard

boutell commented 2 years ago

Is this happening in the browser only? I believe the WHATWG URL parser in nodejs supports it.

A small subclass of the URL class to work around this issue probably wouldn't be difficult to contribute as a PR. The idea being to stub in the https: protocol but then stub it out again in toString if it was stubbed in.

boutell commented 2 years ago

(As a PR on this module that is. Modifying URL upstream is unrealistic of course.)

gmarinov commented 2 years ago

@boutell , I guess what I'm saying is that in the absence of explicit allow-list of domains, it shouldn't even try to police URLs.

$ node
Welcome to Node.js v16.14.0.
Type ".help" for more information.

> new URL('//google.com')
Uncaught TypeError [ERR_INVALID_URL]: Invalid URL
    at __node_internal_captureLargerStackTrace (node:internal/errors:464:5)
    at new NodeError (node:internal/errors:371:5)
    at onParseError (node:internal/url:552:9)
    at new URL (node:internal/url:628:5) {
  input: '//google.com',
  code: 'ERR_INVALID_URL'
}

> new URL('ftp://google.com');
URL {
  href: 'ftp://google.com/',
  origin: 'ftp://google.com',
  protocol: 'ftp:',
  username: '',
  password: '',
  host: 'google.com',
  hostname: 'google.com',
  port: '',
  pathname: '/',
  search: '',
  searchParams: URLSearchParams {},
  hash: ''
}
> 
boutell commented 2 years ago

I don't think it would be a good idea to accept invalid URLs, but I agree that protocol relative URLs should not be considered invalid, at least by default, at least not yet. This is why I'm suggesting using a subclass wrapper for URL that accepts that particular case without reimplementing URL.

boutell commented 2 years ago

(The lack of support for protocol relative URLs in Node 16 does increase the priority on our end but it's not something we can immediately schedule here as we don't have a client requirement for it right now.)

On Tue, Feb 22, 2022 at 9:17 AM Georgi Marinov @.***> wrote:

@boutell https://github.com/boutell , I guess what I'm saying is that in the absence of explicit allow-list of domains, it shouldn't even try to police URLs.

$ node Welcome to Node.js v16.14.0. Type ".help" for more information.

new URL('//google.com') Uncaught TypeError [ERR_INVALID_URL]: Invalid URL at __node_internal_captureLargerStackTrace (node:internal/errors:464:5) at new NodeError (node:internal/errors:371:5) at onParseError (node:internal/url:552:9) at new URL (node:internal/url:628:5) { input: '//google.com', code: 'ERR_INVALID_URL' }

new URL('ftp://google.com'); URL { href: 'ftp://google.com/', origin: 'ftp://google.com', protocol: 'ftp:', username: '', password: '', host: 'google.com', hostname: 'google.com', port: '', pathname: '/', search: '', searchParams: URLSearchParams {}, hash: '' }

— Reply to this email directly, view it on GitHub https://github.com/apostrophecms/sanitize-html/issues/531#issuecomment-1047842026, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAH27P7462HKMFC27FIE4TU4OLH5ANCNFSM5M57AQYA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

--

THOMAS BOUTELL | CHIEF TECHNOLOGY OFFICER APOSTROPHECMS | apostrophecms.com | he/him/his

gmarinov commented 2 years ago

it's just that till a few updates ago, sanitize-html didn't have this issue, right?

and in the browser, where the sanitized html ends up, protocol-relative URLs are not invalid.

boutell commented 2 years ago

src has always been passed through naughtyHref no matter what the tag is. I think what we're seeing is that Node 16 now has the same strict policy on protocol relative URLs that is enforced by Safari. A more tolerant subclass of Url would resolve it.