WICG / sanitizer-api

https://wicg.github.io/sanitizer-api/
Other
226 stars 31 forks source link

Proposed API: MathML, SVG support, plus localname case-handling. #103

Open otherdaniel opened 3 years ago

otherdaniel commented 3 years ago

Seperate issue, because this tries to pull together several existing issues in one.

The current problems are:

WDYT?


I find the case-handling based on the namespace to be slightly surprising, but this matches what HTML parser / DOM does. If we embrace that, I think we the rest of the API is fairly straight-forward.

securityMB commented 3 years ago
  • In all places where we accept element names -- allow lists or block lists, or elements in attribute allow/block lists -- we will accept pseudo-namespaced element names "html:name", "svg:name", "math:name", or "*:name". The latter can also be written as "name". That is, any namespace is the default.

After thinking about it for a little while, I see some problems with using : as a namespace separator:

  1. Even though nobody probably does that, : can be used in a tag name. For instance <abc:div> will create an element called abc:div in the DOM tree.
  2. The fact that a single string can contain both namespace and tag name makes it a little bit less convenient to type, for instance in TypeScript. If we used another syntax, for instance: ['html', 'name'] or {tagName:'name', namespace:'html'} (but this one is probably too verbose?), then IDEs could give suggestions for developers really easy.

Furthermore, I am not sure that any namespace should be the default. The last few bypasses of DOMPurify stemmed from the fact that some elements were created in unexpected namespace (like form in MathML namespace).

otherdaniel commented 3 years ago

Thanks, this is good feedback!

Andrew-Cottrell commented 3 years ago

I think "html:" as the default would be least surprising for a majority of developers; and also the safer option. It may not be necessary to have "html:" as an explicit option; the Sanitizer API could only have no prefix, "svg:", "math:", and "*:". I'm not sure "*:" is absolutely necessary either and might eventually become a lint target if overused (like specifying "*" for targetOrigin in the postMessage API).

It might be useful to be able to block "svg:*" and "math:*".

securityMB commented 3 years ago
  • Colons in names: I don't think we should re-invent (or re-implement) "full" namespaces. I think we should pick a handful of prefixes with special meaning. Then, if all 3 namespaces browsers actually process have a prefix, one can specify even weird names with a colon in them. Doing so is awkward, but as that is probably a rather exceptional use case that'd be okay.

Agreed on "full" namespaces. My argument was just about the separator that we use. I saw some tests in web platform that used space as a separator (for instance: svg animate means animate in SVG namespace) which seems fine, because you cannot have a space in tag name.

  • TypeScript / IDEs: I'm not sure I get this. Is the issue that by fusing the namespace into the string makes it opaque to type systems and IDEs? What's a case where that'd make a difference? (Off-hand, it seems that a structure with a namespace designator and a seperate local name would be pretty awkward to use, for little benefit in the majority case.)

Yeah, I'm not that entirely sure on importance of this one. My argument was that if you make a typo in the namespace, for instance: "svf:animate", then you'll see the mistake only on runtime. If you split the namespace and tag name (for instance: ["svf", "animate"]), then IDE can spot the mistake right away because the namespace part can be typed to 'svg' | 'html' | 'math'.

'any' as default: This is an excellent point. For a block-list my proposal makes sense, but for an allow-list... not so much. I wonder if HTML could be the default instead - given that that's probably the 99% use case - or if this means there is no meaningful default.

My proposal is to:

otherdaniel commented 2 years ago

Looks like we've let this linger for a while... let's have a new go at it.

After reading the feedback here and there, and discussing this with @mozfreddyb, I'd like to - for now - prefer simplicity over expressiveness, so that can start with a simple proposal and extend it as concrete use cases emerge.

The cornerstones are:

================================

This would be the proposal:

Andrew-Cottrell commented 2 years ago
  • No namespace designator defaults to "html" for elements, and to none for attributes.

    • Alternative: Drop the "html" prefix, or the default, so that there's a unique way to designate HTML elements.
  • These rules form a 1:1 relationship between config strings and namespace/elements, except for HTML element, which have a 2:1 relationship. The getConfiguration getters normalize HTML strings to their non-prefixed form.

    • Alternative: I'm confident we should have a prescribed normalization, but I'm unsure which way it should go. I'm fine with either prefixed/non-prefixed.

To me, it seems simpler to drop the "html" prefix, which would ensure a 1:1 relationship in all cases and resolve getConfiguration normalization. But I think the strongest reason to drop the "html" prefix is that would reflect how people currently read & write HTML. If the "html" prefix is retained but not required, I expect most authors would not use it.

It would be good to see specified use cases or other reasons for retaining the "html" prefix, which may suggest answers to the indicated alternatives quoted above.

Andrew-Cottrell commented 2 years ago
  • Element/attribute names with a namespace separator but no valid namespace designator in front of it are an error and are dropped.
    • Alternative: We could also throw an exception.

Given the indeterminate lifetime of HTML a new namespace designator may eventually be needed, so it may be more backwards compatible (e.g. newer code in an older browser) to drop rather than throw. I expect static analysis tools or runtime validation libraries will be developed if invalid configuration becomes a problem such that people want to verify. However, there might be security reasons to throw rather than drop.

mozfreddyb commented 2 years ago

That's great, thanks for writing that up. I mostly agree, except this one thing:

* For convenience, the config gets an allowXXX boolean setting for each namespace, to allow users to turn of e.g. all of SVG without having to re-write their config entirely.

  * Alternative: We might not do this at all, since the naming rules make the config easily filter-able.
  * Alternative: Instead of several boolean-valued config items we could also have one config item which takes a string set and re-use the namespace designators.
  * I'm not sure what the default should be, but I lean towards allow only HTML elements + non-namepsaces attributes by default.

I don't like the idea of adding lots of boolean settings. Instead, I suggest we provide static constants on the sanitizer that allow building and combining lists. We can bikeshed on the names, I don't have strong feelings, but something along the lines of Sanitizer.ALLOWED_HTML_ELEMENTS, Sanitizer.ALLOWED_SVG_ELEMENTS would help implementing your use cases in an (imho) clearer way.

otherdaniel commented 2 years ago

To me, it seems simpler to drop the "html" prefix, which would ensure a 1:1 relationship in all cases and resolve getConfiguration normalization. But I think the strongest reason to drop the "html" prefix is that would reflect how people currently read & write HTML. If the "html" prefix is retained but not required, I expect most authors would not use it.

It would be good to see specified use cases or other reasons for retaining the "html" prefix, which may suggest answers to the indicated alternatives quoted above.

My intuition was that if everyhing else has a prefix then so should HTML, but also that requiring an HTML prefix is an awful lot of extra typing. Admittedly, that's a rather weak argument, and dropping the html prefix would certainly make things simpler.

Given the indeterminate lifetime of HTML a new namespace designator may eventually be needed, so it may be more backwards compatible (e.g. newer code in an older browser) to drop rather than throw. I expect static analysis tools or runtime validation libraries will be developed if invalid configuration becomes a problem such that people want to verify. However, there might be security reasons to throw rather than drop.

This is very true. We should drop (rather than throw). Especially since .getConfiguration() allows developers to check what the browser actually made out of their config.

I don't like the idea of adding lots of boolean settings. Instead, I suggest we provide static constants on the sanitizer that allow building and combining lists. We can bikeshed on the names, I don't have strong feelings, but something along the lines of Sanitizer.ALLOWED_HTML_ELEMENTS, Sanitizer.ALLOWED_SVG_ELEMENTS would help implementing your use cases in an (imho) clearer way.

Yes, that'd also work. I wonder how many 'real' use cases there are for this. I'm guessing HTML-only and everything are common, which can be easily covered by presets.

otherdaniel commented 2 years ago

I'm currently looking at how to spec this. I initially thought I'd go with whitespace as separator, as discussed above, but https://html.spec.whatwg.org/multipage/syntax.html#attributes-2 already specifies a character-by-character representation of all namespaced attributes allowed in HTML, in the table at the end of the subsection. And that uses a colon. I suspect that re-inventing our own representation is not a good idea then.

In either case, I'll update the issue when I have a review-quality PR ready.

otherdaniel commented 2 years ago

Hello again, I've now uploaded PR #137, which drafts supports for SVG and MathML.

I'm not super happy with the result, so it'd be fantastic if people could offer some opinions on whether this is going in the right directions, and how to improve it.

In particular:

ju1ius commented 2 years ago

Hi @otherdaniel,

I've just read the spec draft and first let me thank you for your work because I think this is something much needed for the platform !

Now WRT namespaces, there is an existing syntax that could be reused here: CSS type selectors.

The advantages of using this syntax would be:

The following example would block bar elements in the urn:foo namespace, baz elements in any namespace, qux elements without a namespace and gizmo elements in the default namespace, in this case that of the document element:

const sanitizer = new Sanitizer({
  defaultNamespace: document.documentElement.namespaceURI,
  namespaces: {
    foo: 'urn:foo',
  },
  blockElements: ['foo|bar', '*|baz', '|qux', 'gizmo'],
})

If the syntax were to be defined in terms of CSS type selectors, this would allow to gradually introduce support for other CSS selectors into the API, so that things like this would become possible:

const sanitizer = new Sanitizer({
  dropElements: [
    'a[target="_blank"]',
    'iframe:not([src^="https://www.youtube.com"])',
  ],
})

What do you think ?

Andrew-Cottrell commented 2 years ago

I think using a subset of the CSS selector syntax is a really interesting idea. My main concern with supporting a large subset would be the serializing rules. But many people are using CSS selectors with DOM APIs and things generally seem to work well (although greater use of [CSS.escape](https://www.w3.org/TR/cssom-1/#the-css.escape()-method) would probably help). In any case, I would be happy using | as the pseudo-namespace delimiter.

ju1ius commented 2 years ago

@Andrew-Cottrell I don't get why CSS serialization rules would be an issue here. Could you elaborate on this?

Andrew-Cottrell commented 2 years ago

I don't get why CSS serialization rules would be an issue here. Could you elaborate on this?


const badSanitizer = new Sanitizer({
dropElements: [
'div.123abc' // incorrectly serialized, but an easy mistake to make
]
});

const goodSanitizer = new Sanitizer({ dropElements: [ 'div.\31 23abc' // correctly serialized, could also use: 'div.' + CSS.escape('123abc') ] });



This probably isn't a serious problem in practice, but I'm slightly concerned with the ease of this mistake in a security context.
ju1ius commented 2 years ago

Ah I see, indeed the spec would need to define what to do in case of an invalid selector. Whether to throw a SyntaxError DOM exception, silently ignore it or whatever would make the most sense in a security-sensitive context.

Currently, the spec just says to removes element names that were normalized to null from the allow lists, so the same should probably be done in case of an invalid selector...

mozfreddyb commented 1 year ago

triage: Let's keep this one open to ensure we have alignment on a v1 list of allowed elements (html, svg, mathml)

mozfreddyb commented 5 months ago

Should be moot with #208 landed.

annevk commented 5 months ago

Do we have an SVG and MathML safelist? Is that tracked anywhere else? That's the main thing I can still see missing.

bkardell commented 4 months ago

Do we have an SVG and MathML safelist? Is that tracked anywhere else? That's the main thing I can still see missing.

It seems that MathML isn't even mentioned in the spec currently? SVG is here, but that mention only points to the SVG namespace (which is right below the MathML namespace in that doc :)). Is there a reason it wasn't included?