`baseline` should be `default` and `default` should be `baseline`

Most other sanitizers have a default list that is according to some standard (an opinionated default, a default that cares about XSS and nothing else, ...). Most other sanitizers also allow configurations in which developers may re-allow dangerous and XSS-y markup.

We aim for the Sanitizer API to not allow configurations that might reintroduce XSS. For that reason we've had a couple of "list categories" namely. We have one of each category for attributes and elements. The categories are as follows:

baseline: The baseline lists for elements/attributes contain every known element that can not introduce XSS. Even if the element is able to lead to other attacks (form hijacking, DOM Clobbering etc.). This is our safety net. If you want to allow an element through the configuration, it needs to be in the baseline.
default: These lists are an opinionated subset of the baseline, they existed with the intent of preventing some further attacks. We never fully specified which use cases we want to include or exclude in our default :(
known: This is a last line of defense for configurations with allowUnknownElements: true. The intent is to allow configurations that include a foobar element which is obviously non-existant but also harmless. We check every unknown element to be actually unknown, as allowUnknownElements: true should not be used to re-introduce XSSy elements like script.

Lately, we've been discussing what the default is actually defending against and we could not really come up with a hard line.

My suggestion is to therefore remove the distinction between baseline and default. We should, however introduce threat-model specific configurations that are less "opinionated" and orient along hard facts. E.g., a configuration that disallows form hijacking & DOM Clobbering on top of the baseline.

In a background conversation, @otherdaniel was in favor of this but wanted to think about it some more before moving on. We're curious to hear other opinions, but I hope not to dwell on this for very long.

When I created the original sanitizer in Mozilla, it was for the purpose of Thunderbird sanitizing HTML emails. Thunderbird still has that feature and uses it in more and more places, to counter security bugs.

The goals are:

Remove any possibility at all to ever have a critical security hole. This specifically also included security holes not already known. The sanitizer was presuming that known holes would be fixed in the engine, but that other holes appear regularly (monthly), and the goal was stop those future holes as well. For that goal, any complex browser feature that is not absolutely necessary in email would be disabled.
- Any and all forms of JS or other scripts had to be blocked, and any vector how scripts might be sneaking in. That was one of the primary goals of the sanitizer. This was a double-protection in addition to disabling JS in that iframe.
- It also means blocking video/audio decoders, because they are optimized for performance and not security and the security team was playing "whack-a-mole" with critical security bugs in media decoders.
- The only concession was to allow (inline) images, even despite the possibility of buffer overflow bugs and similar bugs in the image decoders. But images are so important, and bugs in image decodes were relatively rare, that this was an accepted risk.
- Indeed, this policy has served very well over the years and has stopped many many browser security holes from working when the sanitizer was active.
Disable remote fetches, for privacy reasons
- Rationale: Spammers, advertizers, and nosy email senders can use "phone home" pings to track whether the email reached the inbox, and whether, when, where (IP address), and how often it was read. This is a serious privacy violation.
- All remote content needs to be removed: (remote) images, external CSS, fonts, and any other kind of remote content. Everything that might possibly trigger a HTTP call needs to be removed.
Remove obnoxious formatting by the sender.
- Some senders want to style their outgoing email and use large fonts, or small fonts, or blue-on-grey colors, or other formatting, simply because they find it pretty. But the reader might have impair eye sight and cannot read small fonts or low contrast, or might simply want to read his email in a consistent form for efficiency reasons.
- Therefore, there was an option to remove all style/presentational formatting.
- Allow semantic structure like headings, bulleted lists, emphasis (strong/em/bold/italics) etc, these should remain, and rendered using the browser defaults.

Part 3 is surely highly opinionated and should be optional. Parts 1 and 2 are essential to the purpose of the sanitizer in Thunderbird.

I would think that most use cases for a sanitizer would be:

webmail rendering emails
rendering user-generated content, e.g. discussion forums, chat, ratings, and similar cases where untrusted content comes in, but should include some limited formatting like bold and bulleted lists.

I would think that almost all of these use cases want very strong assurances for parts 1 and 2, and some might also want part 3.

I therefore think that the default config for the sanitizer should satisfy the requirements in Part 1 and 2 above, and that Part 3 should be optional and very easy to add.

This basic config should then be adaptable fine-grained on a per tag/attribute or similar level to allow or deny specific aspects, as discussed in #181 . E.g. if a use case needs to allow MP4 videos in user comments, it would be able to specifically allow that, but still otherwise use the default config.

Mostly agree. The idea behind default-vs-baseline was that default could be a bit more opinionated. We don't seem to have a good definition of which opinions those should be, though.

The main issue I have with making baseline the default and thus removing the distinction entirely, is an ambition to make Sanitizer API easy to use. The feedback I've been getting comes primarily from professional security folks, who have a good understanding of whichever threat model they're working under. For these people a clearly defined, XSS-only baseline that they can build on is exactly what they want. They have no interest in "opinionated". But Regular Joe that makes a webpage probably doesn't even have a threat model, but certainly deserves a secure website as much as anyone. This clientele would likely benefit from a more "complete" default configuration. I'm unsure how to resolve these constraints.

nitpick: known isn't a defense against anything. That's just a mechanism to un-ambiguously specify allowUnknownMarkup, rather than leaving it implementation defined or defined with some weasel-wording that is difficult to parse, as in previous drafts.

Looking at the API through a security engineer lens, I don't see a good way to offer a default setting that would be different than baseline.

The difficulty with providing an implicit opinionated setting that would be a good enough fit for all is that it's likely a quickly moving target, given how HTML evolves over time. The ways of executing JS from markup don't change as often as a way of styling content, fetching resources, or even parsing inline file formats (e.g. <img src="data:image/newthing">) - and in my opinion API stability in terms of generating the same output from the same input is also a property that developers expect, and may even rely on (e.g. through golden files in tests). If the default sanitizer behavior would often change, the API gets more difficult to test, polyfill, there's more opportunity of breaking the web apps in subtle ways.

That said, I think there is a valid use case for configuring the sanitizer to do something else than preventing XSS, in a convenient way. Application developer should not need to know all 30+ knobs that can disable remote fetches - but I think it's better if this is an explicit decision left to the developer, and not the default, because there security is always in a context of a threat model - so the developers should specify what flavor of risks they want to mitigate.

Even for the cases listed here in the thread, I already see that the Thunderbird sanitizer has significant differences to sanitizer that Gmail uses. Webmail is very likely a complex type of application for which the sanitizer rules follow from the product features (e.g. Gmail rewrites image links to transcode the images, hide the viewer IP addresses etc. Both external images and forms in email are a feature many UAs support, sometimes it's guarded by user preferences. Some UAs only support inline styles, some rewrite the CSS to make sure that it can't affect other parts of the UI. Some like static images, but not animations. And so on...). I simply don't think a Web API can ever propose a Webmail-ready sanitizer setting.

It's probably much easier to offer a setting that removes forms, anything that would cause a fetch, or only simple markup. To me it feels like the actual use case and popular settings are best to be fleshed out in user code and only then moved to the platform, but if we converge on popular settings that we are confident would be useful in existing applications, that's great too - but it feels like a v2 thing?

quickly moving target, given how HTML evolves over time

... and developers cannot be expected to constantly keep up with these developments and adapt their code. That's why it's important that these preset profiles are constantly adapted as new features are added to the browser or new security or privacy risks are being discovered. That's the job that the sanitizer API has to do for the developer. This should be the default. Default is secure, in even in the future, in face of newly discovered risks.

The developer just says "I don't trust this HTML. So, sanitizer, please remove anything potentially dangerous from it. I don't care whether it still looks the same as the original. I strongly prefer security over faithfulness of original." So, yes, the output of the sanitizer will and has to change over time, as new risks are discovered.

If a dev wants a very specific set of filtering, and does not want it to change over time, then he should opt for a very low bar (much less secure) as baseline and then use the specific allow/deny feature discussed in #181 to select a specific set of features to filter. That would then stay the same over time. It would expose risks over time, and the dev would be responsible to keep up. But given that very few devs can do that, this should not be the default.

In other words: By default, provide a secure sanitization, which can also evolve over time to stay secure as new risks appear. Optionally, for those expert devs who want fine control and are accepting the security risk, they can customize it as they want.

sanitizer rules follow from the product features

That's what the customization feature should allow. It should start with a good secure sanitizer setting which most closely matches the needs, then add overrides to allow or deny specific features.

I don't think we can define what a 'secure sanitizer' is, without enumerating the specific risks we mean to protect against. There is a wide understanding that JS execution from markup should be stopped, but there is no similar consensus about other content, for example:

inline subresources
frames
subresource fetches
styles
ids and class attributes (they can be used for CSS hijacking)
links to cross-origin resources, links to http://, links to custom protocols
target attribute on links (tabnabbing!)
custom elements

to just name the few. You argue to default to a strict sanitization for 'anything potentially dangerous', but this needs agreeing on what dangerous means exactly. Case in point: every HTML sanitizer I've seen so far has different defaults even for the above settings. If the whole web community for years so far didn't come up with a convention of what a 'secure HTML subset' is, why do you think that the solution devised by a few on this thread would be the best one?

"I don't trust this HTML. So, sanitizer, please remove anything potentially dangerous from it.

A sanitizer that just returns .textContent from the parsed root node meets that definition, but it's probably not what is needed.

I don't care whether it still looks the same as the original. I strongly prefer security over faithfulness of original."

In my experience working with the app developers, the reverse is true. The business priority is that the outputs look like the input, we only need to make sure that the malicious input is filtered out. Users are really not happy when the copy-paste doesn't work reliably across web applications with WYSIWYG editors, for example.

That's a great discussion so far. Thanks to all involved. I think @koto's latest comments summarizes my main gripe here. We all know there is a shared understanding of the threat model "no XSS". Everything else seems to be quite app-specific with a lot of "grey-area".

I'll stick with my recommendation to remove the distinction between baseline and default and focus on XSS first and foremost.

At a later stage (and not here!), we may add static properties to the constructor like Sanitizer.CONFIG_DISALLOW_XYZ.

Any objections?

I've discussed this with a few people on our end, so we can make some progress:

I think we're re-discussing the (security) requirements here. If we focus the Sanitizer strictly on "XSS first and foremost", and arguably on being a building block (vs an end-user focused, more opinionated API), then I think this proposal makes sense. I'm happy to go along with it, provided we're consistent about it and adopt this as the official threat model & requirement for all of the Sanitizer API.

If we narrowly focus on XSS (plus additional, user-configurable options), then indeed the distinction between baseline and default is moot and drops out. But also, the presently spec-ed Sanitizer configuration has a number of additional options, none of which have a narrow XSS aspect to it: allowComments, allowUnknownMarkup, allowCustomElements. All of these presently default to false, based on the same "opinionated" logic that got us default-vs-baseline. I'm assuming these should be default-true, then.

Similarly, allowShadowRoots is being discussed elsewhere. The same logic would apply there, I guess, and it'd be default allowed?

I'll also note that this seems to be in direct contradiction to #185, which aims to make the underlying security model opt-out-able.

So, as said: I'm happy to agree on strictly "no XSS" as the Sanitizer threat model. But it'd have consequences beyond just dropping the baseline-vs-default distinction.

My two cents as a web developer - I'm using the Sanitizer API currently to ensure that HTML strings returned from the server contain exactly the HTML elements I expect and no more. Because I know the full list of attributes and elements that I should ever see, I can use a whitelist instead of a blacklist - meaning I don't really run into any of the problems described above.

It's hardly perfect - somebody determined to screw with my markup still can, it's just that all they can do is add extra rows to a table and such, rather than load scripts and images.

While I hardly expect it to be built into the API itself, I imagine that a "copy paste friendly" whitelist for e.g. expected markup from markdown would cover a lot of use cases.

We just had a synchronous chat with Sanitizer contributors and decided that baseline is default and default is baseline.

If people want and need more, stricter protections, there will be custom Sanitizer allow lists on top of this default.

Specifically, we're hoping to look into having some provided by the API as mentioned in #82, but this will have to wait until the foundation is solid enough to build upon. Hopefully, soon!

Next step: Rewrite the spec such that baseline === default.

We just had a synchronous chat with Sanitizer contributors and decided that baseline is default and default is baseline.

What I actually said at the meeting is that we should agree on whether we stick to a strict threat model -- "XSS first and foremost" -- or not; and that the baseline decision falls out of that; and that I'm quite fine with a strict "XSS first and foremost", which would indeed mean the baseline/default distinction drops out.

I'm happy to follow group consensus on whether we sharpen the thread model to "XSS first and foremost" or not, but doing so would seem to require some additional cleanup. I'll open a separate issue to make sure this gets answered.

WICG / sanitizer-api

`baseline` should be `default` and `default` should be `baseline` #183