I would like to validate

Pamplemousse commented 2 years ago

Hopefully this will prompt a discussion, and I'm interested why people want validators. Please let me know your thoughts on use cases and how that relates to the definition of valid.

@mikesamuel in https://github.com/OWASP/java-html-sanitizer/blob/main/docs/html-validation.md#background

Intro

First of all, thanks for writing "Why sanitize when you can validate?" (which I believe should be titled "Why validate when you can sanitize?"). The document makes its point well, with good examples and interesting details. I stumbled upon it while searching for ways to validate HTML input. I read it carefully, and it does not quite seem to address a use case I have, so I am "prompting a discussion"! :slightly_smiling_face:

Use case

Consider a big product that takes an input, stores it somewhere, and has several means of "rendering it": in HTML on the web, PDF or CSV rendering, via an API, etc.

One approach could be to store the input as is, and encode it when it is used in output. The encoding will differ depending on the context; for HTML, we could use the sanitizer of this repo.
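A minimal Python sketch of that "store raw, encode per output" approach may make the idea concrete. It uses only the standard library rather than this repo's Java sanitizer. The function names (`render_html_text`, `render_csv_row`, `render_json`) and the sample value are hypothetical, chosen for illustration.

```python
import csv
import html
import io
import json

# The value is stored exactly as the user typed it.
raw_name = "Jack O'Neill"

def render_html_text(value):
    """Encode for an HTML text node or a quoted attribute value."""
    return html.escape(value, quote=True)

def render_csv_row(values):
    """Encode one row with the csv module's default quoting rules."""
    buf = io.StringIO()
    csv.writer(buf).writerow(values)
    return buf.getvalue()

def render_json(value):
    """Encode as a JSON document for an API response."""
    return json.dumps({"name": value})

print(render_html_text(raw_name))
print(render_csv_row([raw_name]), end="")
print(render_json(raw_name))
```

Only the HTML rendering rewrites the apostrophe; the CSV and JSON renderings keep it, which is why encoding at the point of output leaves the stored value usable everywhere.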

Sadly, we do not always control all the means of generating output. For example, it could be a third-party application pulling this data from our API, or an internal application written by a team with no security training, or legacy code, etc.

For defense in depth, we want to validate the data before storing it. Sanitizing for HTML would corrupt the data for every other use (Jack O'Neill would be Jack O&#39;Neill everywhere).

Last words

To be fair, I admit I truncated the quote ^

Herein I address why I think the latter is a bad idea for HTML specifically.

But:

In complex systems, HTML is rarely the only way to output data;
Not storing junk or potential exploits is helpful for the health of the system as a whole.

mikesamuel commented 2 years ago

Sanitizing for HTML would corrupt the data for every other use (Jack O'Neill would be Jack O&#39;Neill everywhere).

If Jack O'Neill is a string of HTML, then how would rendering it as Jack O&#39;Neill be corruption? The two are semantically the same as HTML.

If you want to identify safe&valid strings in some language other than HTML, then I don't see why an HTML validator would accurately determine that a string is safe&valid in that other language.

For defense in depth, we want to validate the data before storing it.

I disagree, as explained in "Validity is unstable in the face of emerging threats".

Also, there is no such thing as a valid or safe string; only one that is valid or safe to use in some context. If you store data that is valid&safe in languages X & Y, but someone retrieves that data and uses it as a string in language Z, then they need to validate it as a string in language Z.

Pamplemousse commented 2 years ago

If Jack O'Neill is a string of HTML

It's not necessarily. From the application point of view, this is the name of the person, and the application doesn't know exactly where it will be rendered.

If you want to identify safe&valid strings in some language other than HTML, then I don't see why an HTML validator would accurately determine that a string is safe&valid in that other language.

This is not what I want, sorry if that was unclear in the original comment: I am thinking of using an HTML validator for data entries that are not supposed to contain HTML.

Take the example of a person's name again: generally, names should not contain HTML, so it makes sense to verify that they "do not contain HTML". As "Why sanitize when you can validate? > Defining Valid" describes, it's hard to define "not containing HTML" without involving browser interpretation. The goal, however, is not a silver bullet but a drastic reduction of the attack surface, i.e. the set of payloads that can be injected through the name. The "silver bullet" would be encoding, if you control the output of the data, which is not always the case (legacy consumer code, third-party integration, sloppy mistakes, etc.).

I disagree as explained in Validity is unstable in the face of emerging threats.

It's kinda missing the point though: defense in depth is not necessarily "silver bullets at every level". In the presence of an injection vulnerability, constraining user inputs at least raises the complexity (and thus the cost) of developing an exploit, even if it only prevents some exploits by "luck".

Also, there is no such thing as a valid or safe string; only one that is valid or safe to use in some context

Correct.

In practice, when talking about a web application, the most likely context is HTML rendering. Validating strings that aren't supposed to contain HTML is a way to reduce the risk of injection in case encoding isn't done properly or is forgotten altogether.

mikesamuel commented 2 years ago

If you want to define a language negatively, plain text that does not contain HTML tags, comments, or directives, then do that.

There are parser tools like ANTLR that will help check that a string is in a language. So you can reject plain text that has meta-characters like & and < followed by certain characters.
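As a rough illustration of such a negatively defined language, here is a minimal Python sketch that uses a regular expression instead of an ANTLR grammar. The function name `is_plain_text` and the exact set of rejected sequences are assumptions made for this example, not a vetted definition of "does not contain HTML".

```python
import re

# Reject "<" when it could start a tag, comment, or directive, and
# reject "&" when it could start a character reference.
# This is deliberately conservative, so it also rejects harmless text.
_MARKUP_START = re.compile(r"<[A-Za-z/!?]|&[#A-Za-z]")

def is_plain_text(value):
    """Return True if no markup-like sequence appears in value."""
    return _MARKUP_START.search(value) is None

for sample in ["Jack O'Neill", "a < b", "<script>alert(1)</script>",
               "Tom &amp; Jerry", "AT&T"]:
    print(repr(sample), is_plain_text(sample))
```

Even this small rule set rejects a legitimate value such as `AT&T`, so any real deployment has to trade false positives against coverage.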

I suspect that that language is going to get progressively more complicated over time. When someone runs a client-side auto-link detector that linkifies www.example.com and @username, and it does something surprising, are you going to revise the language to plain text that does not contain ... and which does not contain substrings that trigger corner cases in your auto-linker?

I suspect that it is also going to run into problems with quotation marks. Being embeddable in HTML where a text node is expected does not imply being embeddable in an attribute like <img alt="..."> or <table summary='...'>.
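The quotation-mark problem can be sketched in Python with the standard library's `html.parser`. That parser is not a browser parser, so treat this as an illustration only. The sample value and the helper `attribute_names` are hypothetical.

```python
import html
from html.parser import HTMLParser

class AttrCollector(HTMLParser):
    """Collect (name, value) attribute pairs from every start tag."""
    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(attrs)

def attribute_names(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return [name for name, _ in parser.attrs]

# The value has no "<" or "&", so it is inert where a text node is expected.
value = 'x" onerror="alert(1)'

attr_naive = '<img alt="%s">' % value
attr_escaped = '<img alt="%s">' % html.escape(value, quote=True)

print(attribute_names(attr_naive))    # the quote ends alt and starts a new attribute
print(attribute_names(attr_escaped))  # the whole value stays inside alt
```

A value can therefore pass a "contains no tags" check and still change the markup once it is placed in an attribute without context-specific escaping.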

My skepticism that this is doable in practice comes from seeing people try these and fail. That doesn't mean that there aren't many wildly commercially successful products that claim to do this. I just don't buy that the effort put into deploying them wouldn't be better spent on using a contextually auto-escaping template language and integrating the right sanitizers at the right places.
