OWASP / java-html-sanitizer

Takes third-party HTML and produces HTML that is safe to embed in your web application. Fast and easy to configure.
Other
854 stars 214 forks source link

Preserve the casing of tags and attributes names. #182

Closed wookie41 closed 4 years ago

wookie41 commented 5 years ago

It would be nice if the the policy builder had an option that disabled the "sanitization" of tags and attributes names. Currently all names are converted to lowercase which is ok when you're using it for HTML only, but if there is an SVG image nested inside the HTML it breaks. For example, when viewBox attribute on is converted to viewbox and the image is not displayed correctly.

zeeneir commented 4 years ago

Please add an option to preserve casing of tags and attributes. I've the same issue with sanitizing svg. This is one of the best tools out there and this particular shortcoming is a show stopper for my project. Any possible workarounds is also appreciated until this is fixed. Thanks!

wookie41 commented 4 years ago

@zeeneir a while back I did something like this and actually forgot to open a pull request for it. I'll review it on monday and open a PR.

https://github.com/wookie41/java-html-sanitizer/commit/affcccd93cec5a3c91856316179658471a2006f0

zeeneir commented 4 years ago

@wookie41 excellent and thanks :ok_hand:

jmanico commented 4 years ago

Can you explain how case sensitivity is a problem? and render the same as a simple example. Where is this a problem for you, can you give us a test case?

Regards.

Jim Manico @Manicode

On Mar 27, 2020, at 10:50 AM, zeeneir notifications@github.com wrote:

 Please add an option to preserve casing of tags and attributes. I've the same issue with sanitizing svg. This is one of the best tools out there and this particular shortcoming is a show stopper for my project. Any possible workarounds is also appreciated until this is fixed. Thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

wookie41 commented 4 years ago

@jmanico SVG renderers actually care about the casing (at least the one in Firefox) and the image simply doesn't show, when viewBox is sanitized to viewbox. I know that this isn't an svg sanitizer, but it does the job I need it to do after creating an svg policy.

jmanico commented 4 years ago

That’s a pretty solid test case of you ask me.

-- Jim Manico @Manicode

On Mar 27, 2020, at 11:03 AM, Łukasz Bogaczyński notifications@github.com wrote:

 @jmanico SVG renderers actually care about the casing (at least the one in Firefox) and the image simply doesn't show, when viewBox is sanitized to viewbox. I know that this isn't an svg sanitizer, but it does the job I need it to do, after creating an svg policy.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

zeeneir commented 4 years ago

Same comment as @wookie41, Agree that this is HTML sanitizer, but I think this small change would allow proper sanitizing of SVG documents. Some renderers don't care about case, but others do. Mozilla states:

SVG elements and attributes should all be entered in the case shown here since XML is case-sensitive (unlike HTML). Here: https://developer.mozilla.org/en-US/docs/Web/SVG/Tutorial/Introduction

zeeneir commented 4 years ago

any updates for this issue?

mikesamuel commented 4 years ago

What's the list of names that need to be preserved?

HTML uppercase qualified name provides a notion that we could use for a canonical name that prefers lower-case ASCII.

If an element/attribute name is not defined in the HTML namespace and it is defined in the corresponding SVG or MathML namespace, then we use the preferred casing from that namespace, otherwise we convert to lower-case in a locale-insensitive manner.

I don't think the xlink namespace matters.

Would that be sufficient?

zeeneir commented 4 years ago

If an element/attribute name is not defined in the HTML namespace and it is defined in the corresponding SVG or MathML namespace, then we use the preferred casing from that namespace, otherwise we convert to lower-case in a locale-insensitive manner.

Ideal would be a setting on the policy builder that indicates whether case should be preserved for tags/attributes, however, this proposal is good as well. If you go this route, you don't need us to provide a list, do you?

mikesamuel commented 4 years ago

@zeeneir If the bug results from a mismatch in how converted names are interpreted by HTML parsers then I don't see how more policy settings would help.

zeeneir commented 4 years ago

@mikesamuel, I see. Would you proposal cover overlapping tags/attributes? One example I can think of: HTML has