OWASP / java-html-sanitizer

Takes third-party HTML and produces HTML that is safe to embed in your web application. Fast and easy to configure.

HTML Comment Sanitization Issue #235

Open log2akshat opened 2 years ago

log2akshat commented 2 years ago

Hi,

We use this library in Zimbra to sanitize e-mail bodies. While sanitizing customer-generated HTML, we found that when extra text appears inside the opening of an HTML comment, the sanitizer does not parse the content properly.

<style><ZZZ!--
/* Font Definitions */
@font-face
=09{font-family:Wingdings;
...
....
.....
=09{margin-bottom:0in;}
--></style>

After removing the text ZZZ from the comment, the content of the e-mail body is displayed properly. The sanitizer appears to treat the comment as a nested tag inside the style tag and to search for that tag's closing </. When it finds </style>, it treats it as the closing of the nested <ZZZ> tag, which leaves the other tags unbalanced.

I have also gone through the HtmlLexer code but could not figure out a fix. Because the lexer processes the input one character at a time, regex pattern matching does not seem to be an option for handling this case.

It would be great if someone could guide me on how to handle this situation, or if it could be considered as an enhancement or bug fix.

mikesamuel commented 2 years ago

Unfortunately, putting ZZZ in there makes that not an HTML comment.

The lexer should recognize that <style> content is raw text so you should be able to use a pre-processor to modify the text based on the not-quite-HTML dialect that you're dealing with.

https://github.com/OWASP/java-html-sanitizer#preprocessors
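
The text-rewriting step such a pre-processor would perform can be sketched in plain Java. This is only an illustration under stated assumptions: the class name StyleCommentFixer and the method fix are hypothetical, and the pattern assumes the malformed opener always has the shape "<", some letters or digits, then "!--", as in the <ZZZ!-- sample above. The sketch does not call the sanitizer itself. Whether the rewrite is applied to the raw HTML string before sanitizing or inside a pre-processor's text callback is left open.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of java-html-sanitizer.
public final class StyleCommentFixer {
  // Assumed shape of the malformed opener: '<', a letter, more letters/digits, then "!--".
  // A well-formed "<!--" is not matched because a letter must follow '<'.
  private static final Pattern BROKEN_OPENER =
      Pattern.compile("<[A-Za-z][A-Za-z0-9]*!--");

  private StyleCommentFixer() {}

  /** Rewrites openers such as "<ZZZ!--" to "<!--" and leaves all other text untouched. */
  public static String fix(String text) {
    return BROKEN_OPENER.matcher(text).replaceAll("<!--");
  }

  public static void main(String[] args) {
    String broken = "<style><ZZZ!--\n/* Font Definitions */\n--></style>";
    // The stray ZZZ is dropped, so the style content starts with a regular comment opener.
    System.out.println(fix(broken));
  }
}
```

Running the rewrite on the whole document could also touch look-alike text outside <style>, so limiting it to style content is the more conservative choice if the pre-processor hook exposes that text.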