HTML Comment Sanitization Issue

Hi,

We use this library in Zimbra to sanitize e-mail bodies. While sanitizing customer-generated HTML, we came across a case where stray text inside the opening of an HTML comment prevents the content from being parsed properly:

<style><ZZZ!--
/* Font Definitions */
@font-face
=09{font-family:Wingdings;
...
....
.....
=09{margin-bottom:0in;}
--></style>

After removing the text ZZZ from the comment, the e-mail body is displayed properly. The lexer seems to treat <ZZZ!-- as a nested tag inside the style tag and then searches for the closing of that nested tag (</). When it finds </style>, it considers it the closing of the nested <ZZZ> tag, which leaves the other tags unbalanced.

I have also gone through the code of HtmlLexer but was not able to figure out a fix. The lexer parses the input one character at a time, so regex pattern matching does not seem to be an option for handling this inside the lexer.
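As a stopgap on our side, a pre-processing step applied to the HTML before it reaches the sanitizer might work. The sketch below is not part of java-html-sanitizer: the class and method names are hypothetical, and it assumes the stray text is a run of letters or digits sitting directly between < and !-- at the very start of a style block, as in the sample above.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of java-html-sanitizer.
class StyleCommentPreprocessor {

    // Group 1: an opening <style ...> tag plus any whitespace after it.
    // Then: '<', a run of letters/digits (the stray text), and "!--".
    private static final Pattern MALFORMED_COMMENT_OPENER = Pattern.compile(
            "(<style\\b[^>]*>\\s*)<[a-z][a-z0-9]*!--",
            Pattern.CASE_INSENSITIVE);

    // Rewrites e.g. "<style><ZZZ!--" to "<style><!--" and leaves
    // everything else in the input unchanged.
    static String normalizeStyleCommentOpeners(String html) {
        return MALFORMED_COMMENT_OPENER.matcher(html).replaceAll("$1<!--");
    }

    public static void main(String[] args) {
        String body = "<style><ZZZ!--\n/* Font Definitions */\n--></style>";
        // The normalized string is what would then be handed to the sanitizer.
        System.out.println(normalizeStyleCommentOpeners(body));
    }
}
```

This only covers the exact shape we have seen so far. Stray text in other positions, or malformed comments outside a style block, would pass through untouched.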

It would be great if someone could guide me on how to handle this situation, or if this could be considered as an enhancement or bug fix.

OWASP / java-html-sanitizer

HTML Comment Sanitization Issue #235