OWASP / java-html-sanitizer

Takes third-party HTML and produces HTML that is safe to embed in your web application. Fast and easy to configure.
Other
843 stars 213 forks source link

Use HTML5 comment syntax instead of HTML4 #231

Open alelievre44 opened 3 years ago

alelievre44 commented 3 years ago

Hello

I would like to submit you a problem in function HtmlInputSplitter.parseToken() .

When you sanitize HTML text like : "<!-- COMMENT -- ME -> I'M ALONE --> MY CODE"

the parser detect end comment just before "I'M" rather than before "MY CODE".

Problem is due to this code in HtmlInputSplitter.parseToken():

    case COMMENT_DASH_DASH:     
        if ('>' == ch) {
         state = HtmlInputSplitter.State.DONE;
         type = HtmlTokenType.COMMENT;
         } else if ('-' == ch) {
             state = HtmlInputSplitter.State.COMMENT_DASH_DASH;
         } else {
             state = HtmlInputSplitter.State.COMMENT_DASH;
         }
         break;

the last state need to be HtmlInputSplitter.State.COMMENT otherwise end comment is detected after "->" otherwise "-->

    case COMMENT_DASH_DASH:
    ....
        } else if ('-' == ch) {
             state = HtmlInputSplitter.State.COMMENT_DASH_DASH;
        } else {
             state = HtmlInputSplitter.State.COMMENT
        }
        break;

Could you fix this issue, please ?

mikesamuel commented 3 years ago

I think it was written that way for compatibility with HTML4 quirks. (Emphasis added)

White space is not permitted between the markup declaration open delimiter("<!") and the comment open delimiter ("--"), but is permitted between the comment close delimiter ("--") and the markup declaration close delimiter (">"). A common error is to include a string of hyphens ("---") within a comment. Authors should avoid putting two or more adjacent hyphens inside comments.

IIRC, some HTML4 parsers treated any > after a non-starting -- sequence as closing.

That said, we can fix the lexer to conform to HTML5's comment grammar as long as the HTML serializer does not emit comments that have -- in odd places.


The current syntax is governed by html.spec.whatwg.org §13.1.6.

Comments must have the following format:

  • The string "<!--".
  • Optionally, text, with the additional restriction that the text must not start with the string ">", nor start with the string "->", nor contain the strings "<!--", "-->", or "--!>", nor end with the string "<!-".
  • The string "-->".

Additionally §13.2.8 notes XML compatibility restrictions:

If the XML API restricts comments from having two consecutive U+002D HYPHEN-MINUS characters (--), the tool may insert a single U+0020 SPACE character between any such offending characters.

which IIUC, alludes to the (Char - '-') in XML § 2.5

 Comment   ::=   '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
alelievre44 commented 3 years ago

Thanks for the explanation. In my use case, I receive parts of HTML code and I want to sanitize this parts before displaying them on a page. So, how could I disallow the use of "--" in comments, to warn my client when I receive this ambiguous code ? I don't find any methods before parseToken() to detect this problem.

alelievre44 commented 3 years ago

Thanks for the explanation. In my use case, I receive parts of HTML to analyse, so I'm not the owner. How can I detect an error in parser when we receive "--" in a comment, to disallow this usage ? I didn't find anything

alelievre44 commented 3 years ago

I don't know why the issue was closed, so i reopened it!!!

mikesamuel commented 2 years ago

I don't know that there's a good way to do this. The early stages of processing throw away any comment tokens and pre-processors get raw text so can't distinguish between --> MY CODE from --&gt; MY CODE.

mikesamuel commented 2 years ago

You could use the HtmlLexer to create an event stream and look at it for ambiguity so you can give feedback in addition to running the string through the normal sanitizer pipeline.

jmanico commented 2 years ago

I think wiping out all comments is the right way to sanitize for a security library.