Forbidden numeric character references appear in sanitized HTML

simon-greatrix commented 3 years ago

The HTML living standard ( https://html.spec.whatwg.org/multipage/syntax.html#character-references ) states:

The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.

However, non-characters from the supplemental planes are encoded numerically.

The code:

    StringBuilder builder = new StringBuilder();
    Encoding.encodeRcdataOnto(Character.toString(0x5fffe), builder);
    System.out.println(builder.toString());

Produces: "񟿾"

I see two possible simple possible solutions, but I am loathe to recommend either one:

First the characters could be elided in line with the elision of U+FFFE and U+FFFF. This produces a strange botch that is not required by the rules of HTML nor XML, and I don't like it.

Alternatively, the character is allowed if it is not numerically escaped and the noncharacters U+FDD0 to U+FDEF are presented unescaped - so consistency with other non-characters would produce legal HTML. However, all other supplemental code points are represented by numeric escapes to avoid corruption when converting between unicode encodings. I am not happy with introducing special cases for these supplemental code points.

More complex solutions would be to introduce a policy for handling of the "discouraged" characters defined in https://www.w3.org/TR/2008/REC-xml-20081126/#charsets.

I am happy to put the time into creating a fix and test cases, but I need guidance as to what is the "correct" solution.

mikesamuel commented 3 years ago

Thanks for the detailed report.

What about replacing any banned code-point with the replacement character (U+FFFD)?

It looks like, from that link, that the set of banned codepoints is [U+0..U+8, U+B, U+C, U+E, U+F, U+7F..U+9F, U+FDD0..U+FDEF, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, U+10FFFF] but in numeric character references you also need to ban U+D. Does that seem right?

UTF-16 surrogate pairs where one or both of the surrogates are specified via reference seem like non-minimal sequence problem. Should those be banned? Replaced with one replacement char or two?

If so, we'd need to do some work during lexing.

simon-greatrix commented 3 years ago

A quick test suggests that we don't have to worry about surrogate pairs where one is specified via reference.

I ran this:

    System.out.println(new HtmlPolicyBuilder().toFactory().sanitize("\ud83e\udd21"));
    System.out.println(new HtmlPolicyBuilder().toFactory().sanitize("&#xd83e;\udd21"));
    System.out.println(new HtmlPolicyBuilder().toFactory().sanitize("\ud83e&#xdd21;"));
    System.out.println(new HtmlPolicyBuilder().toFactory().sanitize("&#xd83e;&#xdd21;"));
    System.out.println(new HtmlPolicyBuilder().toFactory().sanitize("&#x1f921;"));

and go the result "🤡" every time (That's the clown face emoji, by the way).

simon-greatrix commented 3 years ago

Having had a few hours to think about, I think removing the bad characters in the same way we do for isolated surrogates and U+FFFE and U+FFFF is the right thing to do.

The "bad characters" with the rationale would be:

U+0000 to U+0008 : C0 Control characters, banned by XML. U+000B to U+000C : C0 Control characters, banned by XML. U+000E to U+001F : Again, C0 control characters, banned by XML. U+007F to U+007F : The "Delete" character. Banned by HTML as numeric escape. Potential for misuse to hide information. U+0080 to U+0084 : C1 control character discouraged by XML. Banned as HTML numeric escape. U+0085 to U+0085 : The EBCDIC "next line" character. Banned as HTML numeric escape. U+0086 to U+009F : Remaining C1 control characters. Discouraged by XML, banned as HTML numeric escape. U+D800 to U+DFFF : Surrogates. Bad only if not part of a surrogate pair. U+FDD0 to U+FDEF : Non-characters. U+FFFE to U+FFFF : Non-characters U+_p_FFFE to U+_p_FFFF : (Where p is a supplemental plane from 0x01 to 0x10) Non characters

Additionally:

U+0085 : Whilst not discouraged by XML, there is no consistent definition of how the character should be handled. Some assume it is badly encoded Windows-1252 and show an ellipsis. Some recognise it is a new line, some just as white space. As there is no clear use for U+0085, I believe any usage of it can be viewed with suspicion, and therefore it should be elided.

U+000D : HTML bans this as a numeric escape, but if used as a raw CR it becomes an attack vector in systems where it is treated as a Carriage Return and not a New Line. The carriage return moves the output point back to the start of the line and allows overwriting of data. Whilst not a problem in HTML, this is a problem as a log-file injection.

I believe that it would be best to normalize newlines as described here: https://infra.spec.whatwg.org/#normalize-newlines

This removes all occurrences of U+000D without changing the meaning of the HTML.

So my proposal is to normalize newlines and then remove anything on the list of "bad characters" above.

How does that sound as a solution?

mikesamuel commented 3 years ago

Ok. I guess as long as the way it's done does not affect the HTML renderer then I expect that it won't cause much fallout.

Then we can focus testing on token merging conflicts as where Ø in the below is replaced with a banned char:

<Øscript>

<sØcript ="">

<oØbject>

<a title=Øonclick=alert(1)>

<a href="javascript&Ø#58;alert(1)">a</a>

Newline normalizing sounds good. I think a few tests for CRLF vs CR should do fine.

It might be worth updating the fuzzer test so that the fragments it splices around include a banned character and a U+D. That's a pretty terrible fuzzer, but it's saved me from releasing bugs into prod before.

Thanks for taking this on.

simon-greatrix commented 3 years ago

I have created a pull request for my proposed fix. I have scattered comments on the proposed code changes to explain my thinking.

mikesamuel commented 3 years ago

Thanks. I see #225

OWASP / java-html-sanitizer

Forbidden numeric character references appear in sanitized HTML #223