Open simon-greatrix opened 3 years ago
Thanks for the detailed report.
What about replacing any banned code-point with the replacement character (U+FFFD)?
It looks like, from that link, that the set of banned codepoints is [U+0..U+8, U+B, U+C, U+E, U+F, U+7F..U+9F, U+FDD0..U+FDEF, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, U+10FFFF] but in numeric character references you also need to ban U+D. Does that seem right?
UTF-16 surrogate pairs where one or both of the surrogates are specified via reference seem like non-minimal sequence problem. Should those be banned? Replaced with one replacement char or two?
If so, we'd need to do some work during lexing.
A quick test suggests that we don't have to worry about surrogate pairs where one is specified via reference.
I ran this:
System.out.println(new HtmlPolicyBuilder().toFactory().sanitize("\ud83e\udd21"));
System.out.println(new HtmlPolicyBuilder().toFactory().sanitize("�\udd21"));
System.out.println(new HtmlPolicyBuilder().toFactory().sanitize("\ud83e�"));
System.out.println(new HtmlPolicyBuilder().toFactory().sanitize("��"));
System.out.println(new HtmlPolicyBuilder().toFactory().sanitize("🤡"));
and go the result "🤡" every time (That's the clown face emoji, by the way).
Having had a few hours to think about, I think removing the bad characters in the same way we do for isolated surrogates and U+FFFE and U+FFFF is the right thing to do.
The "bad characters" with the rationale would be:
U+0000 to U+0008 : C0 Control characters, banned by XML. U+000B to U+000C : C0 Control characters, banned by XML. U+000E to U+001F : Again, C0 control characters, banned by XML. U+007F to U+007F : The "Delete" character. Banned by HTML as numeric escape. Potential for misuse to hide information. U+0080 to U+0084 : C1 control character discouraged by XML. Banned as HTML numeric escape. U+0085 to U+0085 : The EBCDIC "next line" character. Banned as HTML numeric escape. U+0086 to U+009F : Remaining C1 control characters. Discouraged by XML, banned as HTML numeric escape. U+D800 to U+DFFF : Surrogates. Bad only if not part of a surrogate pair. U+FDD0 to U+FDEF : Non-characters. U+FFFE to U+FFFF : Non-characters U+_p_FFFE to U+_p_FFFF : (Where p is a supplemental plane from 0x01 to 0x10) Non characters
Additionally:
U+0085 : Whilst not discouraged by XML, there is no consistent definition of how the character should be handled. Some assume it is badly encoded Windows-1252 and show an ellipsis. Some recognise it is a new line, some just as white space. As there is no clear use for U+0085, I believe any usage of it can be viewed with suspicion, and therefore it should be elided.
U+000D : HTML bans this as a numeric escape, but if used as a raw CR it becomes an attack vector in systems where it is treated as a Carriage Return and not a New Line. The carriage return moves the output point back to the start of the line and allows overwriting of data. Whilst not a problem in HTML, this is a problem as a log-file injection.
I believe that it would be best to normalize newlines as described here: https://infra.spec.whatwg.org/#normalize-newlines
This removes all occurrences of U+000D without changing the meaning of the HTML.
So my proposal is to normalize newlines and then remove anything on the list of "bad characters" above.
How does that sound as a solution?
Ok. I guess as long as the way it's done does not affect the HTML renderer then I expect that it won't cause much fallout.
Then we can focus testing on token merging conflicts as where Ø
in the below is replaced with a banned char:
<Øscript>
<sØcript ="">
<oØbject>
<a title=Øonclick=alert(1)>
<a href="javascript&Ø#58;alert(1)">a</a>
Newline normalizing sounds good. I think a few tests for CRLF vs CR should do fine.
It might be worth updating the fuzzer test so that the fragments it splices around include a banned character and a U+D. That's a pretty terrible fuzzer, but it's saved me from releasing bugs into prod before.
Thanks for taking this on.
I have created a pull request for my proposed fix. I have scattered comments on the proposed code changes to explain my thinking.
Thanks. I see #225
The HTML living standard ( https://html.spec.whatwg.org/multipage/syntax.html#character-references ) states:
However, non-characters from the supplemental planes are encoded numerically.
The code:
Produces: ""
I see two possible simple possible solutions, but I am loathe to recommend either one:
First the characters could be elided in line with the elision of U+FFFE and U+FFFF. This produces a strange botch that is not required by the rules of HTML nor XML, and I don't like it.
Alternatively, the character is allowed if it is not numerically escaped and the noncharacters U+FDD0 to U+FDEF are presented unescaped - so consistency with other non-characters would produce legal HTML. However, all other supplemental code points are represented by numeric escapes to avoid corruption when converting between unicode encodings. I am not happy with introducing special cases for these supplemental code points.
More complex solutions would be to introduce a policy for handling of the "discouraged" characters defined in https://www.w3.org/TR/2008/REC-xml-20081126/#charsets.
I am happy to put the time into creating a fix and test cases, but I need guidance as to what is the "correct" solution.