Open jfbyers opened 7 months ago
@jfbyers Can you pl provide a test case, where with a minimal policy, post sanitization, the vector is not sanitized?
Hi @subbudvk , with the current implementation the following strings are not properly sanitized:
class BugTest
{
public static void main(String[] args)
{
String[] test = new String[]{"qwe1<img>qwe2", "qwe3<!>qwe4", "qwe5<!-->qwe6"};
for (String s : test)
{
PolicyFactory sanitizePolicy = new HtmlPolicyBuilder().toFactory();
String safeString = sanitizePolicy.sanitize(s, null, null);
System.out.println( "safeString :" + safeString);
}
}
}
This outputs:
safeString :qwe1qwe2
safeString :qwe3
safeString :qwe5
And it should be:
safeString :qwe1qwe2
safeString :qwe3qwe4
safeString :qwe5qwe6
As I mentioned in my first message (apologies if I did not explain myself clearly) this is even more misleading if you use a listener to add tags/attributes in
discardedTag()
or
discardedAttribute()
from the string being sanitized e.g string qwe3<!><img src=1 onerror=alert(1)>
would not add <img src=1 onerror=alert(1)>
as a discarded tag so any library consumer might take incorrect decisions about the contents of the string.
Thanks @melloware , can you approve / run the workflows or you want me to add more unit tests?
No you are good. I don't have commit privs so I can't run your workflow only @mikesamuel has permissions to this repo.
Which HTML RFC are you quoting?
https://html.spec.whatwg.org/multipage/syntax.html#comments seems to disagree. https://html.spec.whatwg.org/multipage/parsing.html#parse-errors does class <!
under incorrectly opened comment, but why would the incorrectly opened comment end at >
though instead of -->
?
And it should be:
safeString :qwe1qwe2 safeString :qwe3qwe4 safeString :qwe5qwe6
Why?
@mikesamuel : Can you kindly release a new version at least with whatever changes that are already merged to master?
+1 to @subbudvk just get any new version out there that gets rid of Guava.
Added support for 2 comment parser errors https://html.spec.whatwg.org/multipage/parsing.html#parse-errors :
Which HTML RFC are you quoting? ....
Hey @mikesamuel
We are quoting this RFC: https://www.ietf.org/rfc/rfc1866.txt section 3.2.5.
To include comments in an HTML document, use a comment declaration. A comment declaration consists of
<!
followed by zero or more comments followed by>
. Each comment starts with--
and includes all text up to and including the next occurrence of--
. In a comment declaration, white space is allowed after each comment, but not before the first comment. The entire comment declaration is ignored
Based on this, <!>
it is indeed a valid self closed comment declaration. Without any comment.
It is true, that based on this, <!-->
it is not a valid self closed comment declaration. However, we have seen in the most used browsers (Chrome, Firefox, Edge, Safari), the behavior is to modify this blob into <!---->
which is indeed a valid self closed comment section, with one empty comment.
I can build some PoC for both scenarios detailed above, let me know if that is needed or if this is enough information.
Thanks!!
+1 to @subbudvk just get any new version out there that gets rid of Guava.
Done
We are quoting this RFC: https://www.ietf.org/rfc/rfc1866.txt section 3.2.5.
I don't think any modern browser references that standard and, since it's an RFC, it hasn't changes since it was published in 1995. The WhatWG steering group is representatives from the main browser vendors, and html.whatwg.spec.org is the best approximation of what HTML is.
+1 to @subbudvk just get any new version out there that gets rid of Guava.
Done
Thanks @mikesamuel ! Appreciate your contribution to the open source community.
I don't think any modern browser references that standard and, since it's an RFC, it hasn't changes since it was published in 1995. The WhatWG steering group is representatives from the main browser vendors, and html.whatwg.spec.org is the best approximation of what HTML is.
About this, please forgive me in advance... I am not super used to referencing in RFCs and/or browsers standards.
That being said, in https://html.spec.whatwg.org/multipage/syntax.html#comments they state a comment must start with <!--
and finish with -->
, which as we are seeing it is not true. Browsers clearly identify a pattern like <!>
as a comment section.
Not sure if this means this HTML spec is wrong, or it is just not the one followed by browsers.
Also the pattern <!-->
that browsers turn into <!-- -->
its not specified in the spec (I guess this one rightfully so).
Again, let me know if it is useful to provide a live PoC for this.
Summary and intro
The HTML Lexer fails to detect empty HTML comment declarations leading into the next piece of HTML object in the input to not be detected as such.
Basically, for an input such as
<!><img src=1 onError=alert(1)>
, the lexer considers the whole blob as an HTML comment instead of an empty comment declaration (<!>
) and then animg
tag (which is what browsers would do).This bug in the lexer, provides wrong information to the
HtmlChangeListener
HTML RFC context
In the HTML RFC, comments are defined as:
This means that
<!>
is a valid comment declaration with zero comments inside. Which means the following HTML code would trigger thealert(1)
if rendered in a browser:In addition, the pattern
<!-->
is also transformed by browsers to<!-- -->
(not sure why yet, but tested in all major browsers). This means the following HTML code would also trigger thealert(1)
if rendered in a browser:The bug
The Lexer does not consider neither
<!>
nor<!-->
as valid comment declaration statements, considering the last character of both statements (>
) still as part of the comment. This means, when the sanitizer reads the following input<!><img src=1 onError=alert(1)>
, the lexer will interpret the whole input as an HTML comment. However, the expected behavior would be to detect<!>
as an HTML comment declaration, and<img src=1 onError=alert(1)>
as animg
HTML tag.This is easy to see with the
HtmlChangeListener
class. For example, the following code provides the following output:The output:
However, the expected correct output should be:
The fix
We added a new state called
COMMENT_DASH_AFTER_BANG
into theHtmlInputSplitter
insideHtmlLexer
to handle dash character after the bang one (!-
) Also we created special condition checks in theBANG
andBANG_DASH
states inside the lexer's state machine to handle<!>
and<!-->
comments.Authors
Bug discovery: Carlos Villa (@carlosvillasanchez) Code fix: Eduardo Aguado (@jfbyers)