OWASP / java-html-sanitizer

Takes third-party HTML and produces HTML that is safe to embed in your web application. Fast and easy to configure.
Other
849 stars 214 forks source link

Fixes in HTML Lexer to support HTML empty comment statements #327

Open jfbyers opened 7 months ago

jfbyers commented 7 months ago

Summary and intro

The HTML Lexer fails to detect empty HTML comment declarations leading into the next piece of HTML object in the input to not be detected as such.

Basically, for an input such as <!><img src=1 onError=alert(1)>, the lexer considers the whole blob as an HTML comment instead of an empty comment declaration (<!>) and then an img tag (which is what browsers would do).

This bug in the lexer, provides wrong information to the HtmlChangeListener

HTML RFC context

In the HTML RFC, comments are defined as:

To include comments in an HTML document, use a comment declaration. A comment declaration consists of `<!' followed by zero or more comments followed by `>'. Each comment starts with `--' and includes all text up to and including the next occurrence of `--'.

This means that <!> is a valid comment declaration with zero comments inside. Which means the following HTML code would trigger the alert(1) if rendered in a browser:

<html><body>
<!><script>alert(1)</script>
</body></html>

In addition, the pattern <!--> is also transformed by browsers to <!-- --> (not sure why yet, but tested in all major browsers). This means the following HTML code would also trigger the alert(1) if rendered in a browser:

<html><body>
<!--><script>alert(1)</script>
</body></html>

The bug

The Lexer does not consider neither <!> nor <!--> as valid comment declaration statements, considering the last character of both statements (>) still as part of the comment. This means, when the sanitizer reads the following input <!><img src=1 onError=alert(1)>, the lexer will interpret the whole input as an HTML comment. However, the expected behavior would be to detect <!> as an HTML comment declaration, and <img src=1 onError=alert(1)> as an img HTML tag.

This is easy to see with the HtmlChangeListener class. For example, the following code provides the following output:

package org.owasp.html;

import java.util.ArrayList;
import java.util.List;

class BugTest
{
    public static void main(String[] args)
    {
        String[] test = new String[]{"qwe1<img>qwe2", "qwe3<!><img src=1>qwe4", "qwe5<!--><img src=1>qwe6"};
        List<String> results = new ArrayList<String>();
        for (String s : test)
        {
            MyListener htmlChangeListener = new MyListener();
            PolicyFactory sanitizePolicy = new HtmlPolicyBuilder().toFactory();
            String safeString = sanitizePolicy.sanitize(s, htmlChangeListener, results);
            System.out.println( "safeString :"  + safeString + " "+htmlChangeListener.getOutOfPolicyObjects());
        }

    }

}

... 

package org.owasp.html;

import javax.annotation.Nonnull;
import javax.annotation.Nullable;
import java.util.LinkedList;
import java.util.List;

public class MyListener implements HtmlChangeListener<List<String>>
{
    private final List<String> outOfPolicyObjects = new LinkedList<String>();

    @Override
    public void discardedTag(@Nullable List<String> context, @Nonnull String elementName)
    {
        outOfPolicyObjects.add(elementName);
    }

    @Override
    public void discardedAttributes(@Nullable List<String> context, @Nonnull String tagName, @Nonnull String... attributeNames)
    {
        outOfPolicyObjects.add(String.format("%s (%s)", tagName, String.join(",", attributeNames)));
    }

    public List<String> getOutOfPolicyObjects()
    {
        return outOfPolicyObjects;
    }
}

The output:

safeString :qwe1qwe2 [img]
safeString :qwe3qwe4 []
safeString :qwe5 []

However, the expected correct output should be:

safeString :qwe1qwe2 [img]
safeString :qwe3qwe4 [img]
safeString :qwe5qwe6 [img]

The fix

We added a new state called COMMENT_DASH_AFTER_BANG into the HtmlInputSplitter inside HtmlLexer to handle dash character after the bang one (!-) Also we created special condition checks in the BANG and BANG_DASH states inside the lexer's state machine to handle <!> and <!--> comments.

Authors

Bug discovery: Carlos Villa (@carlosvillasanchez) Code fix: Eduardo Aguado (@jfbyers)

subbudvk commented 7 months ago

@jfbyers Can you pl provide a test case, where with a minimal policy, post sanitization, the vector is not sanitized?

jfbyers commented 7 months ago

Hi @subbudvk , with the current implementation the following strings are not properly sanitized:

class BugTest
{
    public static void main(String[] args)
    {
        String[] test = new String[]{"qwe1<img>qwe2", "qwe3<!>qwe4", "qwe5<!-->qwe6"};
        for (String s : test)
        {
            PolicyFactory sanitizePolicy = new HtmlPolicyBuilder().toFactory();
            String safeString = sanitizePolicy.sanitize(s, null, null);
            System.out.println( "safeString :"  + safeString);
        }

    }
}

This outputs:

safeString :qwe1qwe2
safeString :qwe3
safeString :qwe5

And it should be:

safeString :qwe1qwe2
safeString :qwe3qwe4
safeString :qwe5qwe6

As I mentioned in my first message (apologies if I did not explain myself clearly) this is even more misleading if you use a listener to add tags/attributes in

discardedTag() 

or

discardedAttribute()

from the string being sanitized e.g string qwe3<!><img src=1 onerror=alert(1)> would not add <img src=1 onerror=alert(1)> as a discarded tag so any library consumer might take incorrect decisions about the contents of the string.

jfbyers commented 6 months ago

Thanks @melloware , can you approve / run the workflows or you want me to add more unit tests?

melloware commented 6 months ago

No you are good. I don't have commit privs so I can't run your workflow only @mikesamuel has permissions to this repo.

mikesamuel commented 6 months ago

Which HTML RFC are you quoting?

https://html.spec.whatwg.org/multipage/syntax.html#comments seems to disagree. https://html.spec.whatwg.org/multipage/parsing.html#parse-errors does class <! under incorrectly opened comment, but why would the incorrectly opened comment end at > though instead of -->?

And it should be:

safeString :qwe1qwe2 safeString :qwe3qwe4 safeString :qwe5qwe6

Why?

subbudvk commented 6 months ago

@mikesamuel : Can you kindly release a new version at least with whatever changes that are already merged to master?

melloware commented 6 months ago

+1 to @subbudvk just get any new version out there that gets rid of Guava.

jfbyers commented 6 months ago

Added support for 2 comment parser errors https://html.spec.whatwg.org/multipage/parsing.html#parse-errors :

carlosvillasanchez commented 6 months ago

Which HTML RFC are you quoting? ....

Hey @mikesamuel

We are quoting this RFC: https://www.ietf.org/rfc/rfc1866.txt section 3.2.5.

To include comments in an HTML document, use a comment declaration. A comment declaration consists of <! followed by zero or more comments followed by >. Each comment starts with -- and includes all text up to and including the next occurrence of --. In a comment declaration, white space is allowed after each comment, but not before the first comment. The entire comment declaration is ignored

Based on this, <!> it is indeed a valid self closed comment declaration. Without any comment.

It is true, that based on this, <!--> it is not a valid self closed comment declaration. However, we have seen in the most used browsers (Chrome, Firefox, Edge, Safari), the behavior is to modify this blob into <!----> which is indeed a valid self closed comment section, with one empty comment.

I can build some PoC for both scenarios detailed above, let me know if that is needed or if this is enough information.

Thanks!!

mikesamuel commented 6 months ago

+1 to @subbudvk just get any new version out there that gets rid of Guava.

Done

mikesamuel commented 6 months ago

We are quoting this RFC: https://www.ietf.org/rfc/rfc1866.txt section 3.2.5.

I don't think any modern browser references that standard and, since it's an RFC, it hasn't changes since it was published in 1995. The WhatWG steering group is representatives from the main browser vendors, and html.whatwg.spec.org is the best approximation of what HTML is.

subbudvk commented 6 months ago

+1 to @subbudvk just get any new version out there that gets rid of Guava.

Done

Thanks @mikesamuel ! Appreciate your contribution to the open source community.

carlosvillasanchez commented 5 months ago

I don't think any modern browser references that standard and, since it's an RFC, it hasn't changes since it was published in 1995. The WhatWG steering group is representatives from the main browser vendors, and html.whatwg.spec.org is the best approximation of what HTML is.

About this, please forgive me in advance... I am not super used to referencing in RFCs and/or browsers standards.

That being said, in https://html.spec.whatwg.org/multipage/syntax.html#comments they state a comment must start with <!-- and finish with -->, which as we are seeing it is not true. Browsers clearly identify a pattern like <!> as a comment section.

Not sure if this means this HTML spec is wrong, or it is just not the one followed by browsers.

Also the pattern <!--> that browsers turn into <!-- --> its not specified in the spec (I guess this one rightfully so).

Again, let me know if it is useful to provide a live PoC for this.