OWASP / java-html-sanitizer

Takes third-party HTML and produces HTML that is safe to embed in your web application. Fast and easy to configure.
Other
850 stars 214 forks source link

Wrong sanitised output for link #254

Closed alex-alvarezg closed 2 years ago

alex-alvarezg commented 2 years ago

The following input

/test/?param1=valueOne&param2=valueTwo will be sanitized to:

/test/?param1=valueOne¶m2=valueTwo but should be sanitized to

/test/?param1=valueOne&param2=valueTwo

The following code is used:

    private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder()
            .toFactory()
            .and(Sanitizers.LINKS);

URL_POLICY.sanitize("/test/?param1=valueOne&param2=valueTwo")
jmanico commented 2 years ago

That's just text, not a link. I am not set up right now and am on the run, but can you try:

|private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder() .toFactory() .and(Sanitizers.LINKS); URL_POLICY.sanitize("<a href=\"/test/?param1=valueOne&param2=valueTwo\">click me"); or similar? |

On 3/22/22 10:39 AM, alex-alvarezg wrote:

The following input

|/test/?param1=valueOne&param2=valueTwo | will be sanitized to:

|/test/?param1=valueOne¶m2=valueTwo | but should be sanitized to

|/test/?param1=valueOne&param2=valueTwo |

The following code is used:

|private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder() .toFactory() .and(Sanitizers.LINKS); URL_POLICY.sanitize("/test/?param1=valueOne&param2=valueTwo") |

— Reply to this email directly, view it on GitHub https://github.com/OWASP/java-html-sanitizer/issues/254, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEBYCIRNMDB6SI44VBY7GLVBIAWFANCNFSM5RLRZS5A. You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Jim Manico Manicode Security https://www.manicode.com

alex-alvarezg commented 2 years ago

Thanks Jim, will give it a try, it used to work with previous versions.

alex-alvarezg commented 2 years ago

Still fails with output:

<a href="/test/?param1&#61;valueOne¶m2&#61;valueTwo" rel="nofollow">click me</a>

it does work correctly if something else is used, instead of param

So we found out that if any part of the string matches a character entity reference (as in this table: https://dev.w3.org/html5/html-author/charref ), it will be converted to the entity

for instance:

For the entity &para; = ¶

The input will be converted as follows:

&param2 = ¶m2

mikesamuel commented 2 years ago

Ok, so the problem is that &para without a semicolon is decoded to the paragraph symbol. I think the operable code here is

https://github.com/OWASP/java-html-sanitizer/blob/33d319f876abbb35cb95eddf9705c46bd96822bd/src/main/java/org/owasp/html/HtmlEntities.java#L1724-L1727

That's derived from

https://github.com/OWASP/java-html-sanitizer/blob/33d319f876abbb35cb95eddf9705c46bd96822bd/src/main/java/org/owasp/html/HtmlEntities.java#L46-L47

I think we're actually handling this according to spec since https://html.spec.whatwg.org/entities.json still has a line

"&para": { "codepoints": [182], "characters": "\u00B6" },

Do browsers do some less thorough entity matching for URL attribute values? I don't remember that changing but I haven't been following html.spec as closely as I could.

kusako commented 2 years ago

Hi, I'm far from an expert on this, but browsers seem to have different behaviour for attribute values. Maybe because of https://html.spec.whatwg.org/#named-character-reference-state .

In practice something like <a href="/foo?param1=1&param2=2">foo</a> is probably rather common, so personally I think sanitization shouldn't break this. Also it used to work in 20191001.1, but seems to stop working with 20200713.1 and later.

mikesamuel commented 2 years ago

Thanks for the pointer, @kusako. Another thing to check is whether the name match is greedy; whether <a href="?x&para="> should decode to include ¶ but <a href="?x&param="> should not.

<doctype html>
<meta charset="utf-8"/>
<style>a { display: block }</style>

<a href="?x&para="></a>
<a href="?x&param="></a>
<a href="?x&para"></a>
<a title="?x&para=" href=.></a>
<a title="?x&param=" href=.></a>
<a title="?x&para" href=.></a>

<script>
(() => {
    const links = document.querySelectorAll('a')
    links.forEach((link) => {
        let { href, title } = link;
        if (title) {
            link.textContent = `title is ${title}`;
        } else {
            link.textContent = `href is ${href}`;
        }
    });
})();
</script>

produces on Chrome, Firefox, Safari:

href is file:///tmp/foo.html?x&para=
href is file:///tmp/foo.html?x&param=
href is file:///tmp/foo.html?x%C2%B6
title is ?x&para=
title is ?x&param=
title is ?x¶

So it seems like the rule is, if there's no semicolon at the end of the entity name, and the next character is not valid in a character reference, and the next character is not =, then decode.

I can probably write some JS to test that further, but changing the decode loop to something like that should address the problem.

mikesamuel commented 2 years ago

It looks like we need to grab = and ASCII alphanumerics but only when decoding an attribute, not when decoding text node content.

Continues character reference name in CDATA

Continues character reference name in title attribute

  1. U+30 - U+39: [0-9]
  2. U+3d: [=]
  3. U+41 - U+5a: [A-Z]
  4. U+61 - U+7a: [a-z]

The above was derived by looking at each basic plane code-point:

<h1>Continues character reference name in CDATA</h1>

<ol id="continuers-cdata"></ol>

<h1>Continues character reference name in title attribute</h1>
<ol id="continuers-attr"></ol>
<script>
(() => {
    let continuers = {
        cdata: document.getElementById("continuers-cdata"),
        attr: document.getElementById("continuers-attr"),
    };
    let starts = {
        cdata: null,
        attr: null,
    }
    let div = document.createElement("div");
    let limit = 0x10000;
    for (let i = 0; i < limit; ++i) {
        div.innerHTML = `<span title="&para${String.fromCharCode(i)}">&para${String.fromCharCode(i)}</span>`;
        let span = div.querySelector("span");
        let { textContent, title } = span;
        step(i, !textContent.startsWith("\u00b6"), 'cdata');
        step(i, !title.startsWith("\u00b6"), 'attr');
    }
    for (let key in continuers) { step(limit, false, key); }

    function step(i, present, key) {
        let start = starts[key];
        if (start !== null) {
            if (!present) {
                let end = i - 1;
                let continuersList = continuers[key];
                starts[key] = null;
                let li = document.createElement("li");
                let range = (start === end)
                    ? `U+${start.toString(16)}`
                    : `U+${start.toString(16)} - U+${end.toString(16)}`;
                let chars = (start === end)
                    ? String.fromCharCode(start)
                    : `${String.fromCharCode(start)}-${String.fromCharCode(end)}`;
                li.appendChild(document.createTextNode(`${range}: [${chars}]`));
                continuersList.appendChild(li);
            }
        } else if (present) {
            starts[key] = i;
        }
    }
})();
</script>