Wrong sanitised output for link

alex-alvarezg commented 2 years ago

The following input

/test/?param1=valueOne&param2=valueTwo will be sanitized to:

/test/?param1=valueOne¶m2=valueTwo but should be sanitized to

/test/?param1=valueOne&param2=valueTwo

The following code is used:

    private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder()
            .toFactory()
            .and(Sanitizers.LINKS);

URL_POLICY.sanitize("/test/?param1=valueOne&param2=valueTwo")

jmanico commented 2 years ago

That's just text, not a link. I am not set up right now and am on the run, but can you try:

|private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder() .toFactory() .and(Sanitizers.LINKS); URL_POLICY.sanitize("<a href=\"/test/?param1=valueOne&param2=valueTwo\">click me"); or similar? |

On 3/22/22 10:39 AM, alex-alvarezg wrote:

The following input

|/test/?param1=valueOne&param2=valueTwo | will be sanitized to:

|/test/?param1=valueOne¶m2=valueTwo | but should be sanitized to

|/test/?param1=valueOne&param2=valueTwo |

The following code is used:

|private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder() .toFactory() .and(Sanitizers.LINKS); URL_POLICY.sanitize("/test/?param1=valueOne&param2=valueTwo") |

— Reply to this email directly, view it on GitHub https://github.com/OWASP/java-html-sanitizer/issues/254, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEBYCIRNMDB6SI44VBY7GLVBIAWFANCNFSM5RLRZS5A. You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Jim Manico Manicode Security https://www.manicode.com

alex-alvarezg commented 2 years ago

Thanks Jim, will give it a try, it used to work with previous versions.

alex-alvarezg commented 2 years ago

Still fails with output:

<a href="/test/?param1=valueOne¶m2=valueTwo" rel="nofollow">click me</a>

it does work correctly if something else is used, instead of param

So we found out that if any part of the string matches a character entity reference (as in this table: https://dev.w3.org/html5/html-author/charref ), it will be converted to the entity

for instance:

For the entity ¶ = ¶

The input will be converted as follows:

&param2 = ¶m2

mikesamuel commented 2 years ago

Ok, so the problem is that &para without a semicolon is decoded to the paragraph symbol. I think the operable code here is

https://github.com/OWASP/java-html-sanitizer/blob/33d319f876abbb35cb95eddf9705c46bd96822bd/src/main/java/org/owasp/html/HtmlEntities.java#L1724-L1727

That's derived from

https://github.com/OWASP/java-html-sanitizer/blob/33d319f876abbb35cb95eddf9705c46bd96822bd/src/main/java/org/owasp/html/HtmlEntities.java#L46-L47

I think we're actually handling this according to spec since https://html.spec.whatwg.org/entities.json still has a line

"&para": { "codepoints": [182], "characters": "\u00B6" },

Do browsers do some less thorough entity matching for URL attribute values? I don't remember that changing but I haven't been following html.spec as closely as I could.

kusako commented 2 years ago

Hi, I'm far from an expert on this, but browsers seem to have different behaviour for attribute values. Maybe because of https://html.spec.whatwg.org/#named-character-reference-state .

In practice something like <a href="/foo?param1=1&param2=2">foo</a> is probably rather common, so personally I think sanitization shouldn't break this. Also it used to work in 20191001.1, but seems to stop working with 20200713.1 and later.

mikesamuel commented 2 years ago

Thanks for the pointer, @kusako. Another thing to check is whether the name match is greedy; whether <a href="?x&para="> should decode to include ¶ but <a href="?x&param="> should not.

<doctype html>
<meta charset="utf-8"/>
<style>a { display: block }</style>

<a href="?x&para="></a>
<a href="?x&param="></a>
<a href="?x&para"></a>
<a title="?x&para=" href=.></a>
<a title="?x&param=" href=.></a>
<a title="?x&para" href=.></a>

<script>
(() => {
    const links = document.querySelectorAll('a')
    links.forEach((link) => {
        let { href, title } = link;
        if (title) {
            link.textContent = `title is ${title}`;
        } else {
            link.textContent = `href is ${href}`;
        }
    });
})();
</script>

produces on Chrome, Firefox, Safari:

href is file:///tmp/foo.html?x&para=
href is file:///tmp/foo.html?x&param=
href is file:///tmp/foo.html?x%C2%B6
title is ?x&para=
title is ?x&param=
title is ?x¶

So it seems like the rule is, if there's no semicolon at the end of the entity name, and the next character is not valid in a character reference, and the next character is not =, then decode.

I can probably write some JS to test that further, but changing the decode loop to something like that should address the problem.

mikesamuel commented 2 years ago

It looks like we need to grab = and ASCII alphanumerics but only when decoding an attribute, not when decoding text node content.

Continues character reference name in CDATA

Continues character reference name in title attribute

U+30 - U+39: [0-9]
U+3d: [=]
U+41 - U+5a: [A-Z]
U+61 - U+7a: [a-z]

The above was derived by looking at each basic plane code-point:

<h1>Continues character reference name in CDATA</h1>

<ol id="continuers-cdata"></ol>

<h1>Continues character reference name in title attribute</h1>
<ol id="continuers-attr"></ol>
<script>
(() => {
    let continuers = {
        cdata: document.getElementById("continuers-cdata"),
        attr: document.getElementById("continuers-attr"),
    };
    let starts = {
        cdata: null,
        attr: null,
    }
    let div = document.createElement("div");
    let limit = 0x10000;
    for (let i = 0; i < limit; ++i) {
        div.innerHTML = `<span title="&para${String.fromCharCode(i)}">&para${String.fromCharCode(i)}</span>`;
        let span = div.querySelector("span");
        let { textContent, title } = span;
        step(i, !textContent.startsWith("\u00b6"), 'cdata');
        step(i, !title.startsWith("\u00b6"), 'attr');
    }
    for (let key in continuers) { step(limit, false, key); }

    function step(i, present, key) {
        let start = starts[key];
        if (start !== null) {
            if (!present) {
                let end = i - 1;
                let continuersList = continuers[key];
                starts[key] = null;
                let li = document.createElement("li");
                let range = (start === end)
                    ? `U+${start.toString(16)}`
                    : `U+${start.toString(16)} - U+${end.toString(16)}`;
                let chars = (start === end)
                    ? String.fromCharCode(start)
                    : `${String.fromCharCode(start)}-${String.fromCharCode(end)}`;
                li.appendChild(document.createTextNode(`${range}: [${chars}]`));
                continuersList.appendChild(li);
            }
        } else if (present) {
            starts[key] = i;
        }
    }
})();
</script>

OWASP / java-html-sanitizer

Wrong sanitised output for link #254

Continues character reference name in CDATA

Continues character reference name in title attribute