Closed alex-alvarezg closed 2 years ago
That's just text, not a link. I am not set up right now and am on the run, but can you try:
|private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder() .toFactory() .and(Sanitizers.LINKS); URL_POLICY.sanitize("<a href=\"/test/?param1=valueOne¶m2=valueTwo\">click me"); or similar? |
On 3/22/22 10:39 AM, alex-alvarezg wrote:
The following input
|/test/?param1=valueOne¶m2=valueTwo | will be sanitized to:
|/test/?param1=valueOne¶m2=valueTwo | but should be sanitized to
|/test/?param1=valueOne¶m2=valueTwo |
The following code is used:
|private final PolicyFactory URL_POLICY = new HtmlPolicyBuilder() .toFactory() .and(Sanitizers.LINKS); URL_POLICY.sanitize("/test/?param1=valueOne¶m2=valueTwo") |
— Reply to this email directly, view it on GitHub https://github.com/OWASP/java-html-sanitizer/issues/254, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEBYCIRNMDB6SI44VBY7GLVBIAWFANCNFSM5RLRZS5A. You are receiving this because you are subscribed to this thread.Message ID: @.***>
-- Jim Manico Manicode Security https://www.manicode.com
Thanks Jim, will give it a try, it used to work with previous versions.
Still fails with output:
<a href="/test/?param1=valueOne¶m2=valueTwo" rel="nofollow">click me</a>
it does work correctly if something else is used, instead of param
So we found out that if any part of the string matches a character entity reference (as in this table: https://dev.w3.org/html5/html-author/charref ), it will be converted to the entity
for instance:
For the entity
¶ = ¶
The input will be converted as follows:
¶m2 = ¶m2
Ok, so the problem is that ¶
without a semicolon is decoded to the paragraph symbol.
I think the operable code here is
That's derived from
I think we're actually handling this according to spec since https://html.spec.whatwg.org/entities.json still has a line
"¶": { "codepoints": [182], "characters": "\u00B6" },
Do browsers do some less thorough entity matching for URL attribute values? I don't remember that changing but I haven't been following html.spec as closely as I could.
Hi, I'm far from an expert on this, but browsers seem to have different behaviour for attribute values. Maybe because of https://html.spec.whatwg.org/#named-character-reference-state .
In practice something like <a href="/foo?param1=1¶m2=2">foo</a>
is probably rather common, so personally I think sanitization shouldn't break this.
Also it used to work in 20191001.1, but seems to stop working with 20200713.1 and later.
Thanks for the pointer, @kusako.
Another thing to check is whether the name match is greedy; whether <a href="?x¶=">
should decode to include ¶ but <a href="?x¶m=">
should not.
<doctype html>
<meta charset="utf-8"/>
<style>a { display: block }</style>
<a href="?x¶="></a>
<a href="?x¶m="></a>
<a href="?x¶"></a>
<a title="?x¶=" href=.></a>
<a title="?x¶m=" href=.></a>
<a title="?x¶" href=.></a>
<script>
(() => {
const links = document.querySelectorAll('a')
links.forEach((link) => {
let { href, title } = link;
if (title) {
link.textContent = `title is ${title}`;
} else {
link.textContent = `href is ${href}`;
}
});
})();
</script>
produces on Chrome, Firefox, Safari:
href is file:///tmp/foo.html?x¶=
href is file:///tmp/foo.html?x¶m=
href is file:///tmp/foo.html?x%C2%B6
title is ?x¶=
title is ?x¶m=
title is ?x¶
So it seems like the rule is, if there's no semicolon at the end of the entity name, and the next character is not valid in a character reference, and the next character is not =
, then decode.
I can probably write some JS to test that further, but changing the decode loop to something like that should address the problem.
It looks like we need to grab =
and ASCII alphanumerics but only when decoding an attribute, not when decoding text node content.
The above was derived by looking at each basic plane code-point:
<h1>Continues character reference name in CDATA</h1>
<ol id="continuers-cdata"></ol>
<h1>Continues character reference name in title attribute</h1>
<ol id="continuers-attr"></ol>
<script>
(() => {
let continuers = {
cdata: document.getElementById("continuers-cdata"),
attr: document.getElementById("continuers-attr"),
};
let starts = {
cdata: null,
attr: null,
}
let div = document.createElement("div");
let limit = 0x10000;
for (let i = 0; i < limit; ++i) {
div.innerHTML = `<span title="¶${String.fromCharCode(i)}">¶${String.fromCharCode(i)}</span>`;
let span = div.querySelector("span");
let { textContent, title } = span;
step(i, !textContent.startsWith("\u00b6"), 'cdata');
step(i, !title.startsWith("\u00b6"), 'attr');
}
for (let key in continuers) { step(limit, false, key); }
function step(i, present, key) {
let start = starts[key];
if (start !== null) {
if (!present) {
let end = i - 1;
let continuersList = continuers[key];
starts[key] = null;
let li = document.createElement("li");
let range = (start === end)
? `U+${start.toString(16)}`
: `U+${start.toString(16)} - U+${end.toString(16)}`;
let chars = (start === end)
? String.fromCharCode(start)
: `${String.fromCharCode(start)}-${String.fromCharCode(end)}`;
li.appendChild(document.createTextNode(`${range}: [${chars}]`));
continuersList.appendChild(li);
}
} else if (present) {
starts[key] = i;
}
}
})();
</script>
The following input
/test/?param1=valueOne¶m2=valueTwo
will be sanitized to:/test/?param1=valueOne¶m2=valueTwo
but should be sanitized to/test/?param1=valueOne&param2=valueTwo
The following code is used: