cure53 / DOMPurify

DOMPurify - a DOM-only, super-fast, uber-tolerant XSS sanitizer for HTML, MathML and SVG. DOMPurify works with a secure default, but offers a lot of configurability and hooks. Demo:
https://cure53.de/purify
Other
13.98k stars 723 forks source link

&para gets transformed to ¶ #995

Closed kaelig closed 1 month ago

kaelig commented 1 month ago

Background & Context

I found a bug concerning how HTML entities like &para (without the trailing ;) are being handled during sanitization, especially regarding how the clean output reflects the intended display of entities like the paragraph symbol ().

Bug

Input

The input HTML thrown at DOMPurify:

<span>&para</span>

Given output

The output given by DOMPurify:

<span>¶</span>

Expected output

The expected output:

<span>&amp;para</span>

DOMPurify appears to be converting the &para entity into its equivalent Unicode symbol () in the cleaned HTML. However, the expectation is for the original HTML entity &para; to remain intact without being converted to the symbol.

kaelig commented 1 month ago

Funnily enough, transformations in URLs is correct.

Input

<a href="https://example.com/?foo=bar&para=baz">https://example.com/?foo=bar&para=baz</a>

Given output

<a href="https://example.com/?foo=bar&amp;para=baz">https://example.com/?foo=bar¶=baz</a>

Expected output

<a href="https://example.com/?foo=bar&amp;para=baz">https://example.com/?foo=bar&amp;para=baz</a>
cure53 commented 1 month ago

That is done by the browser or DOM engine DOMPurify uses, not by DOMPurify itself. Sadly, we cannot fix this as this is fully expected behavior and related how the browser deals with named HTML entities.

kaelig commented 1 month ago

The fact that an entity (without the trailing ;) gets transformed is a strange bug.

Can you please point me to where I should open a bug report?

cure53 commented 1 month ago

I think this is not a bug but specified behavior, see HTML spec.

kaelig commented 1 month ago

The HTML spec does not specify to render &para (without a trailing semicolon) into , so that's got to be a bug.

cure53 commented 1 month ago

https://html.spec.whatwg.org/multipage/named-characters.html

para;   U+000B6     ¶
para    U+000B6     ¶

Here it does :)

kaelig commented 1 month ago

The table you reference could be somewhat misleading. The spec is very clear that the sequence must be terminated by a semicolon character.

https://html.spec.whatwg.org/multipage/syntax.html#syntax-charref

The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (;).

Therefore I maintain that this is a bug.

cure53 commented 1 month ago

Yes, but for legacy reasons, some entities work without as per spec - and para is one :slightly_smiling_face: No :lady_beetle:

kaelig commented 1 month ago

You're right in pointing that browsers are required to support them in their rendering engines for legacy reasons – that said they're non-comforming (all named character references are required to end with a semicolon, and uses of named character references without a semicolon are flagged as errors.), and sadly it's overreaching into the transformed output of DOMpurify.

Although this feels like an undesirable side-effect, I now understand better why you said it's intended as per the spec!