Closed kaelig closed 1 month ago
Funnily enough, transformations in URLs is correct.
<a href="https://example.com/?foo=bar¶=baz">https://example.com/?foo=bar¶=baz</a>
<a href="https://example.com/?foo=bar&para=baz">https://example.com/?foo=bar¶=baz</a>
<a href="https://example.com/?foo=bar&para=baz">https://example.com/?foo=bar&para=baz</a>
That is done by the browser or DOM engine DOMPurify uses, not by DOMPurify itself. Sadly, we cannot fix this as this is fully expected behavior and related how the browser deals with named HTML entities.
The fact that an entity (without the trailing ;
) gets transformed is a strange bug.
Can you please point me to where I should open a bug report?
I think this is not a bug but specified behavior, see HTML spec.
The HTML spec does not specify to render ¶
(without a trailing semicolon) into ¶
, so that's got to be a bug.
https://html.spec.whatwg.org/multipage/named-characters.html
para; U+000B6 ¶
para U+000B6 ¶
Here it does :)
The table you reference could be somewhat misleading. The spec is very clear that the sequence must be terminated by a semicolon character.
https://html.spec.whatwg.org/multipage/syntax.html#syntax-charref
The ampersand must be followed by one of the names given in the named character references section, using the same case. The name must be one that is terminated by a U+003B SEMICOLON character (
;
).
Therefore I maintain that this is a bug.
Yes, but for legacy reasons, some entities work without as per spec - and para is one :slightly_smiling_face: No :lady_beetle:
You're right in pointing that browsers are required to support them in their rendering engines for legacy reasons – that said they're non-comforming (all named character references are required to end with a semicolon, and uses of named character references without a semicolon are flagged as errors.), and sadly it's overreaching into the transformed output of DOMpurify.
Although this feels like an undesirable side-effect, I now understand better why you said it's intended as per the spec!
Background & Context
I found a bug concerning how HTML entities like
¶
(without the trailing;
) are being handled during sanitization, especially regarding how the clean output reflects the intended display of entities like the paragraph symbol (¶
).Bug
Input
The input HTML thrown at DOMPurify:
Given output
The output given by DOMPurify:
Expected output
The expected output:
DOMPurify appears to be converting the
¶
entity into its equivalent Unicode symbol (¶
) in the cleaned HTML. However, the expectation is for the original HTML entity¶
to remain intact without being converted to the symbol.