cure53 / DOMPurify

DOMPurify - a DOM-only, super-fast, uber-tolerant XSS sanitizer for HTML, MathML and SVG. DOMPurify works with a secure default, but offers a lot of configurability and hooks. Demo:
https://cure53.de/purify

MathML : nested <mi> content being removed ? #847

Closed marc-polizzi closed 1 year ago

marc-polizzi commented 1 year ago

This issue proposes a bug...

Background & Context

KaTeX generates a MathML <mi> element containing both an <mi> and an <mpadded> element when converting the \vdots (vertical dots) function.

Bug

The content of the <mi> is removed.

Input

<math xmlns="http://www.w3.org/1998/Math/MathML">
    <semantics>
        <mrow>
            <mi>Y</mi>
            <mo>=</mo>
            <mi>
                <mi mathvariant="normal">⋮</mi>
                <mpadded height="0em" voffset="0em">
                    <mspace mathbackground="black" width="0em" height="1.5em"></mspace>
                </mpadded>
            </mi>
        </mrow>
        <annotation encoding="application/x-tex">Y = \vdots</annotation>
    </semantics>
</math>

Note the content of the second <mi>.

Given output

<math display="block" xmlns="http://www.w3.org/1998/Math/MathML">
    <semantics>
        <mrow>
            <mi>Y</mi>
            <mo>=</mo>
            <mi></mi>
        </mrow>
        <annotation encoding="application/x-tex">Y = \vdots</annotation>
    </semantics>
</math>

The second <mi> is empty.

Expected output

Same as input.

cure53 commented 1 year ago

It seems that browsers don't accept an <mi> inside an <mi>. The first <mi> is in the MathML namespace, the nested one in the HTML namespace. DOMPurify sees that as an attack and hence removes the nested <mi>.

Since this is browser behavior, we cannot fix that, but I am certain you can fix that with a hook if you are sure no attacks can happen that way :slightly_smiling_face:

You can test that with this tool: https://livedom.bentkowski.info/

As you can see, the nested element is HTML, not MathML; this is the key problem.

marc-polizzi commented 1 year ago

@cure53 I've noticed this behavior in the DOMPurify code but was not sure I understood all of it. I'm a bit surprised, as this MathML example is the code generated by the KaTeX/LaTeX \vdots function itself. Do you mean this generated code is somehow invalid?

I'll check the hooks and see how to fix this issue.

cure53 commented 1 year ago

I am tempted to say yes, this is pretty certainly invalid MathML.

The <mi> element in MathML represents a single mathematical identifier (e.g., a variable or symbol), and nesting one inside another is not a valid use of the element according to the MathML specification.

Each <mi> element should contain a single identifier, and if you need to represent a combination of identifiers or expressions, you should use appropriate MathML elements like <mo> (for operators) or <mrow> (for grouping expressions).

marc-polizzi commented 1 year ago

@cure53 thank you. I'm trying to fix it with a hook.

cure53 commented 1 year ago

Cool, please do let us know if help is needed, but it should be straightforward :slightly_smiling_face:

I would recommend: use an element hook such as uponSanitizeElement, check whether you are inside an <mi> and whether the next element is an <mi> as well, transform the nested <mi> into something legal, then (if even necessary) change the nested content back to <mi> after sanitization.
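
A minimal sketch of the detection half of that recommendation. It assumes DOMPurify is available as a global (browser or a jsdom setup); isNestedMi, onElement and nestedMiNodes are hypothetical names, not part of DOMPurify. The transformation step is left as a comment, since which replacement counts as "legal" depends on the application and on how much you trust the input.

```javascript
// Hypothetical helper (not part of DOMPurify): true when `node` is an <mi>
// whose direct parent is also an <mi>. The names are lower-cased because an
// element parsed into the HTML namespace reports nodeName as "MI", while one
// in the MathML namespace reports "mi".
function isNestedMi(node) {
  const name = (n) =>
    n && typeof n.nodeName === 'string' ? n.nodeName.toLowerCase() : '';
  return name(node) === 'mi' && name(node.parentNode) === 'mi';
}

// Hook body: collects the nested <mi> nodes it sees. DOMPurify also passes
// event data and the config as further arguments; they are unused here.
const nestedMiNodes = [];
function onElement(node) {
  if (isNestedMi(node)) {
    nestedMiNodes.push(node);
    // Transform the nested <mi> into something legal here, and only if you
    // are sure no attack can happen this way.
  }
}

// Register the hook only when DOMPurify is actually loaded.
if (typeof DOMPurify !== 'undefined') {
  DOMPurify.addHook('uponSanitizeElement', onElement);
}
```

The hook runs before DOMPurify's own tag and namespace checks for each element, so this is the point at which the nested <mi> is still present in the tree.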