cellml / libcellml

Repository for libCellML development.
https://libcellml.org
Apache License 2.0

Improve MathML validation #1082

Closed: agarny closed this issue 1 year ago

agarny commented 1 year ago

We use the MathML DTD to validate a MathML string (see the call to XmlDoc::parseMathML() in Validator::ValidatorImpl::validateMath()). However, passing DTD validation is not sufficient to ensure that the string is valid MathML. Consider the following MathML string:

<math xmlns="http://www.w3.org/1998/Math/MathML">
    <apply>
        <eq/>
        <ci>y</ci>
        <apply>
            <root/>
            <ci>a</ci>
            <ci>b</ci>
            <ci>c</ci>
        </apply>
    </apply>
</math>

According to the MathML DTD (and https://www.mathmlcentral.com/Tools/ValidateMathML.jsp), it's valid. However, a root element must have either one or two siblings. If there are two siblings, then the first must be a degree element with one child that gives the degree of the root, and the second is what needs to be rooted. If there is only one sibling, then we have a square root. In other words:

<math xmlns="http://www.w3.org/1998/Math/MathML">
    <apply>
        <eq/>
        <ci>y</ci>
        <apply>
            <root/>
            <ci>a</ci>
        </apply>
    </apply>
</math>

and:

<math xmlns="http://www.w3.org/1998/Math/MathML">
    <apply>
        <eq/>
        <ci>y</ci>
        <apply>
            <root/>
            <degree>
                <ci>b</ci>
            </degree>
            <ci>a</ci>
        </apply>
    </apply>
</math>

are really valid.

So, upon successful DTD "validation", we should confirm that the MathML is indeed valid by doing some extra checks. For the root element, for instance, we should check that it has one or two siblings, etc.
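To illustrate the kind of extra check meant here, below is a minimal sketch in Python using the standard-library xml.etree.ElementTree. It is not libCellML code (libCellML is C++), and `check_root_usage` is a hypothetical name chosen for this example. It only encodes the rules stated above: a root element has one sibling (square root), or two siblings where the first is a degree element with exactly one child.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def tag(name):
    # ElementTree stores namespaced tags as "{namespace}name".
    return f"{{{MATHML_NS}}}{name}"


def check_root_usage(mathml):
    """Return a list of problems with <root/> usage (empty if none).

    Hypothetical helper for illustration; it assumes the string has
    already passed DTD validation and is well-formed XML.
    """
    issues = []
    doc = ET.fromstring(mathml)
    for apply in doc.iter(tag("apply")):
        children = list(apply)
        # Only look at <apply> elements whose operator is <root/>.
        if not children or children[0].tag != tag("root"):
            continue
        siblings = children[1:]
        if len(siblings) == 1:
            # Square root: the single sibling is what gets rooted.
            continue
        if len(siblings) == 2:
            degree = siblings[0]
            if degree.tag != tag("degree"):
                issues.append("first of two root siblings must be a degree element")
            elif len(degree) != 1:
                issues.append("degree element must have exactly one child")
            continue
        issues.append(
            f"root must have one or two siblings, found {len(siblings)}"
        )
    return issues
```

Under these assumptions, the first MathML string in this issue (three ci siblings after root) yields one problem, while the two strings that follow it yield an empty list.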

This means that once a MathML string has been fully validated, something like the analyser can do whatever it wants with it without having to confirm whether a root element is properly used, etc.