brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Allow flat xml:id attributes for math #2040

Open dginev opened 1 year ago

dginev commented 1 year ago

This issue requests a setting that changes the preference for math identifiers from hierarchical to global. I can implement the PR if there is interest.

A global id is very simple to realize in the base case. For a document with 1000 formulas, each with 1000 nodes, we would see the id fragments m1 to m1000000. We could make it mildly more sophisticated by having one counter for top-level formulas (m) and another for inner formula nodes (maybe xm).
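To make the two-counter variant concrete, here is a minimal sketch in Python. It is not LaTeXML code: the class and method names are hypothetical, and it assumes the inner-node counter restarts for every formula, which is one reading of the proposal.

```python
from itertools import count


class FlatMathIds:
    """Hypothetical allocator: a document-wide counter for formulas ("m")
    and a counter for inner nodes ("xm") that restarts with each formula."""

    def __init__(self):
        self._formulas = count(1)
        self._current = None  # id of the formula being processed
        self._nodes = None    # node counter for that formula

    def new_formula(self):
        """Start a new top-level formula and return its id, e.g. 'm1'."""
        self._current = f"m{next(self._formulas)}"
        self._nodes = count(1)
        return self._current

    def new_node(self):
        """Return the next inner-node id of the current formula, e.g. 'm1.xm1'."""
        if self._current is None:
            raise RuntimeError("new_node() called before new_formula()")
        return f"{self._current}.xm{next(self._nodes)}"


ids = FlatMathIds()
print(ids.new_formula())  # first formula
print(ids.new_node())     # its first inner node
print(ids.new_node())     # its second inner node
print(ids.new_formula())  # second formula, node counter restarts
```

With a per-formula node counter the ids stay short even in very large documents, at the cost of the inner id only being unique together with its formula prefix.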

Motivation

I am currently bundling the newest arXMLiv dataset and inspecting the sources. Some bits are jarring even on the tenth encounter. I don't have enough space to paste the full formula on GitHub (it easily overflows a screen in the source view), but here is the presentation node for a single open parenthesis (from the second section of arXiv:1410.8088):

<mo id="S2.SS0.Ex5.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2" xref="S2.SS0.Ex5.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml">(</mo>

First, this is quite jarring when a developer or author first encounters it. Second, when multiplied by a billion formulas, it starts to tax the space allocated for arXMLiv.

Tools like tralics prefer a completely flat scheme, such as:

<mo id="cid4209" xref="cid4992">(</mo>

That seems a bit overboard, though it maximizes the savings in size. For HTML, one idea is to go flat only for inner math nodes, where the problem is most pronounced, i.e.

<mo id="S2.SS0.Ex5.m1.xm18" xref="S2.SS0.Ex5.m1.xm18.cmml">(</mo>
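As a rough illustration of what going flat only for inner nodes could mean, the sketch below rewrites hierarchical node ids into the `<formula>.xmN` shape. It is a hypothetical post-processing helper, not how LaTeXML assigns ids: the regular expression, the function name, and the first-seen numbering order are all assumptions, and the sample ids use shortened paths for readability.

```python
import re

# Assumed shape of an inner math node id: everything up to and including the
# formula segment ("m" + digits), then a numeric tree path, then an optional
# ".cmml" suffix for the content-MathML counterpart.
_MATH_ID = re.compile(
    r"^(?P<formula>(?:[^.]+\.)*?m\d+)(?P<path>(?:\.\d+)+)(?P<suffix>\.cmml)?$"
)


def flatten_inner_ids(ids):
    """Map hierarchical math node ids to '<formula>.xmN' form.

    Nodes are numbered per formula in first-seen order. Ids that do not look
    like inner math nodes are returned unchanged.
    """
    counters = {}  # formula id -> last number handed out
    assigned = {}  # (formula id, tree path) -> flat node id
    result = []
    for old in ids:
        match = _MATH_ID.match(old)
        if match is None:
            result.append(old)
            continue
        formula, path = match["formula"], match["path"]
        key = (formula, path)
        if key not in assigned:
            counters[formula] = counters.get(formula, 0) + 1
            assigned[key] = f"{formula}.xm{counters[formula]}"
        # Keep the ".cmml" suffix so id/xref pairs still line up.
        result.append(assigned[key] + (match["suffix"] or ""))
    return result


sample = [
    "S2.SS0.Ex5.m1.1.1.2",
    "S2.SS0.Ex5.m1.1.1.1.cmml",
    "S2.SS0.Ex5.m1.1.1.1",
    "S2.p1",
]
for old, new in zip(sample, flatten_inner_ids(sample)):
    print(old, "->", new)
```

Because the mapping is keyed on the formula and the tree path, a node and its `.cmml` reference receive the same number, so cross-references survive the rewrite.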

Edit: here is also a motivating example where the document context prefix gets rather long:

<mi id="S3.SS2.SSS2.p5.10.m10.1.1.2.2" xref="S3.SS2.SSS2.p5.10.m10.1.1.2.2.cmml">F</mi>

Personally, I would be OK with going the extra step further and discarding the document context from math element ids, and instead using only a global counter for math nodes with a secondary counter for the internal nodes:

<mo id="m576.xm18" xref="m576.xm18.cmml">(</mo>
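To put rough numbers on the three schemes for the same open-parenthesis node, a small script can count the characters spent on the `id` and `xref` attribute values. This is only a per-node sketch using the examples quoted in this issue; the helper name is made up, and real savings depend on the actual distribution of nesting depths.

```python
import re


def attr_chars(markup):
    """Total characters spent on the id and xref attribute values."""
    values = re.findall(r'\b(?:id|xref)="([^"]*)"', markup)
    return sum(len(value) for value in values)


# The three variants of the same node, copied from the examples in this issue.
examples = {
    "hierarchical": (
        '<mo id="S2.SS0.Ex5.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.2" '
        'xref="S2.SS0.Ex5.m1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.cmml">(</mo>'
    ),
    "flat inner nodes": (
        '<mo id="S2.SS0.Ex5.m1.xm18" xref="S2.SS0.Ex5.m1.xm18.cmml">(</mo>'
    ),
    "flat formulas and nodes": '<mo id="m576.xm18" xref="m576.xm18.cmml">(</mo>',
}

baseline = attr_chars(examples["hierarchical"])
for name, markup in examples.items():
    used = attr_chars(markup)
    print(f"{name}: {used} attribute chars, {baseline - used} saved per node")
```

Multiplying the per-node difference by the number of math nodes in a corpus gives a first-order estimate of the uncompressed size reduction.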

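To conclude, there are two problems I am motivated to address: the deeply nested ids are jarring for developers and authors to read, and at the scale of arXMLiv their length takes up a significant amount of space.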

I am aware that the primary behavior must be kept as-is long term, so that DLMF can continue to be regenerated with the math ids it has today. So this ought to be an optional switch, likely in latexml.sty.

dginev commented 5 months ago

There is also something very verbose happening to IDs for math elements that end up inside SVG diagrams. As an example, here is a screenshot from the browser inspector: [image]

From a recent arXiv article: https://arxiv.org/html/2402.12530v1#S6.SS2.SSS0.Px2