IACR / latex-submit

Web server to receive uploaded LaTeX and execute it in a docker container.
GNU Affero General Public License v3.0

provide MATHML or CDATA LaTeX in crossref XML registration #60

Closed by kmccurley 2 months ago

kmccurley commented 8 months ago

When we report data to Crossref, we need to properly encode mathematics in titles, abstracts, and references. Crossref appears to accept both TeX and MathML in abstracts and titles, but their examples use MathML. If we encode math as LaTeX, then we have to put it in an XML CDATA section, because it may contain characters that are problematic in XML, namely <, >, and &. There is a Python converter from LaTeX to MathML, latex2mathml, but running it on an entire title produces junk, because it treats every character as math. For example, convert('This is text') produces

<math xmlns="http://www.w3.org/1998/Math/MathML" display="inline"><mrow><mi>T</mi><mi>h</mi><mi>i</mi><mi>s</mi><mi>i</mi><mi>s</mi><mi>t</mi><mi>e</mi><mi>x</mi><mi>t</mi></mrow></math>

One option is to detect only the parts of the title and abstract that are in math mode and convert those to MathML. Alternatively, we can encode the math parts as LaTeX in CDATA sections. I looked at what other publishers do, and I found some bad behavior:

  1. Springer encodes mathematics in titles inside $$ ... $$ instead of $ ... $. See this example
  2. Cambridge seems to report abstracts in MathML, but their titles are sometimes truncated when they include mathematics. See https://api.crossref.org/works/10.1017/jpr.2022.54 and https://www.cambridge.org/core/journals/journal-of-applied-probability/article/abs/boseeinstein-condensation-for-particles-with-repulsive-shortrange-pair-interactions-in-a-poisson-random-external-potential-in-mathbbrd/23F5CC18E910D51740206EFFE3181030
  3. Elsevier seems to report titles in MathML. They do not report abstracts.
  4. AMS reports abstracts in MathML.
  5. London Mathematical Society has the title in TeX and the abstract in MathML.
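For the first option (converting only the math-mode parts), here is a minimal sketch of how the splitting might look. It assumes math is delimited only by unescaped single dollar signs; `$$ ... $$`, `\( ... \)`, and nested cases are not handled. The names `split_math` and `render_title` are hypothetical, not part of latex-submit, and the converter is passed in as a parameter so the sketch does not depend on any particular library.

```python
import re
from typing import Callable, List, Tuple
from xml.sax.saxutils import escape

# Assumption: inline math is delimited by unescaped single dollar signs only.
_INLINE_MATH = re.compile(r'(?<!\\)\$(.+?)(?<!\\)\$', re.DOTALL)


def split_math(text: str) -> List[Tuple[str, str]]:
    """Split text into ('text', s) and ('math', s) segments, in order."""
    segments = []
    pos = 0
    for m in _INLINE_MATH.finditer(text):
        if m.start() > pos:
            segments.append(('text', text[pos:m.start()]))
        # Keep only the math body, without the dollar signs.
        segments.append(('math', m.group(1)))
        pos = m.end()
    if pos < len(text):
        segments.append(('text', text[pos:]))
    return segments


def render_title(text: str, convert_math: Callable[[str], str]) -> str:
    """XML-escape plain text and run only the math segments through convert_math."""
    return ''.join(
        convert_math(s) if kind == 'math' else escape(s)
        for kind, s in split_math(text)
    )
```

If latex2mathml is installed, its `convert` function could presumably be passed as `convert_math`, so that plain text such as 'This is text' is never fed to the converter.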

The safest approach is perhaps to encode the math sections as LaTeX in CDATA. Unfortunately, Python's ElementTree does not directly support CDATA output. See this.
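One possible workaround, sketched here as an assumption rather than a decision, is to build the affected elements with the standard library's `xml.dom.minidom` instead of ElementTree, since minidom has `createCDATASection`. The function name `title_with_cdata` and the element name `title` are illustrative only. Because minidom refuses to serialize a CDATA section containing `]]>`, the sketch splits the text so that the sequence never appears inside a single section.

```python
from xml.dom.minidom import Document


def title_with_cdata(latex_title: str) -> str:
    """Serialize a <title> element whose content is wrapped in CDATA.

    Hypothetical helper: the element name 'title' is illustrative.
    """
    doc = Document()
    title = doc.createElement('title')
    doc.appendChild(title)

    # "]]>" terminates a CDATA section, so split there and put "]]" at the
    # end of one section and ">" at the start of the next.
    parts = latex_title.split(']]>')
    last = len(parts) - 1
    for i, part in enumerate(parts):
        chunk = ('>' if i > 0 else '') + part + (']]' if i < last else '')
        title.appendChild(doc.createCDATASection(chunk))

    return title.toxml()
```

For example, `title_with_cdata('Bounds for $a<b$ & more')` leaves `<` and `&` unescaped inside the CDATA section, and parsing the result back with ElementTree recovers the original string.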

Whatever we do, this will require extensive testing.