geolexica / geolexica-server

Generalized backend for Geolexica sites

Convert mathematical formulae to MathML #108

Open skalee opened 4 years ago

skalee commented 4 years ago

Some concepts may contain mathematical symbols and formulas in their designations, descriptions, or notes. Formulas can be expressed in LaTeX math, AsciiMath, or MathML. It is also preferred that concepts follow the AsciiDoc STEM syntax with the stem, asciimath, and latexmath macros.

Available converters

There are some programs which come in handy:

AsciiMath gem

A handy gem which converts AsciiMath to MathML. Asciidoctor relies on it when processing stem macros (as an optional dependency). It does the job pretty well, but it does not convert LaTeX math strings. There is no corresponding gem for LaTeX math.

LaTeXML

A toolset for processing LaTeX documents. Most importantly, it contains the latexmlmath program, which converts LaTeX math formulas to MathML. Sadly, this program fails to recognize some symbols, e.g. \backepsilon. Perhaps this can be fixed with proper configuration.

Example: latexmlmath '\sqrt{b^2-4ac}'

Pandoc

Pandoc is capable of converting LaTeX math to MathML, though the formula must be wrapped in a Markdown document. We can craft a minimal Markdown document and then extract the MathML formula from the generated HTML.

Example: echo '$$\sqrt{b^2-4ac}$$' | pandoc --mathml -f markdown -t html
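The extraction step could look roughly like the Ruby sketch below. It assumes Pandoc emits a single math element inside the generated HTML. The helper name extract_mathml and the sample HTML are illustrative only, and a regular expression stands in for a proper HTML parser.

```ruby
# Hypothetical helper: pulls the first <math>...</math> element out of an HTML
# fragment, such as the one Pandoc prints for a one-formula Markdown document.
# Returns nil when the fragment contains no <math> element.
def extract_mathml(html)
  html[%r{<math\b.*?</math>}m]
end

# Illustrative input only; real Pandoc output is more verbose.
sample_html = '<p><math display="block"><msqrt><mi>x</mi></msqrt></math></p>'
puts extract_mathml(sample_html)
```

The non-greedy match stops at the first closing tag, which is sufficient as long as each generated document holds exactly one formula.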

MathJax

MathJax converts both AsciiMath and LaTeX math to MathML. It is designed primarily to run in the browser, but works in Node.js too. The problem is that it is poorly documented, and API docs are non-existent. There are some usage examples in https://github.com/mathjax/MathJax-demos-node which present working solutions. The following two snippets use programs from that repository:

Example: node -r esm component/tex2mml \\sqrt{b^2-4ac} (LaTeX math -> MathML)

Example: node -r esm component/am2mml 'sqrt(b^2-4ac)' (AsciiMath -> MathML)

Performance considerations

Running a separate program for each formula on the site may slow down site generation. LaTeXML, Pandoc, and MathJax have been benchmarked with hyperfine:

hyperfine -m 100 'latexmlmath \\sqrt{b^2-4ac}'
Benchmark #1: latexmlmath \\sqrt{b^2-4ac}
  Time (mean ± σ):      1.504 s ±  0.022 s    [User: 1.383 s, System: 0.108 s]
  Range (min … max):    1.481 s …  1.597 s    100 runs

hyperfine -m 100 'echo \$\$\\sqrt{b^2-4ac}\$\$ | pandoc  --mathml -f markdown -t html'
Benchmark #1: echo \$\$\\sqrt{b^2-4ac}\$\$ | pandoc  --mathml -f markdown -t html
  Time (mean ± σ):      39.9 ms ±   1.7 ms    [User: 12.5 ms, System: 15.7 ms]
  Range (min … max):    37.2 ms …  53.6 ms    100 runs

hyperfine -m 100 'node -r esm component/tex2mml \\sqrt{b^2-4ac}'
Benchmark #1: node -r esm component/tex2mml \\sqrt{b^2-4ac}
  Time (mean ± σ):     646.5 ms ±  14.6 ms    [User: 626.3 ms, System: 90.1 ms]
  Range (min … max):   627.1 ms … 713.7 ms    100 runs
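
To put these means in perspective, the Ruby sketch below scales them to a whole build. The figure of 1,000 formulas is purely hypothetical (the number of formulas on a real site is not stated here), and the estimate assumes one uncached process launch per formula.

```ruby
# Mean time per invocation in seconds, copied from the hyperfine runs above.
MEAN_SECONDS = {
  "LaTeXML" => 1.504,
  "Pandoc"  => 0.0399,
  "MathJax" => 0.6465,
}.freeze

# Estimated total conversion time when every formula spawns a new process.
def estimated_build_seconds(mean_seconds, formula_count)
  mean_seconds * formula_count
end

formula_count = 1_000 # hypothetical site size, chosen only for illustration
MEAN_SECONDS.each do |tool, mean|
  total = estimated_build_seconds(mean, formula_count)
  puts format("%-8s %8.1f s", tool, total)
end
```

Under that assumption LaTeXML would add roughly 25 minutes to a build, MathJax roughly 11 minutes, and Pandoc about 40 seconds.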

Integration considerations

We can call any of these programs from Ruby by creating a subshell. However, launching a new process per formula will be very time-consuming for MathJax (about 0.65 s per call) and especially for LaTeXML (about 1.5 s per call).
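
A minimal sketch of such a call, using Open3 from Ruby's standard library. The helper names run_converter, latexml_to_mathml, and pandoc_to_html are made up for illustration; the command lines are the ones from the examples above, and the converters are assumed to be installed and on PATH.

```ruby
require "open3"

# Hypothetical helper: runs an external program with its arguments passed as
# an argv list (so the formula needs no shell quoting) and returns its stdout.
def run_converter(*argv, stdin_data: "")
  stdout, stderr, status = Open3.capture3(*argv, stdin_data: stdin_data)
  raise "#{argv.first} failed: #{stderr.strip}" unless status.success?

  stdout
end

# Same invocation as `latexmlmath '\sqrt{b^2-4ac}'`.
def latexml_to_mathml(formula)
  run_converter("latexmlmath", formula)
end

# Same invocation as the Pandoc example; the result is HTML that still wraps
# the <math> element.
def pandoc_to_html(formula)
  run_converter("pandoc", "--mathml", "-f", "markdown", "-t", "html",
                stdin_data: "$$#{formula}$$\n")
end
```

Raising on a non-zero exit status keeps a failed conversion from silently ending up as an empty formula on the generated site.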

Final considerations

We would love to integrate LaTeXML, as we take part in its development; however, it seems to be the most difficult of all the above. We would need to turn it into a gem and resolve the issues with unrecognized symbols. Perhaps in the long run… unless we have a gem already?

ronaldtse commented 4 years ago

LaTeXML probably can be turned into a gem with native extensions, but this requires some work.

You can do this, and if it works, it will be useful in Metanorma as well.

Metanorma uses a LaTeXML installation provided separately via package managers. In the Docker image it uses CPAN, in other situations the Snap package and the Chocolatey package.

ronaldtse commented 4 years ago

@skalee for LaTeX math, ONLY LaTeXML is deterministically accurate and correct (i.e. it always arrives at the correct structure), even though it is slower than others. It is also necessary to use the same processor being used in Metanorma because the terminology site software is part of our standardization suite.

skalee commented 4 years ago

Okay, these are strong arguments. I'll experiment with LaTeXML then.

Regarding bridging LaTeXML as a native extension: initially I thought that LaTeXML was written in C, but now I see it's in Perl. This makes everything difficult. Resources on the topic are scarce, if there are any. We're literally entering uncharted waters and I doubt we'll succeed, especially since I don't know Perl at all. Nevertheless, I'll be happy to try. (update: this is very old, but looks promising: ruby-perl)

However, we can still call LaTeXML from a subshell, and we can avoid repeated calls by caching the results. This should improve performance greatly, especially if we use a disk cache in order to persist the results between builds. At the moment I'm pretty convinced we'll end up with subshell calls.
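
A rough sketch of such a disk cache, using only Ruby's standard library. The class name FormulaCache, the one-file-per-formula layout, and the .mml extension are assumptions made for illustration; the actual converter call is supplied by the caller as a block.

```ruby
require "digest"
require "fileutils"

# Hypothetical disk cache: stores each conversion result in its own file,
# named after the SHA-256 digest of the source formula.
class FormulaCache
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  # Returns the cached MathML for `formula`, or runs the block once,
  # stores its result on disk, and returns it.
  def fetch(formula)
    path = File.join(@dir, "#{Digest::SHA256.hexdigest(formula)}.mml")
    return File.read(path) if File.exist?(path)

    mathml = yield(formula)
    File.write(path, mathml)
    mathml
  end
end
```

A caller would wrap the slow conversion in the block, e.g. cache.fetch(formula) { |f| convert(f) }, where convert is whatever launches latexmlmath. Note that the key covers only the formula text, so the cache directory would need clearing after a converter upgrade.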

Having said that, I still don't know what to do with missing entities like \backepsilon. The following formula is taken directly from concept 259 "isomorphism".

latexmlmath '[A,B \textit{ isomorphic}] \Leftrightarrow [\exists f : A \rightarrow B, g : B \rightarrow A \backepsilon f \circ g = Id_A, g \circ f = Id_B]'

On my computer, it ends with one error (Error:undefined:\backepsilon The token T_CS[\backepsilon] is not defined) and one warning (Warning:not_parsed:UNKNOWN.ATOM.CLOSE>METARELOP MathParser failed to match rule 'Anything'). The produced MathML is as follows (note the merror element):

<?xml version="1.0" encoding="UTF-8"?>
<math xmlns="http://www.w3.org/1998/Math/MathML" alttext="[A,B\textit{ isomorphic}]\Leftrightarrow[\exists f:A\rightarrow B,g:B%&#10;\rightarrow A\backepsilon f\circ g=Id_{A},g\circ f=Id_{B}]" display="block">
  <mrow>
    <mrow>
      <mo stretchy="false">[</mo>
      <mi>A</mi>
      <mo>,</mo>
      <mi>B</mi>
      <mtext mathvariant="italic"> isomorphic</mtext>
      <mo stretchy="false">]</mo>
    </mrow>
    <mo>⇔</mo>
    <mrow>
      <mo stretchy="false">[</mo>
      <mo>∃</mo>
      <mi>f</mi>
      <mo>:</mo>
      <mi>A</mi>
      <mo>→</mo>
      <mi>B</mi>
      <mo>,</mo>
      <mi>g</mi>
      <mo>:</mo>
      <mi>B</mi>
      <mo>→</mo>
      <mi>A</mi>
      <merror class="ltx_ERROR undefined undefined">
        <mtext>\backepsilon</mtext>
      </merror>
      <mi>f</mi>
      <mo>∘</mo>
      <mi>g</mi>
      <mo>=</mo>
      <mi>I</mi>
      <msub>
        <mi>d</mi>
        <mi>A</mi>
      </msub>
      <mo>,</mo>
      <mi>g</mi>
      <mo>∘</mo>
      <mi>f</mi>
      <mo>=</mo>
      <mi>I</mi>
      <msub>
        <mi>d</mi>
        <mi>B</mi>
      </msub>
      <mo stretchy="false">]</mo>
    </mrow>
  </mrow>
</math>

You can copy-paste it into the MathJax demo.
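
Since latexmlmath still produces MathML in this case, a build could check the output for merror elements before publishing it. The Ruby sketch below is one way to do that; the helper name undefined_tokens is hypothetical, and a regular expression is used instead of an XML parser purely to keep the example dependency-free.

```ruby
# Hypothetical helper: lists the tokens that LaTeXML wrapped in <merror>
# elements, i.e. the macros it could not interpret.
def undefined_tokens(mathml)
  mathml.scan(%r{<merror\b[^>]*>\s*<mtext>(.*?)</mtext>\s*</merror>}m).flatten
end

# Fragment taken from the output above.
fragment = <<~'XML'
  <mi>A</mi>
  <merror class="ltx_ERROR undefined undefined">
    <mtext>\backepsilon</mtext>
  </merror>
  <mi>f</mi>
XML

p undefined_tokens(fragment)
```

An empty array means no merror element was found; anything else names the macros that still need a definition.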

skalee commented 4 years ago

@ronaldtse I still have trouble with LaTeXML. Does anyone know how to fix the error produced by the following command (Error:undefined:\backepsilon)?

latexmlmath '[A,B \textit{ isomorphic}] \Leftrightarrow [\exists f : A \rightarrow B, g : B \rightarrow A \backepsilon f \circ g = Id_A, g \circ f = Id_B]'

ronaldtse commented 4 years ago

@skalee Please check the usage of latexmlmath in the metanorma gem. \backepsilon is recognized there.

ronaldtse commented 2 years ago

We now have the plurimath gem that can do all of the above conversions. Thanks @suleman-uzair!