brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

add a unicode-math binding #1460

Open xworld21 opened 3 years ago

xworld21 commented 3 years ago

Best explained with an example:

\[ a × b \]

is compiled to

<XMath>
  <XMApp>
    <XMTok meaning="times" role="MULOP">⁢</XMTok>
    <XMTok font="italic" role="UNKNOWN">a</XMTok>
    <XMTok role="UNKNOWN">×</XMTok>
    <XMTok font="italic" role="UNKNOWN">b</XMTok>
  </XMApp>
</XMath>

instead of

<XMath>
  <XMApp>
    <XMTok meaning="times" role="MULOP">×</XMTok>
    <XMTok font="italic" role="UNKNOWN">a</XMTok>
    <XMTok font="italic" role="UNKNOWN">b</XMTok>
  </XMApp>
</XMath>

and then gets rendered all wrong in the browser. I see the same for delimiters too.

I don't suppose there are many other people using Unicode for this purpose, but... it would be nice if LaTeXML recognised Unicode operators (and delimiters, relations, punctuation, etc. [1]), if not all of the current ones, at least the ones it is able to output on its own.

[1] https://www.unicode.org/Public/math/revision-15/MathClassEx-15.html
[2] http://mirror.ctan.org/macros/latex/contrib/unicode-math/unimath-symbols.pdf
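
For reference, here is a minimal Python sketch of what the Unicode Character Database alone can say about such characters. It uses only the standard `unicodedata` module. The sample list and the helper names `describe` and `is_math_symbol` are illustrative assumptions, not anything from LaTeXML.

```python
import unicodedata

# Illustrative sample only; this is not a list taken from LaTeXML.
# "\u2062" is INVISIBLE TIMES, the character LaTeXML emits for implied products.
SAMPLES = ["×", "⊕", "=", "(", ")", ",", "a", "\u2062"]


def describe(ch):
    """Return (code point, general category, Unicode name) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch, "<unnamed>"),
    )


def is_math_symbol(ch):
    """True when the general category is Sm (Symbol, math)."""
    return unicodedata.category(ch) == "Sm"


if __name__ == "__main__":
    for ch in SAMPLES:
        print(*describe(ch), is_math_symbol(ch))
```

The general category is much coarser than the math classes listed in [1]: operators such as × and relations such as = both come out as `Sm`, so a role like MULOP cannot be derived from it alone.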

dginev commented 3 years ago

Great catch, we should patch this right away - thanks!

xworld21 commented 3 years ago

@dginev it turns out that #1462 works and fixes my use case (I have lots of \DeclareUnicodeCharacter declarations).

I realised that LaTeXML is in a weird spot regarding Unicode, because it does not behave like any of the other engines. Normal LaTeX will error out on Unicode characters that are neither defined by inputenc nor declared via \DeclareUnicodeCharacter. LuaTeX will silently ignore Unicode mathematical symbols unless you load the unicode-math package, in which case it works as you would expect (I just tested this with ⊕ inside an equation).

What's the general strategy regarding compatibility with the other engines?

dginev commented 3 years ago

Generally latexml tries to retain, and do the best possible treatment, for any Unicode it encounters. So it's closer to xelatex than latex for Unicode specifically, at least in intention.

brucemiller commented 2 years ago

So, if I understood correctly, this one is fixed?

dginev commented 2 years ago

No? Not started actually.

latexmlc 'literal:a × b' --whatsin=math  --whatsout=math --pmml --cmml  --dest=test.html
<math id="p1.m1" class="ltx_Math" alttext="a\texttimes b" display="inline">
  <semantics>
    <mrow id="p1.m1.5" xref="p1.m1.5.cmml">
      <mi id="p1.m1.1" xref="p1.m1.1.cmml">a</mi>
      <mo id="p1.m1.2" xref="p1.m1.2.cmml">⁢</mo>
      <mi mathvariant="normal" id="p1.m1.3" xref="p1.m1.3.cmml">×</mi>
      <mo id="p1.m1.2a" xref="p1.m1.2.cmml">⁢</mo>
      <mi id="p1.m1.4" xref="p1.m1.4.cmml">b</mi>
    </mrow>
    <annotation-xml encoding="MathML-Content">
      <apply id="p1.m1.5.cmml" xref="p1.m1.5">
        <times id="p1.m1.2.cmml" xref="p1.m1.2"></times>
        <ci id="p1.m1.1.cmml" xref="p1.m1.1">𝑎</ci>
        <ci id="p1.m1.3.cmml" xref="p1.m1.3">×</ci>
        <ci id="p1.m1.4.cmml" xref="p1.m1.4">𝑏</ci>
      </apply>
    </annotation-xml>
  </semantics>
</math>

dginev commented 2 years ago

The general problem may be too large to finish for 0.8.6, but we could at least handle the times case before the release (or such was my plan).

brucemiller commented 2 years ago

I'm a bit uncertain about the wisdom of it, in the sense that for all its specificity Unicode is often quite vague (or even wrong). However, a potential clue for making something that's actually achievable is in something @xworld21 said: the symbols that LaTeXML already knows about.

So, rather than manually trying to come up with a whole long list, we should just make DefMath do double duty: if it is defining a \cs to produce a single Unicode symbol (with various attributes), we should also make the raw Unicode symbol map to that \cs.
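
The "double duty" idea can be sketched as follows. Python is used purely for illustration (LaTeXML itself is written in Perl). `DEFINITIONS`, `build_reverse_map` and `rewrite` are hypothetical names, not LaTeXML API. Only the attributes of the `\times` entry come from this thread; the other entries are made up for the example.

```python
# Hypothetical stand-in for what DefMath-style definitions would register:
# (control sequence, presentation string, attributes).
DEFINITIONS = [
    ("\\times", "×", {"role": "MULOP", "meaning": "times"}),  # attributes as in this thread
    ("\\oplus", "⊕", {"role": "ADDOP"}),                      # illustrative attributes
    ("\\sin", "sin", {"role": "OPFUNCTION"}),                 # multi-character: no reverse entry
    ("\\texttimes", "×", {}),                                 # later duplicate: ignored
]


def build_reverse_map(definitions):
    """Map each single non-ASCII presentation character to its control sequence.

    Assumption: the first definition wins, so a later definition cannot
    silently change what an already-mapped raw character means.
    """
    reverse = {}
    for cs, presentation, _attrs in definitions:
        if len(presentation) == 1 and not presentation.isascii():
            reverse.setdefault(presentation, cs)
    return reverse


def rewrite(source, reverse):
    """Replace mapped raw characters with their control sequence plus a space."""
    return "".join(reverse[ch] + " " if ch in reverse else ch for ch in source)


if __name__ == "__main__":
    reverse = build_reverse_map(DEFINITIONS)
    print(rewrite("a × b", reverse))
```

Restricting the map to single non-ASCII characters keeps ordinary letters and ASCII operators out of it. Whether the first or the last definition should win is a design choice this sketch does not settle.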

brucemiller commented 2 years ago

Such quick agreement should be its own red flag! :>

The basic implementation is likely easy, but there are some caveats to watch out for, such as compatibility and previous definitions. Consider that pdflatex from TeX Live 2018 gives "Unicode character... not set up for use with LaTeX", whereas with TeX Live 2020 I get "\texttimes invalid in math mode". Apparently the encoding tables are now built in. (fwiw: xelatex silently drops the symbol.)

So, I'd rather not rush a randomly chosen solution without knowing the impact; maybe this easy fix is better bumped to the next Milestone?

dginev commented 2 years ago

If it's not obvious, it's just inevitable we need to shelve it - could be an early addition for the next batch of issues. The release momentum has its own logic to follow :>

And indeed, how much should we emulate TeX, and how much should we improve on it... We generally act like xelatex in that we accept Unicode successfully without special incantations, so we probably should follow its behavior to some degree.

xworld21 commented 2 years ago

> we need to shelve it - could be an early addition for the next batch of issues.

This issue could be converted (or closed and reopened) into a request for a unicode-math binding. That would be very unambiguous and match most users' expectations.