KWARC / LaTeXML

LaTeXML is a TeX and LaTeX to XML translator.

use OMFOREIGN instead of OMSTR for mixed markup (\text{...}) #1

Open kohlhase opened 9 years ago

kohlhase commented 9 years ago

It is great that we have a good treatment of embedded text (amstext-style) for OpenMath now. But it uses the wrong encoding. If I convert

\documentclass{article}
\usepackage{amstext}
\begin{document}
$\{\text{$\phi$ and $\psi$}\}$
\end{document}

I get

    <om:OMOBJ>
      <om:OMA>
        <om:OMS cd="latexml" name="set"/>
        <om:OMSTR xml:id="XM1">
          <om:OMV name="ϕ" xml:id="XM1a"/> and <om:OMV name="ψ" xml:id="XM1b"/>
        </om:OMSTR>
      </om:OMA>
    </om:OMOBJ>

But according to the OpenMath 2 standard, OMSTR is for strings, not for recursing into other content. For that we need OMFOREIGN (which is only allowed in OMATTR). So I would like to have instead:

    <om:OMOBJ>
      <om:OMA>
        <om:OMS cd="latexml" name="set"/>
        <om:OMATTR>
          <om:OMATP>
            <om:OMS cd="OMDoc" name="verbalizes"/>
            <om:FOREIGN encoding="mtext" xml:id="XM1">
              <om:OMV name="ϕ" xml:id="XM1a"/> and <om:OMV name="ψ" xml:id="XM1b"/>
            </om:FOREIGN>
          </om:OMATP>
          <om:OMS cd="OMDoc" name="infObj"/>
        </om:OMATTR>
      </om:OMA>
    </om:OMOBJ>
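
The rule that OMSTR may hold only character data can be checked mechanically. The following is a minimal sketch in Python (standard library only), not part of LaTeXML; the function name `omstr_with_markup` is made up for illustration, and the `xmlns:om` declaration is added to the reported output so that it parses on its own.

```python
import xml.etree.ElementTree as ET

OM_NS = "http://www.openmath.org/OpenMath"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def omstr_with_markup(xml_text):
    """Return the xml:id (or None) of every OMSTR that has child elements."""
    root = ET.fromstring(xml_text)
    # len(element) counts child elements; a valid OMSTR has none.
    return [s.get(XML_ID) for s in root.iter(f"{{{OM_NS}}}OMSTR") if len(s) > 0]

reported = """
<om:OMOBJ xmlns:om="http://www.openmath.org/OpenMath">
  <om:OMA>
    <om:OMS cd="latexml" name="set"/>
    <om:OMSTR xml:id="XM1">
      <om:OMV name="ϕ" xml:id="XM1a"/> and <om:OMV name="ψ" xml:id="XM1b"/>
    </om:OMSTR>
  </om:OMA>
</om:OMOBJ>
"""

print(omstr_with_markup(reported))  # ['XM1']
```

An empty list would mean that every OMSTR in the object contains plain text only.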

@m-iancu:fyi

kohlhase commented 9 years ago

Deyan, I think that the change I want above is quite small, but I do not understand the code enough. I think that we have to do some more construction of the elements in line 175. Could you please do that? I would be very grateful.

kohlhase commented 9 years ago

HMMMM, if I think about this, I think that we want to have OMOBJ elements in the inferior Math. Concretely, instead of

 <om:FOREIGN encoding="mtext" xml:id="XM1">
  <om:OMV name="ϕ" xml:id="XM1a"/> and <om:OMV name="ψ" xml:id="XM1b"/>
</om:FOREIGN>

I think we should have

 <om:FOREIGN encoding="mtext" xml:id="XM1">
  <om:OMOBJ><om:OMV name="ϕ" xml:id="XM1a"/></om:OMOBJ>
   and 
   <om:OMOBJ><om:OMV name="ψ" xml:id="XM1b"/></om:OMOBJ>
</om:FOREIGN>
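
The difference between the two variants is only that each embedded formula gets its own OMOBJ while the surrounding text stays where it is. A minimal Python sketch of that rewrite, using the standard library rather than LaTeXML's Perl API; `wrap_children_in_omobj` is a hypothetical helper name and the xml:id attributes are left out for brevity:

```python
import xml.etree.ElementTree as ET

OM_NS = "http://www.openmath.org/OpenMath"
OMOBJ = f"{{{OM_NS}}}OMOBJ"
ET.register_namespace("om", OM_NS)

def wrap_children_in_omobj(foreign):
    """Wrap every direct child element of `foreign` in its own om:OMOBJ."""
    for index, child in enumerate(list(foreign)):
        if child.tag == OMOBJ:
            continue  # already wrapped, leave it alone
        wrapper = ET.Element(OMOBJ)
        wrapper.tail = child.tail  # text after the formula stays in the FOREIGN
        child.tail = None
        foreign.remove(child)
        wrapper.append(child)
        foreign.insert(index, wrapper)
    return foreign

foreign = ET.fromstring(
    '<om:FOREIGN xmlns:om="http://www.openmath.org/OpenMath" encoding="mtext">'
    '<om:OMV name="ϕ"/> and <om:OMV name="ψ"/>'
    '</om:FOREIGN>'
)
print(ET.tostring(wrap_children_in_omobj(foreign), encoding="unicode"))
```

The text " and " is kept as the tail of the first wrapper, so the mixed content reads the same after the rewrite.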
kohlhase commented 9 years ago

Oh, I probably should say that we can only make progress on the glossary presentation (things are quite ugly there) once this is fixed :-)

kohlhase commented 9 years ago

Assigning to @angerhang; I think that he can do this now (and it has high priority). @dginev, it would be very good if you could help him where necessary.

angerhang commented 9 years ago

Currently I am unable to produce the same output here, as I also encountered #3.

However, the error seems to vanish if we only have one formula:

$\{\text{$\phi$}\}$

or

$\{\text{$\psi$}\}$
kohlhase commented 9 years ago

I am not sure what you mean by this. I guess that the "vanish" is a description of #3, so put the observation there.

kohlhase commented 9 years ago

Well, if you can test with that, then by all means test with that, as long as we get a resolution on this issue (and on #3).

angerhang commented 9 years ago

What I experienced earlier is that when I used the code snippet @kohlhase provided, I got error messages like this:

Fatal:perl:die Perl died
Postprocessing LaTeXML::Post::OpenMath test
Can't locate object method "getAttribute" via package "XML::LibXML::Text" at /opt/local/lib/perl5/site_perl/5.16.3/LaTeXML/Post.pm line 454.
make: *** [test.omdoc] Error 255
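
The message says that getAttribute was called on a text node, which has no attributes. In mixed content such as "formula, text, formula", the children of the parent are a mix of element and text nodes, so any loop over them has to tell the two apart. The sketch below shows the same class of failure in Python's xml.dom.minidom; it is an analogy only, not the Perl code in Post.pm, and whether this is exactly what happens at line 454 is an assumption.

```python
from xml.dom import minidom

doc = minidom.parseString('<text><var name="phi"/> and <var name="psi"/></text>')

def names_unguarded(parent):
    # Fails on the " and " text node, which has no getAttribute method.
    return [node.getAttribute("name") for node in parent.childNodes]

def names_guarded(parent):
    # Only element nodes carry attributes, so skip everything else.
    return [
        node.getAttribute("name")
        for node in parent.childNodes
        if node.nodeType == node.ELEMENT_NODE
    ]

text = doc.documentElement
print(names_guarded(text))  # ['phi', 'psi']
try:
    names_unguarded(text)
except AttributeError as err:
    print("unguarded call failed:", err)
```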
angerhang commented 9 years ago

But when we have only one symbol/formula:

\documentclass{article}
\usepackage{amstext}
\begin{document}
$\{\text{$\phi$ }\}$
\end{document}  

And the output is:

<om:OMOBJ>
    <om:OMA>
        <om:OMS cd="latexml" name="set"/>
        <om:OMSTR>
            <om:OMV name="ϕ"/>
        </om:OMSTR>
    </om:OMA>
</om:OMOBJ>

Although the result lacks IDs, it is still suitable for testing.

brucemiller commented 9 years ago

I seem to get notifications from these issues, too. The markup seems odd to expect sensible OM from, but it does seem to generate particularly nonsensical OM. Actually, the current LaTeXML produces something different from Michael's original report, but still wrong.

I've made some patches, I don't think it's yet quite right, but maybe I'll go ahead and commit it. But I also wonder about the two OMDoc symbols, verbalizes and infObj, which don't seem too sensible in a generic OpenMath context.... or?

kohlhase commented 9 years ago

Dear Bruce, I fear that we are referring to the KWARC fork of LaTeXML with this.

kohlhase commented 9 years ago

Bruce, I have no idea why you are receiving issues on this, but while you are, let me answer, ...

But I also wonder about the two OMDoc symbols, verbalizes and infObj, which don't seem too sensible in a generic OpenMath context.... or?

You are right, they are a bit weird. But I need two symbols to make things valid OpenMath, and since we currently only use this in the OMDoc context I just used that. But I can also make a CD for that and publish it. If someone makes the changes, I volunteer to make the CD and get it published by the OMSoc.

angerhang commented 9 years ago

Finally I managed to get the output close to what we want. The TeX file is:

\documentclass{article}
\usepackage{amstext}
\begin{document}
$\{\text{$\phi$ }\}$
\end{document}  

And the output is:

    <om:OMOBJ>
        <om:OMA>
            <om:OMS cd="latexml" name="set"/>
            <om:OMTTR>
                <om:OMATP>
                    <om:OMS cd="OMdoc" name="verbalizes"/>
                    <om:FOREIGN encoding="mtext">
                       <om:OMOBJ><om:OMV name="ϕ"/></om:OMOBJ>
                    </om:FOREIGN>
                </om:OMATP>
            </om:OMTTR>
        </om:OMA>
    </om:OMOBJ>

We can only test on multiple symbols after #3 is addressed.

But I really got stuck on how to get something like:

               </om:FOREIGN>
            </om:OMATP>
        <om:OMS cd="OMDoc" name="infObj"/>
    </om:OMATTR>
</om:OMA>

Since all the tags are auto-closed, how can we tell LaTeXML to insert something before the </om:OMATTR> tag?

@dginev do you think you can point out where to look? It would be really helpful :)

dginev commented 9 years ago

I am not following closely the examples, but here is the general setup:

If you are generating math from sTeX, you need to be examining the intermediate XML format, before post-processing takes place. I am unsure if lmh has a target for that, but it should if it is to support sTeX development.

In some cases you would be producing the right intermediate math, but the post-processing to OpenMath would fail. That is possible, as LaTeXML's OM support is limited and we have several KWARC extensions (all experimental) that could be tripping up. For that you need to improve on the OpenMath.pm post-processor in KWARC/LaTeXML.

brucemiller commented 9 years ago

Whether or not I'm supposed to be here, the given example should "work" w/o sTeX, and the core LaTeXML is producing nonsense. I've made a partial patch to deal with XMText with nested math; it's not quite right, probably, since it puts the entire XMath tree under there, but it's a start. I just copied your OMDoc symbols --- probably nobody will notice for now :> Ultimately, it seems like more candidates for an "underspecified" or "ambiguous" CD.

dginev commented 9 years ago

If you have time to be here, does that mean you have time to merge my pull requests in the LaTeXML repo? :> It's great to have you in the sTeX-related discussions!

Your commit looks exactly spot on, I just need to figure out why you left the conversion of the nested elements commented out. It would seem that would be nice to have turned on, so that the nested math gets converted too?

brucemiller commented 9 years ago

No, I only have time to do the easy stuff! :> Seriously, that was next on my list, and why I didn't get the nesting completely right (see comment under the commit).

dginev commented 9 years ago

Ok, so I merged it and uncommented the OM conversion of nested elements. All: If math-in-text breaks in sTeX, please report it here so that Bruce can fix it :>

kohlhase commented 9 years ago

I have updated and now get

    <om:OMOBJ xmlns:om="http://www.openmath.org/OpenMath">
      <om:OMA>
        <om:OMS cd="latexml" name="set"/>
        <om:OMATTR>
          <om:OMATP cd="OMDoc" name="verbalizes">
            <om:FOREIGN encoding="mtext">
              <om:OMSTR>&#x3D5;&#x3D5;</om:OMSTR>
              and
              <om:OMSTR>&#x3C8;&#x3C8;</om:OMSTR>
            </om:FOREIGN>
          </om:OMATP>
          <om:OMS cd="OMDoc" name="infObj"/>
        </om:OMATTR>
      </om:OMA>
    </om:OMOBJ>

which is much better, but still not what I want. Instead of the strings,

 <om:OMSTR>&#x3D5;&#x3D5;</om:OMSTR>

I want to have

<om:OMOBJ><om:OMV name="ϕ" xml:id="XM1a"/></om:OMOBJ>
kohlhase commented 9 years ago

BTW, I am not sure that I am actually using the right version here. The log file says

latexmlc (LaTeXML version 0.8.1 (KWARC fork); revision 1d295a9)
processing started Thu Apr 16 19:24:35 2015

and https://github.com/KWARC/LaTeXML gives the revision 1d295a9e30fd24aaf918888fdd4589a921dbef35 but I guess that is only truncated.

dginev commented 9 years ago

Yes, you're using the correct version. Will look into this in a second, today is a good day for LaTeXML work.

kohlhase commented 9 years ago

Oh, and I should also say, obviously, Perl does not die, which is very good.

dginev commented 9 years ago

@brucemiller Actually, nested ltx:Math elements have always confused me and still confuse me today. Especially in a post-processing conversion with parallel math formats.

It seems right now we do the parallel post-processing for nested nodes as well, which I don't understand - it seems wrong, because we get a multiplying out of the representations. Shouldn't our math selectors in LaTeXML::Post::toProcess be:

  return $doc->findnodes('//ltx:Math[not(ancestor::ltx:Math)]'); }

so that the nested math elements are only reachable via traversing the DOM while rewriting? At the moment we have the top level process calls executed also on the nested math blocks, which seems confusing and possibly damaging.
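
The effect of the `[not(ancestor::ltx:Math)]` predicate can be tried out on a small document. The sketch below is Python rather than Perl, and because the standard-library ElementTree does not support the `ancestor` axis, it gets the same result by simply not descending into a math element once one is found. The function name `top_level_math` and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

LTX_NS = "http://dlmf.nist.gov/LaTeXML"
MATH = f"{{{LTX_NS}}}Math"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def top_level_math(root):
    """Return the ltx:Math elements that have no ltx:Math ancestor."""
    found = []

    def walk(node):
        if node.tag == MATH:
            found.append(node)
            return  # nested math is reached later, while rewriting this one
        for child in node:
            walk(child)

    walk(root)
    return found

sample = ET.fromstring("""
<ltx:document xmlns:ltx="http://dlmf.nist.gov/LaTeXML">
  <ltx:Math xml:id="m1">
    <ltx:XMath><ltx:XMText><ltx:Math xml:id="m2"/></ltx:XMText></ltx:XMath>
  </ltx:Math>
  <ltx:Math xml:id="m3"/>
</ltx:document>
""")

print([m.get(XML_ID) for m in top_level_math(sample)])  # ['m1', 'm3']
```

The nested element `m2` is not selected, so a top-level processing pass would touch it only through its parent `m1`.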

What I am thinking of is:

brucemiller commented 9 years ago

Well, Deyan uncommented the code which generates OM elements, but without the containing OMOBJ. The code I had submitted creates the OMOBJ, but with an extra Math around it. That doesn't seem quite right, but is righter; after all, in general, the content of the OMFOREIGN will be LaTeXML markup, although in this case it's simply a string.

kohlhase commented 9 years ago

I am not sure I understand the perl-level comments here, but let me say that I indeed want to have the containing OMOBJ in there. It just seems more correct. I also think that the MathML you are generating should have elements around the math, which (I think) it currently does not. So in summary, I think that the correct output for our example should be

    <om:OMOBJ>
      <om:OMA>
        <om:OMS cd="latexml" name="set"/>
        <om:OMATTR>
          <om:OMATP>
            <om:OMS cd="OMDoc" name="verbalizes"/>
            <om:FOREIGN encoding="mtext" xml:id="XM1">
              <om:OMOBJ>
                <om:OMV name="ϕ" xml:id="XM1a"/>
              </om:OMOBJ>
              and
              <om:OMOBJ>
                <om:OMV name="ψ" xml:id="XM1b"/>
              </om:OMOBJ>
            </om:FOREIGN>
          </om:OMATP>
          <om:OMS cd="OMDoc" name="infObj"/>
        </om:OMATTR>
      </om:OMA>
    </om:OMOBJ>

and correspondingly for the MathML.

brucemiller commented 9 years ago

You do need the OMOBJ in there, but I'm not so clear that that is all you need. om:FOREIGN says that the markup is not OM, but doesn't really say, here, what it is. Initially, it's going to be LaTeXML! To see this, wrap the "and" in your original example with \textbf, for example.

If you process that with latexml from my repo, you'll get something somewhat more legal (the OMOBJ is wrapped in an ltx:Math). But I'm not so clear that is ideal either: at the very least, it's mangled any potential parallel markup; and the XSLT doesn't yet know that it has to recurse into om:FOREIGN.

BTW & FWIW: cmml is suffering similar issues for this case.

kohlhase commented 9 years ago

Yes, I agree with all you say. The fact that om:OMFOREIGN does not say what it is means to me that we are free to put in there whatever we want. The example \text{$\phi$} Hang was using was degenerate, I do not think that someone in their right mind would write that in real texts.

Indeed, I ''want'' the content of om:OMFOREIGN to be LTXML, and then I want to convert it to the same format that the overall document has. In the case of HTML5, I want the content of the OMFOREIGN to be HTML5; in the case of the sTeX conversion, I want the content of the om:OMFOREIGN to be OMDoc (the inline fragment of OMDoc really, but that is beside our point currently).

So if I understand you correctly, then the main holdup is that XSLT recurses? I think that should be relatively simple, right?

kohlhase commented 9 years ago

Ah, I think I slowly understand what you are saying. If I want to have XSLT recurse I should do that in my own post-processing stylesheet. You are of course right. But does the OpenMath postprocessing recurse into OMFOREIGN now? That would be the prerequisite.

kohlhase commented 9 years ago

So I guess all I want is to have the OpenMath post-processor return me something of the form

<om:FOREIGN encoding="mtext">
  <ltx:Math><om:OMOBJ>....</ltx:Math> and <ltx:Math><om:OMOBJ>....</ltx:Math>
</om:FOREIGN>

From that I can recurse no problem. But currently, I am calling it with

latexmlc --openmath --format=xml --post --output tost.xml --log tost.pdflog tost.tex

I am getting

<om:FOREIGN encoding="mtext"><om:OMSTR/> and <om:OMSTR/></om:FOREIGN>
brucemiller commented 9 years ago

I think, from what Deyan said, that the kwarc fork is using the "wrong" code here. You'll get what I proposed using my fork of LaTeXML.

kohlhase commented 9 years ago

Oh, how wonderful, I finally understand, and indeed it does just what you claim. Now I can fix the postprocessing. @dginev would you please do the merge into the KWARC branch soon? That would be great. A great thanks to both of you. @bruce, we should probably also let the XSLT recurse into the MathML version of foreign.

kohlhase commented 9 years ago

Ah, I just see that my sTeX stylesheet already recurses, so if we merge your changes into the KWARC branch, the issue is fixed. Hooray!

brucemiller commented 9 years ago

Well, it's sorta fixed, but I'm not so sure it's what we really want. If you generate html w/pmml+om, for example, you get a top-level semantics, with pmml, then OM, but within the OM is an om:FOREIGN with html(?) and its own m:semantics (w/pmml & OM). You might expect html + pure OM within the FOREIGN?? Or maybe not?? I'm not sure.

kohlhase commented 9 years ago

Hmmm, you are right. But I think that it is reasonable to expect the same (i.e. html w/pmml+om) at all levels. If you do not want that, just postprocess some more :-).

kohlhase commented 9 years ago

@dginev I looked at the two versions of the openmath postprocessor, and I do not feel qualified to merge :-(. I am afraid this is on you.

dginev commented 9 years ago

well, it's sorta fixed, but I'm not so sure it's what we really want.

Same here. I really don't want to see the parallel processors recursing into nested math elements and multiplying themselves out. If you have:

$1 \text{ then $2 \text{ then $3$}}$

With pmml, cmml and the tex annotations, we will have four identical parallel math elements for the innermost math, each with the 3 formats. That gets exponentially slow/redundant as the nesting deepens.

I would really like to come up with a solution that never does parallel processing in nested ltx:Math elements, then we can move forward.
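
The count of four for the three-level example can be reproduced with a small model. This is a sketch under one assumption that the thread does not spell out: only the pmml and cmml branches contain the nested math again, while the tex annotation is a flat string. Each math element then reappears once per recursing branch of its parent. The function names are made up for illustration.

```python
def copies_of_innermost(depth: int, recursing_formats: int) -> int:
    """Copies of the innermost math when each recursing branch repeats its children."""
    return recursing_formats ** (depth - 1)

def total_math_elements(depth: int, recursing_formats: int) -> int:
    """Math elements over all nesting levels: 1 + r + r^2 + ... + r^(depth-1)."""
    return sum(recursing_formats ** level for level in range(depth))

# Three nesting levels ($1 ... $2 ... $3$), two recursing formats (pmml, cmml).
print(copies_of_innermost(3, 2))  # 4, the "four identical" innermost elements
print(total_math_elements(3, 2))  # 7 math elements in total, each with 3 formats
```

With top-level-only processing, the same input would yield one math element per nesting level, that is 3 instead of 7.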

dginev commented 9 years ago

For me the only sane technique here is that once you start rewriting XMath into OpenMath, you consistently do so from the top-down. If anyone needs a different representation - that is fine, Bruce has added a wonderful cross-referencing scheme to that end. The tricky bit with my suggestion may be making sure that the cross-refs will work out nicely, not sure if there is anything iffy about the nested math elements. Ideally not.

brucemiller commented 9 years ago

Great; we're on the same wavelength. And in fact, that's why I left the commented-out code in there as a hint for how it perhaps ought to be handled correctly. I have an idea to try out...

brucemiller commented 9 years ago

I think I've got this sorted out; it works along the lines of what Deyan suggested. Needed to process only top-level math at the main mathprocessor level, and a helper method to recurse into XMText and generate the ltx markup, hopefully w/o id clashes, and transform the nested math. Seems to work correctly for cmml & om.
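
The general shape of such a top-down conversion can be sketched as follows. This is a toy model in Python, not the Perl code that was committed: the function names are hypothetical, only XMTok and XMText are handled, ids and the surrounding set application are ignored, and the FOREIGN/verbalizes/infObj wrapping is copied from the target output discussed earlier in this thread.

```python
import copy
import xml.etree.ElementTree as ET

LTX = "{http://dlmf.nist.gov/LaTeXML}"
OM = "{http://www.openmath.org/OpenMath}"

def convert_math(math):
    """Convert one ltx:Math into an om:OMOBJ (toy rules only)."""
    omobj = ET.Element(OM + "OMOBJ")
    for node in math.find(LTX + "XMath"):
        omobj.append(convert_node(node))
    return omobj

def convert_node(node):
    if node.tag == LTX + "XMTok":
        return ET.Element(OM + "OMV", name=node.text or "")
    if node.tag == LTX + "XMText":
        return convert_text(node)
    raise ValueError(f"unsupported node: {node.tag}")

def convert_text(xmtext):
    """Helper that recurses into text: nested math is converted exactly once."""
    attr = ET.Element(OM + "OMATTR")
    atp = ET.SubElement(attr, OM + "OMATP")
    ET.SubElement(atp, OM + "OMS", cd="OMDoc", name="verbalizes")
    foreign = ET.SubElement(atp, OM + "FOREIGN", encoding="mtext")
    foreign.text = xmtext.text
    for child in xmtext:
        if child.tag == LTX + "Math":
            converted = convert_math(child)  # nested math, handled here only
        else:
            converted = copy.deepcopy(child)  # other ltx markup is kept as-is
        converted.tail = child.tail
        foreign.append(converted)
    ET.SubElement(attr, OM + "OMS", cd="OMDoc", name="infObj")
    return attr

source = ET.fromstring(
    '<ltx:Math xmlns:ltx="http://dlmf.nist.gov/LaTeXML"><ltx:XMath><ltx:XMText>'
    '<ltx:Math><ltx:XMath><ltx:XMTok>ϕ</ltx:XMTok></ltx:XMath></ltx:Math>'
    ' and '
    '<ltx:Math><ltx:XMath><ltx:XMTok>ψ</ltx:XMTok></ltx:XMath></ltx:Math>'
    '</ltx:XMText></ltx:XMath></ltx:Math>'
)
result = convert_math(source)
print([v.get("name") for v in result.iter(OM + "OMV")])  # ['ϕ', 'ψ']
```

Only `convert_math(source)` is called from the outside; the two inner formulas are reached through `convert_text`, so no math element is processed twice.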

A warning, however: if you get HTML5 nested within OM within a semantics in HTML5, the HTML5 parser makes hash out of it. I think that's to be expected, however unfortunate.

Hope this merges well with your sTeX stuff.

dginev commented 9 years ago

Merged, tested and it looks perfect! Thanks a lot Bruce, that is exactly the type of enhancement I was thinking of, but I was lacking the oversight to see the fine details involved. I will study your exact changes later during the week, it's quite educational.

@kohlhase please test, if things behave well on your end we should be able to close here.

angerhang commented 9 years ago

Dear all, it turned out I was/am really naive about LaTeXML: when I tried to solve this issue in the first place, I was only faking the appearance of the output, and didn't dig into the underlying infrastructure.

Thanks for all these educational comments. I will study them closely this week :-)

dginev commented 9 years ago

@angerhang You should keep in mind that LaTeXML is an iceberg of an application. You can glide along the basic interface with relative ease, but there are 20,000+ lines of Perl code lurking under, and you should always ask us for help if you end up swimming in the deep, at least until you feel comfortable in there.

Of course the real complexity comes from reimplementing the interpreter for one of the most convoluted Turing complete programming languages ever invented :> The joys of TeX.