brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Need Content Math test cases #225

Open brucemiller opened 13 years ago

brucemiller commented 13 years ago

[Originally Ticket 1549]

Although the parsing engine isn't yet powerful enough to generate a sufficiently semantic internal representation that can be converted to Content MathML (or OpenMath), there are applications of latexml (with extra declarations, etc.) that probably can get there. For those applications, and in preparation for pushing the parsing further, getting the internal representation rich enough and getting the conversion right are important.

A good strategy would be to work up a test file with math markup (TeX ideally, but if that's not expressive enough, at least experiment with hand-writing the kind of XMath that would be necessary) that generates at least simple examples of every kind of Content MathML we expect to cover. Then process it and see what it produces. If it's wrong (and initially it often will be), modify MathML.pm to fix it.
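As a rough illustration (the file name and the particular formulas here are made up, not taken from the issue), such a test file could start with a handful of elementary constructs and grow from there; running it through latexml and latexmlpost with Content MathML output enabled and inspecting the result would show which rules in MathML.pm need attention:

% testcases-cmml.tex -- hypothetical starter test file for Content MathML coverage
\documentclass{article}
\begin{document}
% arithmetic and relations
\( a + b - c \), \( a \cdot b / c \), \( a < b \leq c \)
% function application and powers
\( f(x) \), \( \sin(x) + \cos^2(x) \), \( x^{n+1} \)
% fractions, roots, big operators
\( \frac{a}{b} \), \( \sqrt{x^2 + 1} \),
\( \sum_{i=1}^{n} i^2 \), \( \int_0^1 f(t)\,dt \)
\end{document}

Each line is meant to exercise one small family of content constructs, so a wrong conversion points fairly directly at the responsible handler.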

dginev commented 13 years ago

I am very interested in both the approach and the goal, and I think I can devote time to it this summer and incorporate parts of the results into my thesis.

Let's talk about this in further detail when you arrive.

dginev commented 10 years ago

A Marpa-based take on my MSc thesis, together with a stab at a Content MathML test case, can be found at: https://github.com/dginev/LaTeXML-Plugin-MathSyntax

The big issues are:

I would ask for discussion on that very topic: can canonical Content MathML exist, and what would the best practices be?

kohlhase commented 10 years ago

Deyan,

A first set of content dictionaries can be seen in the SMGloM we are starting to write. An interesting aspect of this is the availability of notation definitions, which (conceptually) correspond to specialized, context-dependent grammar rules. Even though the SMGloM is far from perfect, and even farther from complete, I think it might be a good resource for your endeavor.
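To illustrate the correspondence (this is plain LaTeX, not actual SMGloM/sTeX syntax, and the macro names are invented for the example): a notation definition pairs a symbol with the way it is written, which is exactly the information a parser needs as a grammar rule.

\documentclass{article}
\usepackage{amsmath}
% Illustration only: plain LaTeX macros standing in for notation
% definitions; SMGloM/sTeX has its own, richer syntax for this.
% Each macro name identifies a symbol (the content side), while its
% body fixes the notation (the presentation side).
\newcommand{\binomial}[2]{\binom{#1}{#2}}          % binomial coefficient
\newcommand{\innerprod}[2]{\langle #1, #2 \rangle} % inner product
\begin{document}
\( \binomial{n}{k} \), \( \innerprod{u}{v} \)
\end{document}

Read as a grammar rule, the second definition says that the pattern \langle A , B \rangle should reduce to an application of the inner-product symbol to A and B; a collection of such definitions is, in effect, a specialized grammar for the notation of one particular domain.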

dginev commented 10 years ago

Dear Michael,

I would be happy to follow a pointer to the resource you're advertising and see if I can give a second life to your notations as test cases for math parsing. I don't know how to get my hands on the notations at the moment.

But the need for canonical conventions is something we will have to address, ideally early on, if we want a real shot at comparing against gold standards.

As I experiment with annotating the DLMF formulas with Content MathML, I see two competing pressures. One is to produce Content MathML that is as precise and formal as possible, e.g. to the point that the formula becomes machine-verifiable in a CAS. The other is to stay as close as possible to the actual input written by the author.

Ellipsis (...) is a basic example of this issue: the real deep semantics are very different from the written form of, say, an algebraic progression. It gets even stranger when ellipsis is used to introduce a sequence of variable names for later use, e.g. \alpha_1 \ldots \alpha_n.
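To make the two readings concrete (the snippets below are ordinary LaTeX; the content readings in the comments are only sketches of what an annotator might intend):

\documentclass{article}
\begin{document}
% (1) Ellipsis abbreviating a determinate expression: the intended
%     content reading is a single sum from 1 to n, not a chain of
%     literal additions with a "dots" token in the middle.
\( 1 + 2 + \cdots + n \)
% (2) Ellipsis introducing a family of variable names for later use:
%     here the dots carry no arithmetic meaning at all; the content
%     reading is closer to "the sequence alpha_1 through alpha_n".
\( \alpha_1 \ldots \alpha_n \)
\end{document}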

And the eliding dots are just one of many notations that do not directly correspond to a content form (or that have more than one possible formalization). So introducing rules for canonical Content MathML seems like a best practice worth investing in.

My current thinking is that canonical forms do not need to be enforced on the annotator (or sTeX author), but can instead be machine-computed in, e.g., the sTeX conversion or the grammar Testing API.