simple math tags should render as unicode or mathphrase

AllenDowney commented 12 years ago

Plastex currently dispatches all math to be imaged. Instead, we would like simple math to be translated into DocBook, and complex math either handed off to an imager or rendered in MathML.

So the logic we want is something like

1) Try to render math as DocBook.

2) If (1) fails, try to render as MathML.

3) If (2) fails, hand it off to the imager.

That way we can start with simple versions of (1) and (2) and gradually add capability.
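The three steps above could be sketched as a chain of renderers that each either return markup or decline. This is only an illustration: `render_math`, `docbook_renderer`, `mathml_renderer` and `imager` are hypothetical names, not part of plasTeX, and the "simple math" rule is a toy assumption.

```python
from typing import Callable, Optional, Sequence

# A renderer takes LaTeX source and returns markup, or None if it cannot cope.
Renderer = Callable[[str], Optional[str]]


def render_math(latex: str, renderers: Sequence[Renderer],
                imager: Callable[[str], str]) -> str:
    """Try each renderer in order; hand off to the imager if all decline."""
    for render in renderers:
        result = render(latex)
        if result is not None:
            return result
    return imager(latex)


def docbook_renderer(latex: str) -> Optional[str]:
    # Toy rule (assumption): only a single ASCII letter counts as "simple".
    if len(latex) == 1 and latex.isascii() and latex.isalpha():
        return "<mathphrase>%s</mathphrase>" % latex
    return None


def mathml_renderer(latex: str) -> Optional[str]:
    # Placeholder: handles nothing yet, so everything else falls through.
    return None


def imager(latex: str) -> str:
    # Stand-in for the existing image pipeline; the marker is made up.
    return "IMAGE(%s)" % latex


print(render_math("x", [docbook_renderer, mathml_renderer], imager))
print(render_math(r"\sqrt{2}", [docbook_renderer, mathml_renderer], imager))
```

Because each renderer only has to return None for input it does not understand, capability can be added to steps (1) and (2) without touching the dispatch loop.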

The question is: where do we put this logic? I am having a hard time finding a hook or a similar example.

Examples:

$x$ gets parsed as

<math id="a0000000580">x</math>

and should be rendered as

<mathphrase>x</mathphrase>

$\mu$ gets parsed as

<math id="a0000000583"><mu id="a0000000584"/></math>

and should be rendered as a unicode mu character.

It looks like Base/LaTeX/Math.py already knows the unicode for a lot of math.
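A rough sketch of the conversion for the two examples above, working on the parsed XML with the standard library. The `GREEK` table is a small hand-written stand-in (an assumption; the real mapping would come from Base/LaTeX/Math.py), `to_mathphrase` is a hypothetical helper, and wrapping the unicode character in a mathphrase is also an assumption.

```python
import xml.etree.ElementTree as ET

# Assumed subset of symbols; the full table would come from plasTeX itself.
GREEK = {"alpha": "\u03b1", "beta": "\u03b2", "mu": "\u03bc", "pi": "\u03c0"}


def to_mathphrase(xml_source):
    """Return a <mathphrase> string for simple math, or None if too complex."""
    math = ET.fromstring(xml_source)
    if math.tag != "math":
        return None
    parts = [math.text or ""]
    for child in math:
        # Only empty elements naming a known symbol count as simple.
        if child.tag not in GREEK or len(child) or (child.text or "").strip():
            return None
        parts.append(GREEK[child.tag])
        parts.append(child.tail or "")
    phrase = ET.Element("mathphrase")
    phrase.text = "".join(parts)
    return ET.tostring(phrase, encoding="unicode")


print(to_mathphrase('<math id="a0000000580">x</math>'))
print(to_mathphrase('<math id="a0000000583"><mu id="a0000000584"/></math>'))
```

Returning None for anything unrecognized leaves the decision about MathML or imaging to the caller.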

tiarno commented 12 years ago

I think leaving the textobject in the equation with role='tex' and containing the latex for the math will work. Then, in postprocessing, you can do something like this:

for elem in inlinemath:
    parent = elem.getparent()
    if parent.tag == '{http://docbook.org/ns/docbook}inlineequation':
        # xns maps the 'd' prefix to the DocBook namespace
        textelem = elem.find('d:textobject[@role="tex"]/d:phrase', namespaces=xns)
        if textelem is not None and textelem.text:
            parent.replace(elem, get_mathml(textelem.text))

where the function get_mathml(text) receives the LaTeX string and returns the MathML. I have a very slow version of this now that calls MathToWeb in a Python subprocess for each equation.

AllenDowney commented 12 years ago

I can do that for development, but if I render math during postprocessing, I've missed the chance to fall back on the imager for math I can't handle.

Any way I can get this logic into the first rendering pass?

Allen


tiarno commented 12 years ago

That's the beauty of it. You already have the images created from the first pass. You just skip the element replacement if get_mathml returns an error (or anything that doesn't validate as MathML).
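A minimal sketch of that skip-on-failure pass, using the standard library rather than lxml. Several things here are assumptions: the element holding the image and the role="tex" textobject is taken to be an inlinemediaobject, `get_mathml` is a toy stand-in that only understands a single letter (a real one would call an external converter), and `upgrade_equations` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

DB = "http://docbook.org/ns/docbook"
MML = "http://www.w3.org/1998/Math/MathML"
NS = {"d": DB}


def get_mathml(latex):
    """Toy converter (assumption): returns a MathML element or None."""
    if len(latex) == 1 and latex.isalpha():
        math = ET.Element("{%s}math" % MML)
        ET.SubElement(math, "{%s}mi" % MML).text = latex
        return math
    return None


def upgrade_equations(root):
    """Swap image-based equations for MathML where conversion succeeds."""
    replaced = 0
    # Materialize the list first because the tree is modified in the loop.
    for eq in list(root.iter("{%s}inlineequation" % DB)):
        media = eq.find("d:inlinemediaobject", NS)
        if media is None:
            continue
        phrase = media.find('d:textobject[@role="tex"]/d:phrase', NS)
        if phrase is None or not phrase.text:
            continue
        mathml = get_mathml(phrase.text.strip("$ "))
        if mathml is None:
            continue  # conversion failed: keep the image from the first pass
        mathml.tail = media.tail
        index = list(eq).index(media)
        eq.remove(media)
        eq.insert(index, mathml)
        replaced += 1
    return replaced
```

Equations the converter cannot handle are left untouched, so the image produced in the first pass remains the fallback.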

AllenDowney commented 12 years ago

I dunno. It still seems like we're replicating effort. Plastex has already parsed the latex, and it knows the unicode for the math symbols, etc. We just need to render it.

If we pass raw latex through to a post-processor, we have to parse it again and replicate all the information that's already in Base/LaTeX/Math.py.

One possibility is to add it to TreeCleaner. I could look for math tags that are easy to convert, then transform them into a mathphrase tag that gets rendered pretty much literally.

Do you think we can get Kevin Smith to weigh in?

tiarno commented 12 years ago

Well, you're right of course. Rendering the math twice really isn't the best way to go about it. I think the best thing would be to capture the math during parsing, send it to an external process to get mathml out of it, and insert that mathml element into the DOM. I don't know how to do that, so that's why I was thinking of my clunkier process.

I'm not sure this situation is analogous to one I saw a few years ago, but here is what happened. We went to great efforts to render items in math using html markup if possible, and if not, then we created an image. We were proud. Then some writer was comparing two cases in which the result was 2^r (rendered in html) versus \sqrt{2} (rendered as an image). It looked really bad: the two things were sitting next to each other, the reader was invited to compare the values, and they looked very different because of the rendering. So we ended up throwing out the decision-making code we were so proud of, and the writers were very happy because all the math looked the same.

That cautionary tale may not apply in this situation of unicode vs mathml; I'm no expert on how those get rendered, but it sounds like it might be a similar thing. If this was my personal project I would render all the math the same way, using mathml with png images as a fallback. How to get the mathml is another problem. Using Kevin's method of writing the images.tex file and calling an external program to write external files would also be workable for docbook--the mathml files (one equation per file) could be xincluded.

just my thoughts at the end of the day.
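The XInclude idea mentioned above might look roughly like this. It is only a sketch: the file name eq0001.mml, the in-memory `FILES` dictionary standing in for files on disk, and the helper names are all made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

DB = "http://docbook.org/ns/docbook"
XI = "http://www.w3.org/2001/XInclude"
MML = "http://www.w3.org/1998/Math/MathML"

# Stand-in for one-equation-per-file output written by an external program.
FILES = {
    "eq0001.mml": '<math xmlns="%s"><mi>x</mi></math>' % MML,
}


def equation_with_include(href):
    """Build an inlineequation that pulls its MathML from another file."""
    eq = ET.Element("{%s}inlineequation" % DB)
    ET.SubElement(eq, "{%s}include" % XI, href=href)
    return eq


def memory_loader(href, parse, encoding=None):
    # Loader hook for ElementInclude; reads from FILES instead of disk.
    return ET.fromstring(FILES[href])


eq = equation_with_include("eq0001.mml")
ElementInclude.include(eq, loader=memory_loader)  # resolves xi:include in place
print(ET.tostring(eq, encoding="unicode"))
```

In a real toolchain the include would normally be resolved by the DocBook processor, not in Python; the loader here only shows what the resolved tree would contain.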

AllenDowney commented 12 years ago

Thanks, that's all very helpful. Yes, we will have the problem that different math gets rendered differently, but I think the O'Reilly folks have a plan to make it all look great downstream!

Allen


AllenDowney commented 12 years ago

Ok, I've done two experiments:

1) Replacing some math tags with mathphrase tags in the TreeCleaner. This is now working for simple math phrases (but not Greek letters).

2) Doing this replacement in the digest method of MathEnvironment in Math.py. This turns out to be more difficult because it happens during parsing, and some of the tree manipulation methods I use in TreeCleaner are not obviously available there.

So I am proceeding with the TreeCleaner approach for now.

AllenDowney commented 12 years ago

Looks like handling this in TreeCleaner is the way to go.

AllenDowney / plastex-oreilly

simple math tags should render as unicode or mathphrase #11