brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

allow latexmlc to take an XML file as input #1496

Open xworld21 opened 3 years ago

xworld21 commented 3 years ago

Maybe there is a way and I have not figured it out, but is it possible to let latexmlc take an XML file as input? In other words, can it work as a replacement for latexmlpost?

xworld21 commented 3 years ago

@dginev I looked at the code, and I might have implemented what I was looking for in about 20 lines. latexml already detects whether the input is a BibTeX file (even when passed as literal:), so adding XML detection the same way was trivial. It's quite a game changer for me if it works correctly: I can tweak the EPUB output without reparsing everything!

My tentative implementation is in this branch, including handling of parsing errors. I have done my best to match the current behaviour, so in principle it behaves well in the client/server scenario. I think the only missing piece is validating the XML document; I thought that would happen in postprocessing, but it doesn't. Also, I don't know if --xmlinput is the right name for the option.
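To illustrate the kind of content sniffing described above, here is a minimal Python sketch. It is not the code from the branch and not LaTeXML's actual detection logic (LaTeXML is written in Perl); the function name `guess_input_kind` and the regular expressions are assumptions made up for illustration. It only looks at the start of the literal source text.

```python
import re


def guess_input_kind(source: str) -> str:
    """Classify literal source text as 'xml', 'bibtex', or 'tex'.

    Hypothetical heuristic for illustration only; it inspects just the
    first non-blank characters of the input.
    """
    # Drop a byte-order mark and leading whitespace before sniffing.
    text = source.lstrip("\ufeff \t\r\n")
    # An XML declaration, comment, doctype, or opening element suggests XML.
    if re.match(r"<\?xml\b|<!--|<!DOCTYPE\b|<[A-Za-z_][\w:.-]*[\s/>]", text):
        return "xml"
    # An entry such as "@article{key," suggests BibTeX.
    if re.match(r"@[A-Za-z]+\s*[{(]", text):
        return "bibtex"
    # Anything else is treated as TeX.
    return "tex"
```

A heuristic like this would misclassify, for example, a BibTeX file that opens with a comment line, which is one reason an explicit option such as --xmlinput is still useful.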

dginev commented 3 years ago

Nice that you got a working prototype so quickly @xworld21 !

A bit of backstory: this capability is one of the two missing pieces before latexmlc is considered worthy of replacing the latexml+latexmlpost combo as the default recommendation, at least as far as we've discussed this with Bruce.

The second capability is being able to output both the final requested --format, as well as the latexml schema XML which you usually see when running latexml-proper.

The vision is that you may start with a very challenging TeX manuscript and convert it to a desired format X, while also saving the internal XML on the way to X. Then, assuming you liked the X you saw, you could quickly reconvert the internal XML to a different format Y. The classic example is tailoring our runs that convert all of arXiv, so that I can produce my main HTML5 output and then quickly reuse the XMLs to also do ePub, TEI, JATS, what have you.
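For comparison, the convert-once, post-process-many workflow is what the existing latexml + latexmlpost pair already allows. The Python sketch below only builds the command lines and does not run anything; the helper name `build_commands`, the file names, and the extension mapping are assumptions for illustration, and it covers only the html5 and xhtml formats of latexmlpost.

```python
from pathlib import Path

# Output extension per latexmlpost format (illustrative subset only).
EXTENSIONS = {"html5": "html", "xhtml": "xhtml"}


def build_commands(tex_file: str, formats: list[str]) -> list[list[str]]:
    """Return argv lists: one latexml run, then one latexmlpost run per format."""
    stem = Path(tex_file).with_suffix("")
    xml_file = f"{stem}.xml"
    # Step 1: the expensive TeX -> XML conversion, done once.
    commands = [["latexml", f"--destination={xml_file}", tex_file]]
    # Step 2: cheap post-processing of the saved XML, once per target format.
    for fmt in formats:
        if fmt not in EXTENSIONS:
            raise ValueError(f"format not covered by this sketch: {fmt}")
        target = f"{stem}.{EXTENSIONS[fmt]}"
        commands.append(
            ["latexmlpost", f"--format={fmt}", f"--destination={target}", xml_file]
        )
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` on a machine with LaTeXML installed. The feature discussed in this issue would let latexmlc take over the second step as well.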

So indeed, important missing feature, and it is half of the story to get latexmlc upgraded to an official executable.

xworld21 commented 3 years ago

> The second capability is being able to output both the final requested --format, as well as the latexml schema XML which you usually see when running latexml-proper.

I see. Indeed, caching the XML result seems like a sensible thing to do, instead of forcing the user to make multiple calls.

> So indeed, important missing feature, and it is half of the story to get latexmlc upgraded to an official executable.

As far as I can tell, there is only one way to implement this half of the story within the current latexmlc, so I'll send a PR with my patch for you to review. It should be equivalent to latexmlpost now, including validation.

I have not tried to make sense of zip archives. In principle, if you pass --xmlinput and a zip file, you may expect latexmlc to search for an XML file instead of TeX, but that would mean passing the --xmlinput flag to unpack_source and changing the heuristic, which is a much bigger change! So for now my implementation is just as broken as running --bibtex on a zip file.
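To make the zip case concrete, here is a rough Python sketch of what "search for an XML file instead of TeX" could look like. It is purely hypothetical: it does not reflect how unpack_source picks the main file, and the function name `find_main_xml` and the ranking rule (shallowest path, then shortest name) are assumptions.

```python
import io
import zipfile
from typing import Optional


def find_main_xml(archive: zipfile.ZipFile) -> Optional[str]:
    """Pick a likely main XML member from an archive, or None if there is none.

    Hypothetical ranking: prefer files closest to the archive root, then
    shorter names, then alphabetical order.
    """
    candidates = [
        name
        for name in archive.namelist()
        if name.lower().endswith(".xml") and not name.endswith("/")
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda name: (name.count("/"), len(name), name))


# Usage: build a small archive in memory and query it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("figs/meta.xml", "<meta/>")
    zf.writestr("paper.xml", "<document/>")
    zf.writestr("paper.tex", "\\documentclass{article}")
with zipfile.ZipFile(buffer) as zf:
    main_xml = find_main_xml(zf)
```

Any real implementation would also need to decide what to do when several top-level XML files are present, which is part of why this is a bigger change.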