brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Make --includestyles the default #622

Open brucemiller opened 9 years ago

brucemiller commented 9 years ago

Eventually, we ought to just attempt to read any and all tex, sty and (to some extent) even cls files. Gradually our infrastructure gets better and we might succeed more often than not. It seems that the lack of reading such style files, especially local user macros, confuses some users who don't see (or interpret) the --includestyles option.

Certainly most user macro files would be sensible, since it's typically the stuff that gets put in the preamble anyway. Style files that are simply styling, provided they get processed without error, would end up essentially ignored. Genuine enhancement style files that define new markup may be trickier; there would be no semantics associated with the new markup. But they probably would fail less badly than they do now! Class files are the trickiest since they typically define new frontmatter markup which you'd prefer not to ignore. But if unknown class files default to loading the OmniClass as an underlying backup, maybe they'd fail less badly too.

There's a cascading problem to not reading in the style & class files: they often require other packages that we won't be noticing, and we sometimes do have bindings for those!

It seems that there are two things we'd want to do

Mainly, I'm just making some noise to see what kind of feedback and ideas I get! :>

dginev commented 9 years ago

Some time back we had a brief discussion in this direction and I think we all supported this as the logical way forward. I am also much in favour of including raw TeX styles by default, and even completely removing the --includestyles option.

kohlhase commented 9 years ago

(see also #630) If we do this, then we should also have some good way of making partial bindings that read the style file and then overdefine the macros we want to be semantic.

dginev commented 8 years ago

My comment from 6 months ago forgot to mention that I am excited to drop the --includestyles option only if loading and using TeX styles doesn't pose a significant increase in runtime.

For example natively interpreting tikz could lead to a 10x slowdown of a latexml pass over a document, and being unable to opt out of that is a very bad usability restriction. If runtime wasn't an issue, I would be quite happy to always try and load the styles.

dginev commented 6 years ago

Recording an experiment I did today while doing arXiv debugging with Bruce.

The screenshot shows the regular conversion (left) compared to the conversion with --includestyles on, i.e. with native interpretation of the texlive packages enabled (right). The data is from the arXiv documents from February 2018.

[Screenshot: includestyles-compared]

Bruce remarked that the success rate is better than he initially expected, and there may be some interesting low-hanging fruit (a lot of Fatals relate to etoolbox, for example). We need some improved error-reporting messages to be able to figure out which native packages cause the most trouble, but it's definitely a curious way forward.

You can compare the two reports at the following fragile link (i.e. expect it to break in the long run, it's an active development site): https://corpora.mathweb.org/corpus/%2Fdata%2Farxiv_1802%2F/

dginev commented 4 years ago

I should mention here that there has been progress in slowly getting to full raw interpretation, as we are now running arXiv with all local styles included, i.e. custom styles that were submitted along with the paper source. The option for that is slightly different: we do a --preload[localrawstyles]latexml.sty. And it consistently improves the conversion rate, which is encouraging.

dginev commented 3 years ago

After all the digging, I finally stumbled on an arXiv source that does better with --includestyles than it does without. In particular math/0101267 passes with no_problems when its class file is interpreted raw. This case is still the exception, as we're missing a variety of TeX/LaTeX internals that make complex .cls loads fail, but it raises the question of whether we can "pick-and-choose" the interpretation based on outcomes.

I.e. one could imagine a single default behavior which tries to convert a document in steps:

  1. try with only .ltxml bindings, and stop if successful (no errors)
  2. else, try with bindings and raw local styles, and stop if successful (no errors)
  3. else, try with all raw files allowed, and stop if successful (no errors)
  4. else, examine the status reports of the three runs we just made and return the quantitatively "best" run (fewest errors/fatals).

The fallback in step 4 is open to discussion; different options make sense for different use cases. What would be valuable is not missing the cases where --includestyles actually does work, so that we maximize the possible success latexml could have over an unknown/unknowable corpus.
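To make the stepwise idea concrete, here is a minimal Python sketch of the control flow only. It is not how latexml works today, and nothing in it calls latexml: each stage is a hypothetical callable that returns a `(fatals, errors)` pair, and `RunReport` and `convert_in_stages` are illustrative names invented for this sketch. The tie-breaking rule in the fallback (fewest fatals first, then fewest errors, earliest stage wins ties) is likewise an assumption, one of several that could make sense.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class RunReport:
    """Outcome of one conversion attempt (hypothetical structure)."""
    stage: str
    fatals: int
    errors: int

    @property
    def clean(self) -> bool:
        # "Successful" here means no errors and no fatals were reported.
        return self.fatals == 0 and self.errors == 0


# A stage is a name plus a callable returning (fatals, errors).
Stage = Tuple[str, Callable[[], Tuple[int, int]]]


def convert_in_stages(stages: Sequence[Stage]) -> RunReport:
    """Run stages in order; return the first clean run, else the best one."""
    reports: List[RunReport] = []
    for name, run in stages:
        fatals, errors = run()
        report = RunReport(name, fatals, errors)
        if report.clean:
            return report  # steps 1-3: stop at the first success
        reports.append(report)
    if not reports:
        raise ValueError("at least one stage is required")
    # Step 4 (assumed ranking): fewest fatals, then fewest errors.
    # min() keeps the earliest stage when two runs tie.
    return min(reports, key=lambda r: (r.fatals, r.errors))


# Made-up outcomes for illustration; real runs would parse latexml's status.
demo_stages: List[Stage] = [
    ("bindings-only", lambda: (1, 4)),
    ("bindings+local-raw", lambda: (0, 2)),
    ("all-raw", lambda: (0, 7)),
]
best = convert_in_stages(demo_stages)
print(best.stage)  # prints "bindings+local-raw"
```

Because later stages only run when earlier ones report problems, a document that converts cleanly with bindings alone never pays the runtime cost of raw interpretation.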

This type of suggestion could also be a nice intermediate hand-holding step towards enabling raw interpretation at all times. But just brainstorming really.