Open brucemiller opened 9 years ago
Some time back we had a brief discussion in this direction and I think we all supported this as the logical way forward. I am also much in favour of including raw TeX styles by default, and even completely removing the --includestyles option.
(see also #630) If we do this, then we should also have some good way of making partial bindings that read the style file and then overdefine the macros we want to have semantic.
My comment from 6 months ago forgot to mention that I am excited to drop the --includestyles option only if loading and using TeX styles doesn't pose a significant increase in runtime.
For example, natively interpreting tikz could lead to a 10x slowdown of a latexml pass over a document, and being unable to opt out of that is a very bad usability restriction. If runtime weren't an issue, I would be quite happy to always try and load the styles.
Recording an experiment I did today while doing arXiv debugging with Bruce.
The screenshot shows the regular conversion (left) compared to the conversion with --includestyles on, i.e. with native interpretation of the texlive packages enabled (right). The data is from the arXiv documents from February 2018.
Bruce remarked the success rate is better than he initially expected, and there may be interesting low-hanging fruit (a lot of Fatals relate to etoolbox, for example). We need some improved error-reporting messages to be able to figure out which native packages cause the most trouble, but it's definitely a curious way forward.
You can compare the two reports at the following fragile link (i.e. expect it to break in the long run, it's an active development site): https://corpora.mathweb.org/corpus/%2Fdata%2Farxiv_1802%2F/
I should mention here that there has been progress in slowly getting to full raw interpretation, as we are now running arXiv with all local styles included, i.e. custom styles that were submitted along with the paper source. The option for that is slightly different: we do a --preload[localrawstyles]latexml.sty. And it consistently improves the conversion rate, which is encouraging.
After all the digging, I finally stumbled on an arXiv source that does better with --includestyles than it does without. In particular math/0101267 passes with no_problems when its class file is interpreted raw. This case is still the exception, as we're missing a variety of tex/latex internals that make complex .cls loads fail, but it raises the question of whether we can "pick-and-choose" the interpretation based on outcomes.
I.e. one could imagine a single default behavior which tries to convert a document in steps:
.ltxml bindings, and stop if successful (no errors)
Where the fallback case 4. can be discussed; various options make sense for various use cases. What would be valuable is not missing the cases where --includestyles actually does work, so that we maximize the possible success latexml could have over an unknown/unknowable corpus.
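To make the "try in steps, stop at the first clean result" idea concrete, here is a rough Python sketch. It is not how latexml works today. The names `Outcome` and `convert_with_fallback` are made up for illustration, each strategy is just a callable standing in for one conversion attempt (bindings only, raw styles, and so on), and the fallback shown here (keep the attempt with the fewest errors) is only one of the options that could be discussed.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Outcome:
    """Result of one conversion attempt (hypothetical structure)."""
    strategy: str   # label for the attempt, e.g. "bindings" or "raw"
    errors: int     # number of errors reported by the attempt
    output: str     # converted document, or whatever was produced


def convert_with_fallback(
    source: str,
    strategies: List[Callable[[str], Outcome]],
) -> Optional[Outcome]:
    """Run each strategy in order and stop at the first error-free result.

    If no attempt is clean, fall back to the attempt with the fewest
    errors. That fallback is an assumption made for this sketch.
    """
    best: Optional[Outcome] = None
    for run in strategies:
        outcome = run(source)
        if outcome.errors == 0:
            return outcome  # success: later strategies are never run
        if best is None or outcome.errors < best.errors:
            best = outcome
    return best
```

The point of ordering the strategies is cost: the cheap attempt runs first, and the expensive raw interpretation is only paid for when the cheap one reports errors.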
This type of suggestion could also be a nice intermediate hand-holding step towards enabling raw interpretation at all times. But just brainstorming really.
Eventually, we ought to just attempt to read any and all tex, sty and (to some extent) even cls files. Gradually our infrastructure gets better and we might succeed more often than not. It seems that the lack of reading such style files, especially local user macros, confuses some users who don't see (or interpret) the --includestyles option.
Certainly most user macro files would be sensible, since it's typically the stuff that gets put in the preamble anyway. Style files that are simply styling, provided they get processed w/o error, would end up essentially ignored. Genuine enhancement style files that define new markup may be trickier; there would be no semantics associated with the new markup. But they probably would fail less badly than they do now! Class files are the trickiest, since they typically define new frontmatter markup which you'd prefer not to ignore. But if unknown class files default to loading the OmniClass as an underlying backup, maybe they'd fail less badly too.
There's a cascading problem to not reading in the style & class files: they often require other packages that we won't be noticing, and we sometimes do have bindings for those!
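As a rough illustration of that cascading problem, the Python sketch below scans the text of a style file for the packages it requests and reports which of them already have bindings. It is only a sketch under loud assumptions: `required_packages` and `packages_with_bindings` are hypothetical helpers, the set of known bindings is supplied by the caller, and a regex scan like this misses anything loaded conditionally or through macros, which is exactly why actually reading the file is the better answer.

```python
import re
from typing import Iterable, List

# Matches \RequirePackage[opts]{a,b} and \usepackage[opts]{a,b}.
_REQUIRE = re.compile(
    r"\\(?:RequirePackage|usepackage)\s*(?:\[[^\]]*\])?\s*\{([^}]*)\}"
)
# Strips TeX comments: an unescaped % up to the end of the line.
_COMMENT = re.compile(r"(?<!\\)%.*")


def required_packages(style_source: str) -> List[str]:
    """Return package names requested in the given style source, in order."""
    text = _COMMENT.sub("", style_source)
    names: List[str] = []
    for match in _REQUIRE.finditer(text):
        for name in match.group(1).split(","):
            name = name.strip()
            if name and name not in names:
                names.append(name)
    return names


def packages_with_bindings(
    style_source: str, known_bindings: Iterable[str]
) -> List[str]:
    """Requested packages for which a binding is known (caller-supplied)."""
    known = set(known_bindings)
    return [name for name in required_packages(style_source) if name in known]
```

For instance, a style file containing `\RequirePackage{amsmath, graphicx}` checked against a known set of `{"amsmath"}` would report `amsmath` as a dependency we currently miss by skipping the file.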
It seems that there are two things we'd want to do
Mainly, I'm just making some noise to see what kind of feedback and ideas I get! :>