Hi Jonathan,
Thank you for looking into ar5iv and starting this discussion.
Some general notes first:
Naturally, we've experienced how overwhelming it can be to deal with a large corpus of TeX, so back at the original start of the arXMLiv project, in ~2007, we created a build system to systematically aggregate latexml's log messages. Anyone can query that system at: https://corpora.mathweb.org/corpus/arxmliv/tex%5Fto%5Fhtml
For downloading the logs of 10,000 examples, assuming you have a list of arXiv ids, this can be done with a simple loop that fetches https://ar5iv.labs.arxiv.org/log/<arxivid>. The ar5iv site is actively crawled by many interested parties at the moment (most recently I was told that SearchOnMath deployed a search index over ar5iv), so you are very welcome to grab any log entries you may need.
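For illustration, a minimal sketch of such a loop, assuming a plain-text file `ids.txt` with one arXiv id per line and the `requests` library installed (both are assumptions, not part of ar5iv itself):

```python
# Sketch: download ar5iv conversion logs for a list of arXiv ids.
# Assumes ids.txt contains one arXiv id per line (e.g. "1910.07940").
import pathlib
import time

import requests

ids = pathlib.Path("ids.txt").read_text().split()
out_dir = pathlib.Path("logs")
out_dir.mkdir(exist_ok=True)

for arxiv_id in ids:
    resp = requests.get(f"https://ar5iv.labs.arxiv.org/log/{arxiv_id}")
    if resp.ok:
        # Filesystem-safe name, e.g. "astro-ph/0001053" -> "astro-ph_0001053.log"
        (out_dir / f"{arxiv_id.replace('/', '_')}.log").write_text(resp.text)
    time.sleep(1)  # be polite to the server
```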
We are always open to monitoring new classes of problems that latexml conversion remains "silent" on today. If there is a pattern that can be automatically recognized during conversion, we can add and emit more messages (in 4 severities: Info/Warning/Error/Fatal), to track them at scale.
I actually find it practical to crowdsource the collection of concrete problems in ar5iv. In recent years there has been a trend where what used to be considered a "large" number for development has subsided to "medium" or even "small". A thousand issues may once have been seen as large, but is now commonplace. Consider the Rust language issues, where we see ~9,000 open issues, ~40,000 closed issues and ~60,000 closed pull requests. I think a modern project backed by a large community should aspire to such numbers - and I am hoping the issues in this ar5iv repository will eventually approach this kind of magnitude.
On the formula example:
Math parsing warnings are best used specifically to improve latexml's MathGrammar module, but often do not translate into visual degradation in the rendered HTML. latexml has a fallback parsing mode (similar, to an extent, to the way MathJax and KaTeX deal with math) which takes over when the grammatical parse fails.
Certainly math parsing still has many inaccuracies and pending upgrades, and concrete reports of encountering them are always appreciated. Some are sufficiently difficult that resolving them in full requires swapping the entire grammar engine for one capable of dealing with ambiguity, which is one branch of ongoing work in latexml.
The aggregated report for math parsing failures (reported via the ALLCAPS grammatical categories of the concrete latexml grammar) is here: https://corpora.mathweb.org/corpus/arxmliv/tex%5Fto%5Fhtml/warning/not%5Fparsed?all=false
As just one example, we've noticed that the most common parsing failure has to do with unbalanced parens, often in frivolous TeX uses such as $(1+$ z $)^3$, found in astro-ph/0001053. There are other causes of an OPEN not finding its CLOSE, but the report tells us latexml fails to match them in 14% of arXiv articles.
For cases like this where math mode gets interrupted, as with your example $\sim 4 \times$ $10^{6}$..., the parse warning is also useful feedback to authors using latexml, who may want to edit their formulas to parse grammatically. latexml already covers a few choice cases that would extract an isolated construct out of math mode and deposit it in a textual element. We have been discussing - but haven't yet settled on - a few additional cases where adjacent math elements may be merge-able, so as to improve the parsing success rates.
And lastly - triaging issues. We have been somewhat disciplined in adding support for the "next most needed" package in arXiv with the limited time we have available for that work. And respectively - fixing the "next most common" Error and Fatal issues.
The aggregate reports are indeed very informative in that regard. But as the issues in this repository often point out, there is a difference between broad coverage over arXiv and pixel-perfect individual articles. Sometimes we will manage to preserve the content (and thus have no log messages emitted by latexml), but be inaccurate with the exact sizing and styling of the emitted HTML elements - which leads to clunky rendering. In those cases especially, having the arXiv community report the problems back is extremely helpful, as they may remain invisible to us otherwise.
Hope some of that is helpful - I think we are very much in alignment on how to generally approach solving the "arXiv to HTML problem".
Oh and a fun note at the end: if you want to draw 10,000 ar5iv articles at random, consider using the "feeling lucky" feature. One can fetch the URL https://ar5iv.labs.arxiv.org/feeling_lucky and follow the redirect - each visit should lead to a different article. Then, swapping /html/ for /log/, you can obtain the log info.
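A rough sketch of that sampling loop, again assuming the `requests` library (the redirect handling and URL swap follow the description above; everything else is illustrative):

```python
# Sketch: sample random ar5iv articles via "feeling lucky" and fetch their logs.
# Each /feeling_lucky request redirects to a random /html/<arxivid> page,
# so swapping /html/ for /log/ gives the corresponding log URL.
import requests


def random_log(session: requests.Session) -> tuple[str, str]:
    resp = session.get("https://ar5iv.labs.arxiv.org/feeling_lucky")
    html_url = resp.url  # final URL after the redirect has been followed
    log_url = html_url.replace("/html/", "/log/")
    return log_url, session.get(log_url).text


with requests.Session() as session:
    for _ in range(10):  # scale up to 10,000 as needed
        url, log_text = random_log(session)
        print(url, len(log_text))
```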
Thanks again for the conversation and detailed testing of ar5iv!
I will close here, but you are always welcome to open more issues for specific articles, feature requests, or general quality-of-life upgrades for latexml.
Exact location of issue
Thanks to the detailed conversion reports, it is now easy to find conversion errors and warnings. But understanding them on a corpus of 2.5 million documents requires power tools. It's not practical to eyeball that many documents, even with several hundred pairs of eyes.
This issue is prompted by a single example. Exploring conversion reports, in https://ar5iv.labs.arxiv.org/log/1910.07940 I came across:
I looked at the HTML in my browser, and didn't see anything missing. I then looked at the TeX source. The fragment in context is:
Notice we have two math formulas separated by white space. Now the MULOP warning makes sense. And also, on second sight, I notice some strange math-mode spacing following that formula in the HTML. The PDF has the same strange spacing. (Well done for fidelity to the PDF here.)
The task prompted by this example is to find most of the instances where a math formula is followed, after only a space, by another math formula, producing this kind of conversion warning. A rough heuristic is sketched below.
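As a starting point, one could scan the TeX sources with a simple regular expression. The pattern below is an assumption of mine, not a vetted detector: it only handles `$...$` inline math and will misfire inside comments or verbatim environments.

```python
# Sketch: a rough heuristic for spotting two inline math formulas separated
# only by whitespace, as in "$\sim 4 \times$ $10^{6}$".
import re

ADJACENT_MATH = re.compile(r"\$[^$]+\$\s+\$[^$]+\$")


def find_adjacent_math(tex_source: str) -> list[str]:
    """Return the raw matches of adjacent inline math formulas."""
    return ADJACENT_MATH.findall(tex_source)


print(find_adjacent_math(r"a mass of $\sim 4 \times$ $10^{6}$ solar masses"))
```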
Problem details
A good first step would be an easy way to download a representative sample of conversion reports.
My guess is that there will be at least 10,000 examples of such LaTeX coding in the arXiv corpus, and at most 100,000. But that's only a guess.
Once problem-spotters have a representative sample of conversion reports, they can identify frequent problems and start to understand them. They can also estimate their frequency over the whole corpus, thereby meeting the goal of understanding many conversion warnings and errors.
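A back-of-the-envelope sketch of that frequency estimate, assuming a simple random sample; the hit count here is made up purely for illustration:

```python
# Sketch: estimate corpus-wide frequency of a problem from a random sample
# of conversion reports, with a crude 95% confidence interval.
import math

corpus_size = 2_500_000
sample_size = 10_000
hits_in_sample = 120  # hypothetical count of reports showing the pattern

p = hits_in_sample / sample_size
stderr = math.sqrt(p * (1 - p) / sample_size)
print(f"estimated rate: {p:.2%} +/- {1.96 * stderr:.2%}")
print(f"estimated instances in corpus: ~{round(p * corpus_size):,}")
```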
Looking ahead, the word 'many' is imprecise, as is 'almost all'. Understanding even 5% would, I think, already be enough to provide some recommendations for authors, and also to build automated quality-checking tools.