brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Timeout ignored in pathological math example #741

Closed dginev closed 8 years ago

dginev commented 8 years ago

In this gist I have attached a pathological TeX snippet that takes some 70 seconds to terminate, even after it hits a 20-second timeout alarm signal.

It's an anonymized source from Authorea, where an author made a typo and accidentally turned the rest of a block into a huge formula. Note that this times out even without math parsing enabled, as at Authorea we are using the equivalent of:

latexmlc pathological_math_error.tex  
  --noparse --timeout=20 --format=html5 
  --whatsin=fragment --whatsout=fragment

There are two separate issues I would like to solve here:

  1. Make the 20 second timeout succeed and truly terminate processing (the alarm seems to interrupt the absorption of math boxes)
  2. Try to optimize latexml's processing for this example, so that it runs an order of magnitude faster, if possible

Ideas welcome, I'll take a look.

dginev commented 8 years ago

Reading up on Perl's safe signals shows a very reliable workaround: invoking latexml prefixed with PERL_SIGNALS=unsafe, which avoids the delayed, safe signal handling in modern Perl.

That immediately kills the process, but risks corrupting Perl's internal data structures, which isn't an option for a long-lived server, for example.
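
For concreteness, the workaround is just the environment variable prefixed to the same invocation as above:

PERL_SIGNALS=unsafe latexmlc pathological_math_error.tex
  --noparse --timeout=20 --format=html5
  --whatsin=fragment --whatsout=fragment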

I am also wondering why the timeout is reported as a warning here:

Warning:perl:warn Perl warning
    at /tmp/scratch.tex; line 23 col 39
        (in cleanup) 
    Fatal:timeout:timedout Conversion timed out
        at /tmp/scratch.tex; line 23 col 39

That message is printed at exactly the second the alarm is triggered, so it seems the signal is handled (though perhaps wrongly?) by LaTeXML's Error module.

dginev commented 8 years ago

I analyzed some of the stack traces and indeed, whenever the alarm lands in the middle of libxml processing, it fails to interrupt the C call. It is actually handled at the right time in Perl (the log messages are printed just in time), yet the processing isn't terminated; the job continues until it is fully complete, even returning the final result and log correctly.

So now I'm looking for a reasonable strategy for interrupting libxml (although the article I linked to seems to suggest there isn't a truly reasonable one).
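
For reference, here is a minimal sketch of the alarm-based timeout pattern in question (not LaTeXML's exact code; with_timeout and its arguments are made up for illustration). Under Perl's default safe signals the handler only runs between Perl opcodes, so a die raised there cannot unwind a libxml2 C call that is still executing:

use strict;
use warnings;

sub with_timeout {
  my ($seconds, $work) = @_;
  my $result = eval {
    local $SIG{ALRM} = sub { die "Fatal:timeout:timedout\n" };
    alarm($seconds);
    my $r = $work->();  # a blocking XS/C call in here is not interrupted
    alarm(0);
    $r;
  };
  alarm(0);             # make sure the alarm is cleared on the error path
  warn $@ if $@;
  return $result;
}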

dginev commented 8 years ago

Maybe the best tactic for this issue is to understand why latexml takes over a minute here (mostly in ligature application, it seems) and optimize away the slowness.

Using unsafe signals is certainly an option for systems like CorTeX, but it doesn't sound viable for long-running web servers that offer a latexml conversion capability.

dginev commented 8 years ago

The slowest, inner-most absorb of math content timed as follows on my laptop:

-- Absorb of 3281 boxes took  99.51 sec

The example file is just over 3900 characters total, so it's believable that 3281 of those were turned into single-character tokens, then boxes, then XMToks. That means absorbing a one-character math box takes about 0.03 seconds in this example, which seems on the slow side of things. I'll see if I can profile this further.

dginev commented 8 years ago

Pathological examples have pathological performance problems. If you look at this line:

https://github.com/brucemiller/LaTeXML/blob/master/lib/LaTeXML/Core/Document.pm#L1050

  my @sibs = $node->parentNode->childNodes;

Currently, when we apply ligature checks, we first grab all child nodes and then pass them on as arguments to the ligature matcher routine.

What that means for a math formula with 3200 tokens is that we make function calls with 3200 arguments, all of which get pushed onto and popped off the Perl argument stack. And multiply that by 9, the number of ligatures available by default via TeX.pool. Ouch!

But the slowest bit here may actually be the ->childNodes call, which needs to instantiate 3200 XML::LibXML::Node objects each time.

As a test, I simply changed that line to:

my @sibs = ();

and the example finished in 1.4 seconds. So this is clearly the bottleneck.
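
To see the asymmetry in isolation, here is a rough standalone benchmark sketch (not LaTeXML code; the 3200-child element is synthetic) comparing a full ->childNodes fetch against a lazy two-node window:

use strict;
use warnings;
use XML::LibXML;
use Benchmark qw(cmpthese);

# Build an element with ~3200 children, mimicking the pathological formula.
my $doc  = XML::LibXML::Document->new;
my $math = $doc->createElement('XMath');
$math->appendChild($doc->createElement('XMTok')) for 1 .. 3200;

cmpthese(-2, {
  # Instantiates all 3200 XML::LibXML::Node wrappers on every call.
  all_children => sub { my @sibs = $math->childNodes; },
  # Instantiates at most 2 wrappers per call.
  window_of_2  => sub {
    my @sibs;
    my $child = $math->lastChild;
    while ($child && @sibs < 2) {
      unshift @sibs, $child;
      $child = $child->previousSibling;
    }
  },
});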

dginev commented 8 years ago

I have added a PR that fixes the performance in the Authorea use case, but we would still suffer timeouts in similar cases for arXMLiv, where math parsing is enabled. I think the PR is still worth merging, however, as it offers a very generic speedup.

dginev commented 8 years ago

NOTE: This and the following comment cover a second, separate example, not attached to this issue. I hope my explanation (and PR) are clear enough that the suggestion makes sense as-is.

I also have a pathological recursive example, on a very broken snippet using \lx@ams@matrix@, that also goes into libxml2 crazy land and fails to time out until tens of seconds after the alarm is triggered.

But I'm unsure if the deep recursion has its own part to play. This example doesn't benefit from using PERL_SIGNALS=unsafe, so there may be a slightly different issue.

dginev commented 8 years ago

... that moment when the error report for hitting a timeout causes a second timeout.

So this was very curious. A broken input caused a timeout, which was triggered exactly at the right time, and called Fatal() immediately, which is great and as desired.

However, what happened next is a serious issue for the timeout semantics. Inside Error::generateMessage there is a snippet commented:

  # Add Stack Trace, if that seems worthwhile.

For this example, that snippet took 22 seconds to execute. That is an astonishingly long runtime for generating a log message, and it is incredibly ironic that generating the log message for a timeout Fatal takes longer than the actual timeout budget (which was 12 seconds here). Of course the input here is pathological: there is deep recursion on what seems to be opening new alignment groups, which naturally produces a huge stack trace. But even in these cases a timeout should stop execution immediately and return control to the caller.

I will prepare a pull request which ensures very lean Fatal reporting for timeouts and avoids any potentially time-consuming operations in generating the error message. There is simply no time for deep stack-trace generation in performance-critical settings.

Funny business
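
A purely hypothetical sketch of the guard I have in mind (this is not LaTeXML's actual code; generateMessage's real signature differs and stack_trace is a stand-in helper):

sub generateMessage {
  my ($severity, $category, @details) = @_;
  my $message = join(':', $severity, $category, @details);
  # Add Stack Trace, if that seems worthwhile...
  # ...but never for timeouts: walking a deeply recursive stack can
  # take longer than the timeout budget itself.
  if ($category ne 'timeout') {
    $message .= "\n" . stack_trace();
  }
  return $message;
}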

dginev commented 8 years ago

Thanks for merging both PRs! I will mark this issue as fixed; they definitely resolved the original bug report.

brucemiller commented 8 years ago

No, let's not close it quite yet; let's see if we can figure out how to fix the pathological case.

dginev commented 8 years ago

Sure. I think the only reasonable fix is to stop using ->childNodes entirely when fetching the arguments for ligature analysis, and instead use a window of N, fetching the nodes one by one:

my @children;
my $child = $node->lastChild;    # end of the window
while ($child && (scalar(@children) < $window_size)) {
  unshift @children, $child;
  $child = $child->previousSibling;
}

In the pathological case, the call to childNodes does way too much, having to fetch + autovivify over 3000 XML::LibXML::Node objects.

Let me know if you want me to make a PR with that as well; I can test it out tomorrow. Edit: fixed the bug in my code snippet.

brucemiller commented 8 years ago

If you'd like to, try your hand at it. You're probably going to want to concoct some extra test cases(?) This stuff is kinda brittle and I don't even trust myself :<

brucemiller commented 8 years ago

I just took a moment to look at the code. Interestingly, it even claims to be inefficient! Primarily in that it repeatedly applies ligatures until all fail, and each pass grabs that list of children.

But with 3 exceptions, they all really deal with exactly 2 or exactly 3 tokens! The font and number ligatures take an arbitrary number, and the dots one takes 2 or 3. One design improvement would be to add that info to the ligature, so that the caller could (a) check whether there are enough nodes (and skip the call if not) and (b) pass exactly that number in. Even if you let the font and number ones still use the inefficient ->childNodes, that would be done much less often.

Wouldn't be perfect, but it would be a big improvement.
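
A hedged sketch of what that caller loop might look like (the field names nnodes and matcher are made up; LaTeXML's actual ligature records differ, and $node stands for the containing math element):

for my $ligature (@ligatures) {
  my $wanted = $ligature->{nnodes};  # e.g. 2 or 3; undef means "all"
  my @nodes;
  if (defined $wanted) {
    # Fetch only the last $wanted children, lazily.
    my $child = $node->lastChild;
    while ($child && @nodes < $wanted) {
      unshift @nodes, $child;
      $child = $child->previousSibling;
    }
    next if @nodes < $wanted;        # (a) not enough nodes: skip the call
  }
  else {
    @nodes = $node->childNodes;      # legacy path: font & number ligatures
  }
  $ligature->{matcher}->($document, @nodes);  # (b) pass exactly that many
}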

dginev commented 8 years ago

An enlightened Rubyist would likely state that an inefficient global fetch of an ordered list is a code smell for using a lazy enumerator.

Or, in plain language: if we grabbed ->lastChild and then invoked ->previousSibling lazily, only when needed, we would avoid the overhead. I can imagine a global refactor that would be very elegant in Rust, but probably not so much in Perl ...

The lastChild + previousSibling acrobatics can work in this particular case though; no need for too much fanciness.
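
For what it's worth, the lazy-enumerator idea is expressible in Perl as a closure (a sketch only; sibling_stream is made up for illustration):

# Returns a code ref that yields the children of $node from last to
# first, one per call, so callers pull only as many as they match.
sub sibling_stream {
  my ($node) = @_;
  my $current = $node->lastChild;
  return sub {
    my $this = $current or return undef;
    $current = $this->previousSibling;
    return $this;
  };
}

# Usage: my $next = sibling_stream($math);
#        while (my $tok = $next->()) { ... }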

brucemiller commented 8 years ago

Once I looked at the code, it seemed simple enough to try out a couple of things. First, I tried my suggestion of "math ligatures" optionally specifying exactly how many tokens they want, to dramatically reduce (but not eliminate) calls to ->childNodes. This gave Perl enough breathing room for the timeout signal to take effect (and is backwards compatible). Still...

So, I took the slightly more radical approach you suggested: just pass the $node and let the ligature use ->previousSibling, as needed. Much better. Now your example fails after 6 seconds (and even that is mostly time the parser spends scratching its head, which could also be looked at). The only catch is that it's an incompatible change!

Of course, DefMathLigature really needs, finally, to be "designed" and simplified. On the one hand, I doubt anybody's actually writing their own, since it is unexplainable. On the other hand, I could do an improved version that only takes keywords, so that I could finesse and recognize an old-style/new-style distinction based on argument count.
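
Such a dispatch could look roughly like this (a hypothetical sketch only; define_ligature is a stand-in and the real keyword set is not settled):

sub DefMathLigature {
  if (@_ == 1 && ref $_[0] eq 'CODE') {
    my ($matcher) = @_;        # old style: a bare matcher sub
    return define_ligature(matcher => $matcher);
  }
  my (%options) = @_;          # new style: keyword arguments only
  return define_ligature(%options);
}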

dginev commented 8 years ago

A good thing about having the ecosystem on GitHub is that we can search it. I tried my best to find uses of DefMathLigature in bindings, but only found it in latexml forks.

So I think it is one of the safest pieces to refactor with a breaking change. At the very least it deserves a mention in CHANGES though.

dginev commented 8 years ago

My vote on the refactor approach is again very vague + startupy: could we keep it simple? Complex APIs are always one of the biggest hurdles to overcome with third-party tools. If we can support the current features more efficiently and defer the more elaborate design until we need it, good things should happen to us.

brucemiller commented 8 years ago

Hmm, you're making "simple" sound like a great reason for a helluva lota WONTFIX's.... tempting...keep it up.

dginev commented 8 years ago

WONTFIX on 10 somewhat dangerous-bend issues usually results in high-quality fixes to 2 issues of core importance, which keeps a project lean, clean, and focused. I think such a mantra keeps both startups and open source projects alive and healthy in the long run.

dginev commented 8 years ago

Oh, do you think we can sneak the optimized math ligatures into 0.8.2? I want to reconvert arXiv again, and this set of fixes could get rid of a bunch of timeouts/dirty dies (I hope).

brucemiller commented 8 years ago

Ironically, it's only through catering to exotic tastes that we have the benefit of such sage advice recommending not to cater.

brucemiller commented 8 years ago

New MathLigatures are much easier to write, and more efficient, particularly in broken corner cases. Also fixed an unnecessary recursion in the kludge parser. Updated the POD, but still working on the manual.

dginev commented 8 years ago

Ironically, it's only through catering to exotic tastes that we have the benefit of such sage advice recommending not to cater.

There is some truth to that... indeed the real reason for my sage advice (sarcasm is appreciated) is having used/set up/experimented/helped with LaTeXML in probably 5 completely different applications by now, and developing some sort of intuition about which use cases are central/shared, and which are fringe. And high-effort fringe features can often be tamed down to medium-effort core features, which is what I've been calling "keeping it simple".

All I'm trying to do here is maximize impact and usability, and I'm always brainstorming. It's always perfectly OK to ignore my newfangled sageness :laughing: and call me out on broken logic.

dginev commented 8 years ago

The new commits are quite cool, looking forward to hitting arXiv with them!