brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Convert PDF graphics to scalable SVGs #902

Open bfirsh opened 6 years ago

bfirsh commented 6 years ago

When using --graphicsmap=pdf.svg, LaTeXML converts the graphic to an SVG containing a raster rendering of the PDF. I would expect it to produce a vector graphic instead. The same presumably applies to EPS, AI, and PS.

For Engrafo, we had success using pdf2svg. Presumably the same result can be achieved by piping it through the same LaTeX rendering system that renders math/tikzpicture as SVG.

brucemiller commented 6 years ago

But does it always generate a raster? I'd think it would depend on what kinds of drawing are in the pdf itself; more line oriented would generate vectors, but pdf that has a raster embedded is going to generate a raster in the svg. Do you have any small samples where you'd expect vectors?

[We're using ImageMagick for almost all image conversion. It's finicky, but it's nice to be able to rely on a single tool/dependency.]

bfirsh commented 6 years ago

If it’s vector in the source PDF, it’s vector in the output SVG. If it’s raster in the source PDF, it’s raster in the output SVG. That’s what pdf2svg does.

Presumably the same process which produces math and tikz SVGs would do the same thing? (I’m not sure how that works but it seems to run them through TeX then somehow outputs an SVG.) If that system were used, then it’s not adding any additional dependencies.

brucemiller commented 6 years ago

On 12/08/2017 08:48 AM, Ben Firshman wrote:

> If it’s vector in the source PDF, it’s vector in the output SVG. If it’s raster in the source PDF, it’s raster in the output SVG. That’s what pdf2svg does.

Yeah; that's my question: Does LaTeXML (using ImageMagick) not do that? Does it always generate a raster?
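One way to answer this empirically is to inspect the generated SVG itself. The following is a minimal sketch, not part of LaTeXML: the function name `classify_svg` and the set of tags treated as "vector" are assumptions chosen for illustration, and the check is only a heuristic (for example, a `rect` used inside a clip path still counts as vector content).

```python
import xml.etree.ElementTree as ET

# Heuristic (assumed) set of SVG elements that count as vector drawing.
VECTOR_TAGS = {"path", "line", "polyline", "polygon",
               "rect", "circle", "ellipse", "text"}


def classify_svg(svg_text):
    """Return 'raster', 'vector', 'mixed' or 'empty' for an SVG document."""
    root = ET.fromstring(svg_text)
    raster = 0
    vector = 0
    for element in root.iter():
        # Strip the '{namespace}' prefix that ElementTree puts on tag names.
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "image":
            raster += 1  # embedded or linked bitmap
        elif local_name in VECTOR_TAGS:
            vector += 1
    if raster and vector:
        return "mixed"
    if raster:
        return "raster"
    if vector:
        return "vector"
    return "empty"
```

Running this over the SVG produced for a line-oriented PDF would show whether the output is a single embedded `image` (raster) or real drawing elements (vector).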

dginev commented 6 years ago

Oh, this is a topic I have some very painful experience with, so maybe I can say a couple of words.

First, imagemagick has "pathological" behavior on certain (very hard to classify or predict) PDF/EPS inputs, in particular PDFs that encode vector graphics. By pathological I mean that it will do any of: loop forever at runtime, hit an out-of-memory exception, or fail silently with no image produced and files left over on the filesystem...

One workaround that was widely used and approved in places such as StackOverflow was to delegate vector PDFs to a different processing engine, in particular a headless inkscape process. I have seen this work very reliably in the past, but it also shows the inverse problem: there are pathological images that don't convert in inkscape in, say, 10 minutes, yet finish in a few seconds in imagemagick. And vice versa, generally following the vectors vs. pixels distinction.

On the upside, and to go back to your discussion here, when inkscape succeeds with the conversion the resulting SVG truly preserves the vector definitions in the PDF, and the final image is high quality (provided you get the appropriate web fonts to match the PDF fonts; otherwise you get authors writing to you that the kerning is off by several millimeters...).
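The "delegate to another engine" idea can be sketched as a fallback chain: try one converter under a time limit and move on to the next if it fails or stalls. This is an illustration only, not how LaTeXML works. The names `run_converter`, `convert_with_fallback` and `ENGINES` are hypothetical; the `pdf2svg input output` form is its documented usage, while the inkscape flags assume an Inkscape 1.x command line and the `convert` entry assumes ImageMagick 6 naming.

```python
import subprocess


def run_converter(argv, timeout_s):
    """Run one external converter; True only on a clean exit in time."""
    try:
        done = subprocess.run(argv, timeout=timeout_s,
                              stdout=subprocess.DEVNULL,
                              stderr=subprocess.DEVNULL)
    except (subprocess.TimeoutExpired, OSError):
        # Timed out (the child was killed) or the tool is not installed.
        return False
    return done.returncode == 0


# Ordered preference: vector-preserving tools first, raster pipeline last.
# Command lines are assumptions about the installed tool versions.
ENGINES = [
    ("pdf2svg", lambda src, dst: ["pdf2svg", src, dst]),
    ("inkscape", lambda src, dst: ["inkscape", src, "--export-type=svg",
                                   "--export-filename=" + dst]),
    ("imagemagick", lambda src, dst: ["convert", src, dst]),
]


def convert_with_fallback(src, dst, engines=ENGINES, timeout_s=60,
                          runner=run_converter):
    """Return the name of the first engine that succeeds, else None."""
    for name, build_argv in engines:
        if runner(build_argv(src, dst), timeout_s):
            return name
    return None
```

The `runner` parameter exists so the selection logic can be exercised without any of the tools installed.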

Should latexml rely on inkscape as well for these cases? Hard to say, it is a rather large dependency and may feel better as a plugin than as a core component. There are entire companies that deal with image conversion / hosting so it's an admittedly large and non-trivial problem. There may also be some space to think about where an exact line needs to be drawn between latexml and a general-purpose image conversion tool.

Lastly, arxiv conversions have given me endless grief in these "pathological" cases - you can't rely on the latexml process to time out / back out of an underlying infinite loop in C (where imagemagick can get stuck), so you need an external watchdog process to monitor that. This is a big part of what LaTeXML-Plugin-Cortex ended up covering for.
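A minimal sketch of such an external watchdog, assuming a POSIX system: the conversion runs in its own session so that the whole process group (including any delegate a converter may spawn) can be killed once a deadline passes. The function name `run_with_watchdog` is hypothetical, and this is not the mechanism LaTeXML-Plugin-Cortex actually uses.

```python
import os
import signal
import subprocess
import sys


def run_with_watchdog(argv, timeout_s):
    """Return the exit code, or None if the process group was killed."""
    # start_new_session=True makes the child a process-group leader,
    # so its pid doubles as the group id for killpg below (POSIX only).
    proc = subprocess.Popen(argv, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the group exited between the timeout and the kill
        proc.wait()  # reap the child so no zombie is left behind
        return None


if __name__ == "__main__":
    # Stand-in for a stuck conversion: a child that sleeps too long.
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_watchdog(stuck, timeout_s=1))
```

Killing the group rather than the single child matters when the stuck code lives in a subprocess of the tool that was launched.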

Related issues for some background: #663 and #666

bfirsh commented 6 years ago

@brucemiller Here's an example of a vector PDF figure: cifar10_48-48-10_batch_10_plot.pdf

Here's the latexml output: https://www.arxiv-vanity.com/papers/1703.00441v2/#S4.F2.sf3

It is also particularly fuzzy because the DPI is not configurable, but that's another problem!

pdf2svg converted this to a scalable SVG without problems.

bfirsh commented 6 years ago

@dginev Agreed that a plugin is a good place to start. Perhaps it could be optional core functionality, so it isn't a hard dependency.

I might have a shot at a pdf2svg plugin to fix this for Arxiv Vanity, if I get round to it.

brucemiller commented 6 years ago

Ah, yes, of course ImageMagick isn't preserving the vectorness; it's basically a pipeline of raster operations, so the first thing it'll want to do is convert to an internal raster. By the same token, even if we introduce a dependency on pdf2svg to keep the image in vector form, we'd need a vector alternative duplicating the whole transform sequence (all the stuff that graphicx brings in). With SVG, this is of course possible, maybe even "easy" in some sense, but it means a whole bunch of new code & testing. In other words, a bit tricky.
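To give a feel for what a vector version of part of that transform sequence might look like, here is a rough sketch that maps an ordered list of graphicx-style `scale` and `angle` operations to an SVG `transform` attribute. It is an assumption-laden illustration: `svg_transform` and `wrap_in_group` are hypothetical names, and all bounding-box bookkeeping (rotation origin, `trim`, `clip`, `width`/`height`) is ignored.

```python
def svg_transform(ops):
    """Build an SVG transform list from graphicx-style operations.

    ops is a list such as [("scale", 2.0), ("angle", 90.0)], given in
    the order the operations should be applied to the graphic.
    """
    parts = []
    for key, value in ops:
        if value == (1 if key == "scale" else 0):
            continue  # identity operation, nothing to emit
        if key == "scale":
            parts.append("scale({:g})".format(value))
        elif key == "angle":
            # graphicx angles are counter-clockwise; SVG's y axis points
            # down, so a positive SVG rotate() turns clockwise on screen.
            parts.append("rotate({:g})".format(-value))
        else:
            raise ValueError("unsupported operation: " + key)
    # In an SVG transform list the rightmost entry is applied first,
    # so the application order has to be reversed.
    return " ".join(reversed(parts))


def wrap_in_group(svg_fragment, ops):
    """Wrap already-vector SVG content in a transformed <g> element."""
    return '<g transform="{}">{}</g>'.format(svg_transform(ops), svg_fragment)
```

For example, scaling by 2 and then rotating by 90 degrees yields `rotate(-90) scale(2)`. The remaining graphicx options are where most of the new code and testing would go.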

dginev commented 5 years ago

A bit too open-ended for 0.8.4, pushing back to 0.9 until we have an attack plan in mind.

dginev commented 2 years ago

To pin down a concrete high-difficulty test for this direction of work, today I encountered arXiv:1804.00311.

That article has multiple graphics using PDF assets which take north of 5 minutes to convert via ghostscript -- and even encounter API errors for metadata operations, such as obtaining the size. If we can become more efficient and correct in such cases, as we also start producing SVG for them, that would be an excellent outcome.

dginev commented 8 months ago

Today I also stumbled on another gs-intensive example from arXiv:1807.01606. Attaching one PDF asset for future testing - it takes 11 minutes to execute gs on my machine.


[fig8.pdf](https://github.com/brucemiller/LaTeXML/files/14463618/fig8.pdf)

```tex
\documentclass{article}
\usepackage{graphicx}
\begin{document}
\includegraphics[width=10cm]{fig8.pdf}
\end{document}
```

Resulting PNG:

![image](https://github.com/brucemiller/LaTeXML/assets/348975/24fcfe24-ac5e-4f73-8e9e-3ae44db0be94)

Since the article has 15 of these PDFs, it reliably times out with the current build setup.