Open bfirsh opened 6 years ago
But does it always generate a raster? I'd think it would depend on what kinds of drawing are in the PDF itself: line-oriented content would generate vectors, but a PDF that has a raster embedded is going to generate a raster in the SVG. Do you have any small samples where you'd expect vectors?
[We're using ImageMagick for almost all image conversion. It's finicky, but it's nice to be able to rely on a single tool/dependency.]
If it’s vector in the source PDF, it’s vector in the output SVG. If it’s raster in the source PDF, it’s raster in the output SVG. That’s what pdf2svg does.
Presumably the same process which produces math and tikz SVGs would do the same thing? (I’m not sure how that works but it seems to run them through TeX then somehow outputs an SVG.) If that system were used, then it’s not adding any additional dependencies.
On 12/08/2017 08:48 AM, Ben Firshman wrote:
> If it’s vector in the source PDF, it’s vector in the output SVG. If it’s raster in the source PDF, it’s raster in the output SVG. That’s what pdf2svg does.
Yeah; that's my question: Does LaTeXML (using ImageMagick) not do that? Does it always generate a raster?
Oh, this is a topic I have some very painful experience in, so maybe I can say a couple of words.
First, ImageMagick has "pathological" behavior on certain (very hard to classify or predict) PDF/EPS inputs, in particular PDFs that encode vector graphics. By pathological I mean that it will do any of the following: loop forever at runtime, fail with an out-of-memory exception, or fail silently with no image produced and files left over on the filesystem.
One workaround that was widely used and approved in places such as StackOverflow was to delegate vector PDFs to a different processing engine, in particular a headless inkscape process. This is something I have seen work very reliably in the past, though it also shows the inverse problem: there are pathological images that don't convert in inkscape in, say, 10 minutes but finish in a few seconds in imagemagick, and vice versa, generally following the vectors vs. pixels distinction.
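A minimal sketch of that "try one engine, fall back to the other" idea, in Python. The function name `convert_with_fallback` and the converter callables are hypothetical; nothing here is LaTeXML's actual behavior, and the real engines would be wrapped in callables that enforce their own time limits.

```python
def convert_with_fallback(src, dst, converters):
    """Try each (name, convert) pair in order.

    `convert(src, dst)` is an assumed callable that returns True on success
    and returns False or raises on failure. Returns the name of the first
    engine that succeeded, or None if every engine failed.
    """
    for name, convert in converters:
        try:
            if convert(src, dst):
                return name
        except Exception:
            # A failing engine should not stop us from trying the next one.
            continue
    return None
```

For a vector-heavy PDF one might list the vector-preserving engine first and the raster one second, and swap the order for pixel-heavy inputs.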
On the upside, and to go back to your discussion here, when inkscape succeeds with the conversion the resulting SVG truly preserves the vector definitions in the PDF, and the final image is high quality (if you get the appropriate web fonts to match the PDF fonts, or you get authors writing to you that the kerning is off by several millimeters...).
Should latexml rely on inkscape as well for these cases? Hard to say, it is a rather large dependency and may feel better as a plugin than as a core component. There are entire companies that deal with image conversion / hosting so it's an admittedly large and non-trivial problem. There may also be some space to think about where an exact line needs to be drawn between latexml and a general-purpose image conversion tool.
Lastly, arxiv conversions have given me endless grief in these "pathological" cases - you can't rely on the latexml process to time out / back out of an underlying infinite loop in C (where imagemagick can get stuck), so you need an external watchdog process to monitor that. This is a big part of what LaTeXML-Plugin-Cortex ended up covering for.
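To illustrate the external watchdog idea, here is a minimal Python sketch, assuming a POSIX system. The name `run_with_watchdog` is made up for this example and this is not how LaTeXML-Plugin-Cortex is implemented. The child is started in its own session so that, on timeout, the whole process group can be killed, including any delegate the converter spawned.

```python
import os
import signal
import subprocess


def run_with_watchdog(cmd, timeout_s):
    """Run `cmd` (a list of arguments); return True only on a clean exit in time."""
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # child leads its own process group (POSIX only)
    )
    try:
        return proc.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        try:
            # Kill the whole group, so helper processes do not linger.
            os.killpg(proc.pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the process exited just after the timeout fired
        proc.wait()  # reap the child to avoid leaving a zombie
        return False
```

A caller would still need to clean up any partial output files after a `False` result, since a killed converter cannot do that itself.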
Related issues for some background: #663 and #666
@brucemiller Here's an example of a vector PDF figure: cifar10_48-48-10_batch_10_plot.pdf
Here's the latexml output: https://www.arxiv-vanity.com/papers/1703.00441v2/#S4.F2.sf3
It is also particularly fuzzy because the DPI is not configurable, but that's another problem!
pdf2svg converted this to a scalable SVG without problems.
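For reference, a small Python sketch of what such a conversion step could look like. It assumes the `pdf2svg` binary is on `PATH` and uses its command-line form `pdf2svg <in.pdf> <out.svg> [<page>]`; the helper names and the 60-second default are illustrative assumptions, not part of any existing plugin.

```python
import shutil
import subprocess
from pathlib import Path


def pdf2svg_command(pdf_path, svg_path, page=1):
    """Build the argument list for: pdf2svg <in.pdf> <out.svg> [<page>]."""
    return ["pdf2svg", str(pdf_path), str(svg_path), str(page)]


def convert_pdf_to_svg(pdf_path, svg_path=None, page=1, timeout_s=60):
    """Convert one page of a PDF to SVG; returns the output path."""
    if shutil.which("pdf2svg") is None:
        raise RuntimeError("pdf2svg not found on PATH")
    pdf_path = Path(pdf_path)
    # Default output: same name as the input, with an .svg suffix.
    svg_path = Path(svg_path) if svg_path else pdf_path.with_suffix(".svg")
    # check=True raises on a non-zero exit; timeout guards against hangs.
    subprocess.run(pdf2svg_command(pdf_path, svg_path, page),
                   check=True, timeout=timeout_s)
    return svg_path
```

Calling `convert_pdf_to_svg("figure.pdf")` would write `figure.svg` next to the input, provided the binary is installed.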
@dginev Agreed that a plugin is a good place to start. Perhaps it could be optional core functionality, so it isn't a hard dependency.
I might have a shot at a pdf2svg plugin to fix this for Arxiv Vanity, if I get round to it.
Ah, yes, of course ImageMagick isn't preserving the vectorness; it's basically a pipeline of raster operations, so the first thing it'll want to do is convert to an internal raster. By the same token, even if we introduce a dependency on pdf2svg to keep the image in vector form, we'd need a vector alternative duplicating the whole transform sequence (all the stuff that graphicx brings in). With SVG, this is of course possible, maybe even "easy" in some sense, but a whole bunch of new code & testing. In other words, a bit tricky.
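As a rough illustration of what a vector version of that transform sequence might involve, here is a Python sketch that maps two graphicx-style options, `scale` and `angle`, to an SVG `transform` attribute value. The function `svg_transform` is hypothetical and deliberately incomplete: it ignores `width`, `height`, `trim`, `clip`, and the rotation origin, all of which a real implementation would have to handle. Two details it does capture: graphicx angles are counterclockwise while SVG's `rotate()` is clockwise on screen (y axis pointing down), and an SVG transform list applies its rightmost entry first.

```python
def svg_transform(ops):
    """Translate graphicx-style options into an SVG transform string.

    `ops` is a list of (name, value) pairs in the order the options were
    given, e.g. [("scale", 0.5), ("angle", 90)], since graphicx applies
    them left to right.
    """
    parts = []
    for name, value in ops:
        if name == "scale":
            parts.append(f"scale({value:g})")
        elif name == "angle":
            # Negate: counterclockwise in graphicx, clockwise in SVG.
            parts.append(f"rotate({-value:g})")
        else:
            raise ValueError(f"unsupported option: {name}")
    # SVG applies the rightmost transform first, so reverse the order.
    return " ".join(reversed(parts))
```

The resulting string could go on a `<g transform="...">` element wrapped around the converted graphic.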
A bit too open-ended for 0.8.4, pushing back to 0.9 until we have an attack plan in mind.
To pin down a concrete high-difficulty test for this direction of work, today I encountered arXiv:1804.00311.
That article has multiple graphics using PDF assets which take north of 5 minutes to convert via ghostscript -- and even encounter API errors for metadata operations, such as obtaining the size. If we can become more efficient and correct in such cases, as we also start producing SVG for them, that would be an excellent outcome.
Today I also stumbled on another gs-intensive example from arXiv:1807.01606. Attaching one PDF asset for future testing - it takes 11 minutes to execute gs on my machine.
Since the article has 15 of these PDFs, it reliably times out with the current build setup.
When using --graphicsmap=pdf.svg, it converts the graphic to an SVG with a raster rendering of the PDF. I would expect it to convert it to a vector graphic. The same presumably applies for EPS, AI, and PS. For Engrafo, we had success using pdf2svg. Presumably the same result can be achieved by piping it through the same LaTeX rendering system that renders math/tikzpicture as SVG.