daqana / tikzDevice

An R package for producing graphics output as PGF/TikZ code for use in TeX documents.
https://daqana.github.io/tikzDevice

Memory and plots with lots of data #103

Closed: elbamos closed this issue 9 years ago

elbamos commented 9 years ago

When plotting with tikz, as your data size increases the number of elements in the chart grows with it, which quickly brings TeX to a crawl and exhausts its memory. It looks like a bunch of us have the same issue: http://stackoverflow.com/questions/26299311/tikz-takes-more-than-max-latex-memory-for-complex-r-plot

For me, I started an experiment using a sample of 6,000 data points. I made a really nice rmarkdown document from that. Now I'd like to render the whole thing with the entire dataset of 110,000 points. I'm sure you know what happened...

externalize=true doesn't seem to solve the issue.

There has to be some way tikz can handle this.

krlmlr commented 9 years ago

Would you like to share your example, perhaps after substituting your data?

elbamos commented 9 years ago

It's rather tricky to come up with a minimally reproducible example for an issue that arises when the size of the plot grows very large!

Any suggestions on how to do that?

krlmlr commented 9 years ago

The example can be minimal even if it's not small. Can you create data similar to your original data, so that the size of the data is a variable, say, N?

elbamos commented 9 years ago

I'll give it a try today.

elbamos commented 9 years ago

Try this rmarkdown document:

---
title: "Contrived Example"
output: pdf_document
---

```{r dev="tikz",fig.cap="Contrived Example"}
library(ggplot2)
library(magrittr)  # provides the %>% pipe used below
data.frame(col1 = rnorm(1000000), col2 = rnorm(1000000), col3 = factor(sample(1:2, size = 1000000, replace = TRUE))) %>%
  ggplot(aes(x = col1, y = col2)) +
  geom_point(position = "jitter", alpha = 0.2, size = 1.5) +
  geom_smooth() +
  facet_wrap(~col3, ncol = 1)
```

elbamos commented 9 years ago

No response? ok...

krlmlr commented 9 years ago

Thanks for your example, I'll take a look at it when I next work on this package.

smason commented 9 years ago

Yes, but I'm not sure what it's going to do otherwise. Even rendering a million points like this to a PDF will cause most renderers to grind to a halt; I'm surprised TeX gets as far as it does!

I've tended to go with rendering to an image and then embedding this in a normal plot. For example, rewriting your example in terms of R's base graphics:

```r
library(tikzDevice)
n=1000000; x=rnorm(n); y=rnorm(n)
tikz("test.tex",standAlone=TRUE); plot(x,y); dev.off()
system("latex test.tex")
```

When dealing with this many points, I would tend to write them to a PNG file first:

png("inner.png",width=8,height=6,units="in",res=300,bg="transparent")
par(mar=c(0,0,0,0))
plot.new(); plot.window(range(x), range(y))
usr <- par("usr")
points(x,y)
dev.off()

and then render this image into the "real" plot in the correct place, with the parts that usefully remain as vectors (axes, labels, and so on) drawn over the top:

im <- readPNG("inner.png",native=TRUE)

tikz("test.tex",7,6,standAlone=TRUE)
plot.new(); plot.window(usr[1:2],usr[3:4],xaxs="i",yaxs="i")
rasterImage(im, usr[1],usr[3],usr[2],usr[4])
axis(1); axis(2); box(); title(xlab="x",ylab="y")
dev.off()

Note that this technique works with most vector renderers: for example, you can replace the tikz() device above with a call to pdf() or cairo_pdf() and get a much smaller PDF than you would if you did the naïve thing of sending all the points to the vector device, which is why I started using this approach. It is certainly more fiddly and fragile than the gg syntax, though!

I doubt tikzDevice could do this sort of thing automatically; AFAIK R exposes the wrong sort of interface for it to be done easily…

krlmlr commented 9 years ago

@smason: Thanks for chiming in.

It's true that it will probably take a long time to render a million data points from a PDF. But if LaTeX+tikz can be persuaded to create such a PDF, this is what tikzDevice should do. One can also convert a PDF created by the tikzDevice to a high-resolution PNG to accelerate rendering.
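
A minimal sketch of that conversion (assuming the poppler-utils tool pdftoppm is available; the file names are just placeholders):

```r
# Rasterise a tikzDevice-produced PDF to a 300 dpi PNG so viewers don't have to
# re-render a million vector points. Assumes pdftoppm (poppler-utils) is on the PATH.
system("pdftoppm -png -r 300 plot.pdf plot")  # writes plot-1.png, plot-2.png, ...
```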

smason commented 9 years ago

tikzDevice already works fine for me when outputting a million points, it was LaTeX that fell over with a lack of memory…

I can get up to n ~= 7500 in a standard LaTeX run, while LuaLaTeX seems to be fine above that. They are comparable in how long each takes to run on smaller inputs; LuaLaTeX just doesn't seem to have the hard limits on memory use. I've tried as far as 30k points, which took about a minute for me, and the resulting PDF looked fine.

elbamos commented 9 years ago

@smason You have a good point -- I'm using tikzDevice with the tufte-handout rmarkdown template, which forces pdflatex, apparently because the template doesn't compile with luatex or xetex. I'm not sure whose bug it is then, but it may not be a tikzDevice bug.
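
For reference, a plain pdf_document (as in the contrived example above) can be switched to another engine in the YAML header -- a minimal sketch, separate from the Tufte template that, as noted, forces pdflatex:

```yaml
---
title: "Contrived Example"
output:
  pdf_document:
    latex_engine: lualatex
---
```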

Just to clarify, my contrived example is only intended to replicate the error. My use case is not so trivial. I'm trying to make a slew of very different charts -- some boxplots, some scatterplots, some node-and-edge charts of graphs -- from the same set of data. I prototyped with n ~= 1000, and tikzDevice produced some really beautiful plots. My full dataset has n ~= 9000, and tikzDevice (or pdflatex, or whatever) now fails with many of them.

One of the really nice things about tikzDevice plots is that, as long as the plot doesn't error out, they can look really beautiful and elegant with lots and lots of data, where a plot from another R device would get blurry and mushy and ugly.

krlmlr commented 9 years ago

If it works with LuaLaTeX, you can compile a standalone plot and use \includegraphics with the generated PDF in your final document. If the preambles are similar (especially w.r.t. font settings), the result should be virtually identical. This is what I usually do to streamline the process -- I don't want to tikz all of my plots each time I compile the document that contains them. If it helps, I can share my setup that includes a Makefile and other bells and whistles.
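
A minimal sketch of that workflow (file names are illustrative, and the \includegraphics line is just the LaTeX side shown as a comment):

```r
library(tikzDevice)

# Compile the plot once, standalone, with LuaLaTeX (no hard memory limit)...
tikz("myplot.tex", width = 5, height = 4, standAlone = TRUE)
plot(rnorm(100), rnorm(100))
dev.off()
system("lualatex myplot.tex")  # produces myplot.pdf

# ...then include the finished PDF in the main document, which can keep using pdflatex:
# \includegraphics{myplot.pdf}
```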

smason commented 9 years ago

When you say "a plot with another R device would get blurry and mushy and ugly", what other devices did you try? They all end up very similar in my experience, you just need to make sure you're asking for a high enough resolution version if it's being rasterised. Otherwise, try a vector format like pdf (the cairo_pdf device is most flexible in my experience).

Just doing a bit of searching: if you're doing a scatterplot with lots of points, how about binning them first? Something like ggplot(aes(x=col1,y=col2)) + geom_bin2d() instead.
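
A minimal sketch of that suggestion, reusing the contrived data from earlier in the thread (the bins value is arbitrary):

```r
library(ggplot2)
d <- data.frame(col1 = rnorm(1000000), col2 = rnorm(1000000))

# Each bin becomes a single rectangle in the output, so the TeX file stays small
# no matter how many points go in.
ggplot(d, aes(x = col1, y = col2)) +
  geom_bin2d(bins = 100)
```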

elbamos commented 9 years ago

@krlmlr thank you, I would appreciate taking a look at that. I've been meaning to try to port the tufte-book class to rmarkdown, which should work with luatex/xetex as well. All I wanted was a pretty set of charts, now I'm learning tex... ugh...

@smason I used a scatterplot as an example because it was easy to trigger the error. I already converted most of the scatterplots to hexbins. The ones that aren't converted have different point colors, so they're not amenable to binning.

When I say they get blurry and mushy, I mean CairoPDF and QuartzPDF. The visual difference from tikz is a bit subtle, but it's easy to spot when you put the plots side by side. Tikz edges are sharper, fine lines are cleaner, etc.

elbamos commented 9 years ago

I was able to get the tufte class to (mostly) compile with luatex, which resolved the memory issue. Looking closely at the plots, I think what is making them prettier than CairoPDF plots may be the way tikz handles alpha. I'm happy to help provide examples or whatever, but I'm closing this issue as resolved. Thanks guys!

smason commented 9 years ago

Hope this post doesn't reopen this issue!

@elbamos I'd personally be interested in seeing what you mean. In my (limited) testing I have noticed that tikz tends to use a slightly finer line-width by default than cairo, but otherwise things look very similar. Then again, I do tend to go for Tufte-style minimalist black-and-white plots wherever I can.

elbamos commented 9 years ago

@smason Ok, I'm making some examples to give to you. (Don't judge my plots lol...) Looks like it's going to come to a total of 8 MB or so of PDF files - should I e-mail them?

smason commented 9 years ago

Please do, my email should be on my github profile… Sam

elbamos commented 9 years ago

Sent. If you don't receive them within an hour or so let me know.

elbamos commented 9 years ago

Ok, that didn't work, so I uploaded them here: https://github.com/elbamos/samson

goens commented 3 years ago

This is quite some years later, but I have been running into this issue for a while. Workarounds like lualatex or manual rasterizing do work, but they don't really solve the problem at its root. I saw this (also old) stackoverflow post: https://stackoverflow.com/questions/43565940/reducing-the-output-file-size-of-tikzdevice .

The interesting thing about this post is how it references the way the MATLAB package matlab2tikz handles this, via an option minimumPointsDistance, which basically filters out points that are too close to each other (and which probably won't even make a visual difference when removed, depending on the value). I thought this might work as a proper solution to this problem and wanted to mention it here (or rather ask what you think).
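
To illustrate the idea (this is not an existing tikzDevice option; thin_points and min_dist are made up for the sketch), snapping points to a grid and keeping one point per cell approximates "drop points closer than min_dist":

```r
# Hypothetical pre-filter: keep at most one point per min_dist x min_dist cell,
# so visually overlapping points never reach the TeX output.
thin_points <- function(x, y, min_dist = 0.05) {
  keep <- !duplicated(data.frame(round(x / min_dist), round(y / min_dist)))
  list(x = x[keep], y = y[keep])
}

x <- rnorm(1000000); y <- rnorm(1000000)
thinned <- thin_points(x, y, min_dist = 0.05)
length(thinned$x)  # a small fraction of the original million points
```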

davidwoodburn commented 1 year ago

I would argue that LaTeX running out of memory merely highlights the underlying problem: the figure has too many points. Even if you find a way to create a vector image with a million points, no one can discern all those points and the image will be several megabytes large, which becomes a problem for emailing, as demonstrated in the above conversation. The right solution (I believe) is to use any of various line-simplification methods (e.g., under-sampling, Ramer-Douglas-Peucker, Visvalingam–Whyatt). Remember, the purpose of a figure is not to transmit data, but to illustrate relationships.
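
A minimal sketch of the simplest of those, plain under-sampling of a dense signal before it reaches tikz() (the factor of 200 is arbitrary; Ramer-Douglas-Peucker or Visvalingam-Whyatt would choose the retained points more intelligently):

```r
library(tikzDevice)

# A dense, slowly varying signal: one million samples.
t <- seq(0, 10, length.out = 1e6)
y <- sin(t) + 0.01 * rnorm(length(t))

# Keep every 200th sample; visually indistinguishable for a curve like this,
# and the TeX output shrinks by roughly the same factor.
idx <- seq(1, length(t), by = 200)

tikz("signal.tex", width = 5, height = 3)
plot(t[idx], y[idx], type = "l", xlab = "t", ylab = "y")
dev.off()
```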