GiovineItalia / Gadfly.jl

Crafty statistical graphics for Julia.
http://gadflyjl.org/stable/

Drawing rectbin is extremely slow. #133

Open mokasin opened 10 years ago

mokasin commented 10 years ago

When using the geom rectbin to draw grayscale-like images or matrices, the drawing operation does not scale well.

Executing the following code on my machine (i5 3570K, DDR3 1600) takes nearly half a minute:

using Gadfly
using DataFrames

# dimension of matrix
dim = (500,500)

# create data
A = rand(dim...)
xy  = vcat([ [x y] for x = 1:dim[1], y = 1:dim[2] ]...)
z = reshape(A, *(dim...))

# convert to dataframe
df = DataFrame(hcat(xy,z), ["x", "y", "z"]);

p = plot(df, x="x", y="y", color="z",Geom.rectbin)

# this one takes loooooong
draw(PNG("rectbin.png", 12cm, 6cm), p)

Plotting a 1000x1000 matrix simply takes far too long. Routines like imshow from Python's Matplotlib need only a second to plot this.

Profiling the draw() call in the code above shows many calls to sort! originating from scale.jl.

Further investigation into what is taking so long seems necessary.
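
For readers who want to repeat this kind of measurement, here is a minimal sketch using Julia's Profile standard library. The function slow_render is a hypothetical stand-in workload, not Gadfly code; in the report above the profiled call would be the draw(...) line instead.

```julia
using Profile

# Hypothetical stand-in for the slow call. In the report above this would be
# draw(PNG("rectbin.png", 12cm, 6cm), p).
function slow_render(n)
    total = 0.0
    for _ in 1:n
        total += sum(sort!(rand(1_000)))
    end
    return total
end

slow_render(1)                  # run once so compilation is not profiled
Profile.clear()                 # discard samples from earlier runs
@profile slow_render(20_000)

# Flat listing ordered by sample count; frequently sampled functions
# such as sort! stand out in this view.
buf = IOBuffer()
Profile.print(buf; format = :flat, sortedby = :count)
report = String(take!(buf))
println(report)
```

The flat format is chosen here because it makes a single dominant function easy to spot; the default tree format shows who called it.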

dcjones commented 10 years ago

It should be somewhat faster now, but still quite slow. I know roughly what needs to be done to improve the speed here so I'll try to get to it soon.

mokasin commented 10 years ago

I'd be glad if you could elaborate on the issue a little. It would be instructive.

dcjones commented 10 years ago

The calls to sort! you saw came from unnecessarily storing something as a PooledDataArray rather than just a DataArray. That was easy to fix.

The slowness now is simply Gadfly (actually Compose) not being particularly fast at drawing very complex graphics. What you're plotting involves drawing 250000 or 1000000 rectangles, so it takes a while. I've not put much work into optimizing Gadfly, so that's something I need to improve in general.

That said, the example here is essentially coloring individual pixels. It will always be pretty inefficient if rendered using the SVG or D3 backends. That makes me think there should be some sort of special handling for what is essentially raster graphics.

mokasin commented 10 years ago

> That makes me think there should be some sort of special handling for what is essentially raster graphics.

That sounds sensible. ggplot2 also defines a special geom for it: geom_raster.

At least D3 kind of supports this using the canvas element: http://bl.ocks.org/mbostock/3074470 http://bl.ocks.org/mbostock/3289530

dcjones commented 10 years ago

Latest update: I've added the ability to rasterize parts of an SVG image and embed them as PNG. I still need to expose this in Gadfly. My first thought was to add an argument to rectbin, like Geom.rectbin(raster=true), but now I'm thinking this should just be an argument to plot, like plot(..., raster=true), that will cause all the geometry to be rasterized. Sound reasonable?

mokasin commented 10 years ago

Sounds reasonable to me. I'd also attach the flag to plot rather than to a specific geometry. You might, for whatever reason, want to plot a great many lines or points too. :thumbsup:

ssfrr commented 10 years ago

This sort of feature would be super useful for me as well. I used Winston for the figures in a project a few months ago, and I was hoping that I could switch to Gadfly.

In audio work it's extremely common to want to plot a spectrogram, which for the purposes of plotting is basically just a 2D matrix of floats:

[image: example spectrogram]

It would be great to be able to do this sort of thing and generate beautiful Gadfly plots!

dcjones commented 10 years ago

I've made changes to Compose and Gadfly to rasterize part of an SVG plot and embed it as an image. This solves the frontend slowness, I think: you can now have zoomable plots with these sorts of dense heat maps by doing:

plot(df, x=:x, y=:y, color=:z, Geom.rectbin, Coord.cartesian(raster=true))

Just generating the plot is still super-slow though (@mokasin's example takes nearly 40 seconds to render for me). I'll look into optimizing that.

dcjones commented 10 years ago

I did some work optimizing Compose and Gadfly today. As it stands @mokasin's example takes ~4.1 seconds (on the second call, the first takes 27 seconds). The optimizations mostly aren't specific to rectbin so Gadfly should overall be significantly faster.

Now I'm running up against the fundamental inefficiency of using a vector graphics system to work at the pixel level. To match the performance of imshow, it really needs to operate the same way: color a million pixels rather than draw a million rectangles. So I think the ultimate solution will be to implement direct support for bitmaps without going through Cairo. I'll leave this issue open until I get around to that.
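
To illustrate the "color a million pixels" idea, here is a minimal sketch in plain Julia that maps a matrix straight to 8-bit gray pixels and writes them as a binary PGM file, with no vector graphics involved. It is not how Gadfly or Compose work; the helper names to_gray and write_pgm and the choice of the PGM format are assumptions made for the example.

```julia
# Map each matrix entry, assumed to lie in [0, 1], to one 8-bit gray pixel.
to_gray(A::AbstractMatrix{<:Real}) = [round(UInt8, 255 * clamp(a, 0, 1)) for a in A]

# Write a binary PGM (P5) image: a short text header followed by the raw
# pixel bytes in row-major order.
function write_pgm(path::AbstractString, pixels::Matrix{UInt8})
    height, width = size(pixels)
    open(path, "w") do io
        print(io, "P5\n", width, " ", height, "\n255\n")
        # Julia matrices are column-major, so transpose to emit rows first.
        write(io, permutedims(pixels))
    end
    return path
end

A = rand(500, 500)
pixels = to_gray(A)
path = write_pgm(joinpath(mktempdir(), "matrix.pgm"), pixels)
```

The work here is one pass over the matrix plus one bulk write, which is why a pixel-level path can be so much cheaper than emitting one rectangle per cell.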

ssfrr commented 10 years ago

Awesome, thanks for the work on this!

For imagesc / imshow (drawing a matrix as an image), is rectbin the right approach, or spy, or something else?

dcjones commented 10 years ago

They ultimately do the same thing; spy is just shorthand to simplify plotting matrices and to be somewhat familiar to MATLAB users.
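
Conceptually, plotting a matrix with rectbin means flattening it into one (x, y, color) record per cell, which is what the original example in this thread builds by hand. Here is a minimal sketch of that flattening in plain Julia. The helper name matrix_to_long and its field names are hypothetical, and this is not a copy of what spy does internally.

```julia
# Flatten an n×m matrix into three equal-length columns:
# row index, column index, and the value stored in that cell.
function matrix_to_long(A::AbstractMatrix)
    n, m = size(A)
    rows = repeat(1:n, outer = m)   # 1, 2, ..., n repeated m times
    cols = repeat(1:m, inner = n)   # each column index repeated n times
    return (row = rows, col = cols, value = vec(A))
end

long = matrix_to_long([1.0 2.0; 3.0 4.0])
```

Because vec walks the matrix in column-major order, the row index varies fastest, matching the order of the index vectors. This avoids splatting one small array per cell, which becomes costly for large matrices.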

ssfrr commented 10 years ago

Just tried it with spy in an IJulia notebook:

m = rand(100, 100);
@time spy(m, Coord.cartesian(raster=true))
-----------
elapsed time: 0.000245426 seconds (253744 bytes allocated)
ctx not defined
 in drawpart at /Users/srussell/.julia/Compose/src/container.jl:343
 in draw at /Users/srussell/.julia/Compose/src/container.jl:278
 in writemime at /Users/srussell/.julia/Gadfly/src/Gadfly.jl:801
 in sprint at iostream.jl:229
 in display_dict at /Users/srussell/.julia/IJulia/src/execute_request.jl:31

ssfrr commented 10 years ago

Whoops, I hadn't checked out the latest master of Compose, so I was getting the same error with both spy and @mokasin's approach. Working now after pulling Compose.