GiovineItalia / Gadfly.jl

Crafty statistical graphics for Julia.
http://gadflyjl.org/stable/

A cleaner, more readable syntax for plotting? #332

Open Aerlinger opened 10 years ago

Aerlinger commented 10 years ago

One of the frustrations I've always had with almost all plotting libraries is the verbose and obscure statements needed when producing rich plots, especially those with many annotations and layers.

Gadfly's syntax follows that of ggplot:

plot(data::AbstractDataFrame, elements::Element...; mapping...)

This format is good when dealing with very simple plots with only one layer, but it quickly gets out of hand when dealing with more complex plots. Shoving several arguments (almost all of which are optional keyword arguments) into a single function feels like an abuse of Julia's elegant syntax, especially considering that Julia isn't bound by the same syntactic constraints as R.

In my opinion, it's worth considering a more readable and declarative syntax by passing an anonymous function to plot() via a block.

For instance, consider the following example from the manual (http://dcjones.github.io/Gadfly.jl/geom_ribbon.html):

xs = 0:0.1:20

df_cos = DataFrame(
    x=xs,
    y=cos.(xs),
    ymin=cos.(xs) .- 0.5,
    ymax=cos.(xs) .+ 0.5,
    f="cos"
)

df_sin = DataFrame(
    x=xs,
    y=sin.(xs),
    ymin=sin.(xs) .- 0.5,
    ymax=sin.(xs) .+ 0.5,
    f="sin"
)

df = vcat(df_cos, df_sin)
p = plot(df, x=:x, y=:y, ymin=:ymin, ymax=:ymax, color=:f, Geom.line, Geom.ribbon)

Perhaps a syntax like this would be a better substitute:

df_cos = DataFrame(
    x=xs,
    y=cos.(xs),
    ymin=cos.(xs) .- 0.5,
    ymax=cos.(xs) .+ 0.5,
    f="cos"
)

df_sin = DataFrame(
    x=xs,
    y=sin.(xs),
    ymin=sin.(xs) .- 0.5,
    ymax=sin.(xs) .+ 0.5,
    f="sin"
)

p = plot do
  # Can also add something like a default layer as a basis for the default aesthetic to remove above duplication
  # base_layer(df_cos, x=:x, y=:y, color=:f, Geom.line, Geom.ribbon)
  # layer(df_cos)
  layer(df_cos, x=:x, y=:y, color="blue", Geom.line, Geom.ribbon)
  layer(df_sin, x=:x, y=:y, color="yellow", Geom.line, Geom.ribbon)

  # (ylim bounds could automatically be inferred from the range of y (ymin-max in this example))
end

The benefit becomes clearer when adding content to the plot:

p = plot do
  # *Note:* Perhaps we could also add default aesthetics here to remove some duplication in the arg calls
  # `base_layer(df_cos, x=:x, y=:y, color=:f, Geom.line, Geom.ribbon)`
  # `layer(df_sin)`  # Inherits aesthetics from the base layer, only overrides what is passed in

  layer(df_cos, x=:x, y=:y, color="blue", Geom.line, Geom.ribbon)
  layer(df_sin, x=:x, y=:y, color="yellow", Geom.line, Geom.ribbon)

  # Add title, labels, and change Theme
  title("Amplitude vs. Sample Number")
  x_label("Sample #")
  y_label("Amplitude")

  Scale.y_continuous(minvalue = -1.5, maxvalue = 1.5)
  label(.5, 1, "A static label at position .5, 1")

  Theme(panel_fill=color("black"), default_color=color("orange"))
end

# Above plot settings would not pollute the outer scope

typeof(p) #... Would still be a Gadfly Plot object.

draw(D3("plot.js", 6inch, 6inch), p)

There are several other benefits:

  1. Block would limit scope of variable declarations and any change of state
  2. Two-way compatibility with existing plot() function
  3. Single statement per line
  4. Don't need to rely on commas to delimit each statement
  5. Less coupling to the signature of the plot function (Plot definition could be independent of order of arguments passed in)
  6. Adheres to the language of Wilkinson's Grammar of Graphics

Anyhow, I hope these points are helpful and I'm not coming off as being too critical. In my opinion, Gadfly is the best plotting tool for Julia. However, if the syntax of Gadfly leveraged Julia's expressive syntax it could become a more appealing alternative to ggplot, Matlab or any other plotting tool.

Thoughts?

dcjones commented 10 years ago

I'm sympathetic to the need for a more incremental way of building complex plots. Of course ggplot2 has this with an overloaded addition operator, so you can do things like:

p <- ggplot(df, aes(x, y))
p <- p + geom_line()

I didn't copy that, not because I don't like defining plots incrementally, but rather I don't like commandeering arithmetic operators for non-arithmetic operations.

I like the gist of the syntax you're proposing, but I'm not sure how it could be implemented. With Julia's do syntax, this would get translated into something like:

p = plot(() -> begin
  layer(df_cos, x=:x, y=:y, color="blue", Geom.line, Geom.ribbon)
  layer(df_sin, x=:x, y=:y, color="yellow", Geom.line, Geom.ribbon)
end)

Only the last layer will be returned, with no way to associate the first layer with the plot.
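
This behavior can be checked without Gadfly. The sketch below is only an illustration: `layer` is a stand-in that returns a plain string, and `run_block` is a hypothetical helper that calls the block it receives; neither is part of Gadfly's API.

```julia
# Stand-in for Gadfly's layer(); it just returns a description string.
layer(name) = "layer($name)"

# Hypothetical helper that calls the zero-argument block it receives.
run_block(block::Function) = block()

# The do-block is sugar for passing an anonymous function as the first argument.
result = run_block() do
    layer("cos")   # this value is computed and then discarded
    layer("sin")   # only the last expression becomes the block's return value
end

println(result)
```

The first `layer` call has no way to reach the caller unless it registers itself somewhere as a side effect, which is why the block form needs either a plot object passed in or some other shared state.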

Here are two other possibilities:

Macro

Define a @plot macro that evaluates a bunch of statements in a block and splices them into a plot(...) call. So we'd do something like:

p = @plot begin
  layer(df_cos, x=:x, y=:y, color="blue", Geom.line, Geom.ribbon)
  layer(df_sin, x=:x, y=:y, color="yellow", Geom.line, Geom.ribbon)
end
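
A rough sketch of how such a macro might work, using toy stand-ins rather than Gadfly itself: here `layer` returns a string and `plot` just collects its arguments, and the `@plot` definition is an assumption about one possible implementation, not existing Gadfly code.

```julia
# Toy stand-ins (not Gadfly's API): a layer is a string, a plot is a vector of layers.
layer(name) = "layer($name)"
plot(layers...) = collect(layers)

# Hypothetical @plot: drop line-number nodes from the begin...end block and
# splice the remaining statements into a single plot(...) call.
macro plot(block)
    stmts = filter(arg -> !(arg isa LineNumberNode), block.args)
    return esc(:(plot($(stmts...))))
end

p = @plot begin
    layer("cos")
    layer("sin")
end

println(p)
```

Because every statement becomes an argument, a block written this way cannot contain ordinary statements such as assignments; a real macro would need to decide how to treat them.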

push!

Define a push! method over plots so they can be built incrementally.

p = plot()
push!(p, layer(df_cos, x=:x, y=:y, color="blue", Geom.line, Geom.ribbon))
push!(p, layer(df_sin, x=:x, y=:y, color="yellow", Geom.line, Geom.ribbon))
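
As a minimal sketch of what defining such a method involves, here is a toy version. `ToyPlot` is a hypothetical type invented for this example, and layers are represented as strings; Gadfly's actual Plot type is more involved.

```julia
# Toy plot type for illustration only.
struct ToyPlot
    layers::Vector{String}
end
ToyPlot() = ToyPlot(String[])

# Extend Base.push! so layers can be added one at a time. Returning the plot
# (as Base.push! does for collections) lets calls be nested or chained.
function Base.push!(p::ToyPlot, layer::String)
    push!(p.layers, layer)
    return p
end

p = ToyPlot()
push!(p, "cos layer")
push!(p, "sin layer")
println(p.layers)
```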

catawbasam commented 10 years ago

+1 for push!

johansigfrids commented 10 years ago

I think the @plot begin ... end seems like a very good syntax.

Aerlinger commented 10 years ago

I like the @plot begin ... end syntax as well, although macros sometimes make me a bit nervous. As to your point about the plot syntax, it would also be possible to pass the plot object to the block rather than using push!:

plot() do p
  layer(p, df_cos, x=:x, y=:y, color="blue", Geom.line, Geom.ribbon)
  layer(p, df_sin, x=:x, y=:y, color="yellow", Geom.line, Geom.ribbon)
end

Perhaps there's a way to treat the function as a thunk to avoid having to pass an unnecessary parameter to the block, although I'd have to think about it a bit more.
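
One possible way to get a parameterless block, sketched here purely as an assumption and not as anything Gadfly does, is to keep the plot under construction in task-local storage so that `layer` can find it. `ToyPlot`, the string layers, and the `:current_plot` key are all invented for this example.

```julia
# Toy plot type (hypothetical, not Gadfly's Plot).
struct ToyPlot
    layers::Vector{String}
end

# layer() looks up the plot currently being built instead of taking it as an argument.
# Calling it outside of a plot block throws a KeyError.
layer(name::String) = push!(task_local_storage(:current_plot).layers, name)

# plot(block) installs a fresh plot for the duration of the block, then returns it.
# The three-argument task_local_storage restores the previous state afterwards.
function plot(block::Function)
    p = ToyPlot(String[])
    task_local_storage(block, :current_plot, p)
    return p
end

p = plot() do
    layer("cos")
    layer("sin")
end

println(p.layers)
```

The trade-off is hidden state: the block reads cleanly, but it is no longer visible from the call site which plot each `layer` call modifies.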

StefanKarpinski commented 10 years ago

The push! approach feels better to me. One big advantage of not limiting pushability to a block scope is that you might want to add stuff after you evaluated the code that created the original plot. I'm not sure how that would work in IJulia, for example – would it update the original plot output or render the plot again with more stuff?

StefanKarpinski commented 10 years ago

Let me elaborate a little on "feels better": push! is simple and it's obvious what's happening. The reason for a block context like the one plot() do p ... end provides would be if there's cleanup that needs to happen after all the incremental statements defining the plot have occurred. If that's not the case, then the block context is gratuitous. The reason for a macro would be to do something fancy that transforms the inner code, which seems really gratuitous – it's better to make the mechanics of what's happening obvious and slightly more verbose than to save a little typing at the cost of making it completely opaque what's happening. If you're modifying a plot object, then it should look like that's what's happening.

ivarne commented 10 years ago

The macro seems like a pretty heavyweight solution to the problem of adding a , at the end of lines when using the plot function.

Aerlinger commented 10 years ago

"One big advantage of not limiting pushability to a block scope is that you might want to add stuff after you evaluated the code that created the original plot."

I agree, but the more I actually think about it the more I think push! may be unnecessary. Perhaps a plot should be considered immutable once created? Currently, I believe this is the case in Gadfly.

pygy commented 10 years ago

I vote for both :-).

The block scope makes the plot definition stand out, and it allows the rendering logic to be called on finalization. It can be cleanly built on top of the lower-level push!() API.

Passing a specialized layer function is visually more pleasing to me:

plot() do layer
  layer(df_cos, x=:x, y=:y, color="blue", Geom.line, Geom.ribbon)
  layer(df_sin, x=:x, y=:y, color="yellow", Geom.line, Geom.ribbon)
  SVG("myplot.svg", 6inch, 3inch)
end

# prototype. Note that I'm not familiar with Gadfly.
function plot(block::Function)
 p = Plot()
 layer(args...; kwargs...) = push!(p, args...; kwargs...)
 target = block(layer)
 if isa(target, RenderTarget) || (isdefined(Main, :Cairo) && isinteractive())
   isa(target, RenderTarget) || (target = defaulttarget) # Cairo
   draw(target, p)
 end
 p
end

EDIT: The block syntax is also convenient while tweaking a plot at the REPL. When navigating the history, you get the whole plot definition, you don't have to push!() many times... You could use a begin ... end block to that end, but I'm not sure that everyone will think about it. Providing the block API nudges users in the right direction.

snotskie commented 10 years ago

Casting my vote for push!. For completeness though, append! could have some uses, especially when two complex plots need to be made that share a number of layers.

append!(p, default_layers)
append!(q, default_layers)
push!(p, another_layer)
push!(q, different_layer)
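
A small self-contained sketch of this pattern, using a hypothetical `ToyPlot` type with string layers rather than Gadfly's own types:

```julia
# Toy plot type for illustration only.
struct ToyPlot
    layers::Vector{String}
end
ToyPlot() = ToyPlot(String[])

# push! adds one layer, append! adds a whole collection of layers.
Base.push!(p::ToyPlot, layer::String) = (push!(p.layers, layer); p)
Base.append!(p::ToyPlot, layers) = (append!(p.layers, layers); p)

default_layers = ["grid", "axes"]

p = ToyPlot()
q = ToyPlot()
append!(p, default_layers)
append!(q, default_layers)
push!(p, "another layer")
push!(q, "different layer")

println(p.layers)
println(q.layers)
```

Since `append!` copies the elements into each plot's own vector, the two plots share the defaults without later pushes to one affecting the other.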

dpastoor commented 10 years ago

What about developing a chain operator and then using that to add additional layers, rather than nesting everything in a for loop?

Check out Hadley's successor to ggplot (ggvis) for how they're handling it.

An example:

mtcars %>% 
  ggvis(~wt, ~mpg) %>%
  layer_points() %>%
  layer_model_predictions(model = "lm", se = TRUE)

and more in the ggvis cookbook

curvi commented 9 years ago

Generally the push/append notation has the benefit of being modifiable (especially in the REPL). But the argument of clarity doesn't count for me. It is clear to everyone that inside a huge function definition, the function is being defined, even over hundreds of lines (which is bad style, but still clear). So why would you rather say:

equation(strangesyntax:here, 1+1)
equation(strangesyntax:here, 1+1)

rather than:

I'm doing math now{
  1+1
  1+1
}

However simple the push command is, it will never be as readable as a block. And we are not talking about plot(sin,0,pi) here, but about rather sophisticated plots. So repeatedly saying who you want to address is exactly what should be avoided! For the above reasons, having both syntaxes would be a perfect combo, with the best of both worlds. But if one has to choose, a human-readable block is to be preferred! And adding a comma after each argument is definitely a bad thing when there may be many of them! And everybody publishing any plots should have encountered the unreadability of these plotting scripts.

neilpanchal commented 9 years ago

+1 push! I think it reads better and is more intuitive.

dcjones commented 9 years ago

@neilpanchal Thanks for the reminder about this. I just added a push! function.

The idea has been floated (for example in https://github.com/JuliaLang/julia/issues/11030) to make ++ a generic concatenation operator in Julia. If that happens, which I think would be reasonable, we might automatically get syntax like

plot(x=rand(10), y=rand(10)) ++
    Geom.line ++ Geom.point ++
    layer(x=rand(50), y=rand(50), Geom.hexbin)

That may satisfy some of those who want an operator for building plots. Short of that, I'm not in favor of adding any special operators to Gadfly. Cryptic, special-case syntax can be useful, but there needs to be a very compelling justification.

bramtayl commented 8 years ago

It's worth mentioning that chaining interfaces well with push!

using Gadfly
using Lazy
using DataFrames

up = DataFrame(x = [1, 2], y = [1, 2])
down = DataFrame(x = [1, 2], y = [2, 1])

@> begin
  plot()
  push!(@> up layer(x = :x,
                    y = :y,
                    Geom.line) )
  push!(@> down layer(x = :x,
                      y = :y,
                      Geom.line) )
end

Another push! alternative would be to modify layer so that it is inherently iterative.

p = plot()
p_up = layer(p, up, x=:x, y=:y, Geom.line)
p_both = layer(p_up, down, x=:x, y=:y, Geom.line)

It looks like iterative building works for layers but not elements? Could it be possible to do something like this?

p_data = plot(up)
p_aes = element(p_data, x = :x, y = :y)
p_line = element(p_aes, Geom.line)
p_point = element(p_aes, Geom.point)

or, even better,

p_data = plot_data(up)
p_aes = aes(p_data, x = :x, y = :y)
p_line = geom_line(p_aes)
p_point = geom_point(p_aes)

This kind of syntax would also interface well with chaining.

@> begin
  plot()
  layer( @> begin
          up
          plot_data
          aes(x = :x, y = :y)
          geom_line
        end )
  layer( @> begin
          down
          plot_data
          aes(x = :x, y = :y)
          geom_line
        end )
end

In fact, @dpastoor, this is exactly the framework used by ggvis.

The general strategy would be to build a set of functions which take a plot as an argument and return an enhanced plot. And in this case, a plot is simply a set of instructions and not linked to actual graphics until printed.
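
To make that strategy concrete, here is a toy sketch. Everything in it is hypothetical: `PlotSpec` is an invented immutable description of a plot, the data is just a name, and `plot_data`, `aes`, `geom_line` and `geom_point` mirror the names used above rather than any existing Gadfly functions.

```julia
# Hypothetical immutable plot description; nothing is drawn until it is rendered.
struct PlotSpec
    data::Symbol
    aesthetics::Dict{Symbol,Symbol}
    geoms::Vector{Symbol}
end

plot_data(data::Symbol) = PlotSpec(data, Dict{Symbol,Symbol}(), Symbol[])

# Each function takes a spec and returns a new, extended spec; the input is untouched.
function aes(p::PlotSpec; mapping...)
    PlotSpec(p.data, merge(p.aesthetics, Dict{Symbol,Symbol}(mapping)), p.geoms)
end
geom_line(p::PlotSpec) = PlotSpec(p.data, p.aesthetics, [p.geoms; :line])
geom_point(p::PlotSpec) = PlotSpec(p.data, p.aesthetics, [p.geoms; :point])

# The shared base spec can be reused for several variants.
p_aes = plot_data(:up) |> (p -> aes(p, x = :x, y = :y))
p_line = geom_line(p_aes)
p_point = geom_point(p_aes)

println(p_line.geoms, " ", p_point.geoms)
```

Because no function mutates its argument, `p_aes` can serve as the starting point for both the line and the point variant, which is the branching that the `p_data` / `p_aes` example relies on.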

Edit: Never mind, I think the strategy below works much better for this kind of thing.

using Lazy
using DataFrames
using Gadfly
using RDatasets

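# Accumulates positional and keyword arguments for a later function call.
struct Args
  pos::Vector{Any}
  key::Vector{Pair{Symbol,Any}}
end

Args() = Args(Any[], Pair{Symbol,Any}[])

# Return a new Args with the given arguments appended.
function add(args::Args, pos...; key...)
  Args(Any[args.pos..., pos...],
       Pair{Symbol,Any}[args.key..., key...])
end

# Apply fun to the accumulated arguments.
function call(args::Args, fun)
  fun(args.pos...; args.key...)
end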

@> begin
  Args()
  add(dataset("HistData", "ChestSizes"))
  add(x = "Chest", y = "Count")
  add(Geom.bar)
  call(plot)
end