RFC: Plotting "non-standard" data

dcjones commented 9 years ago

I'm attempting to address what seems like the most common complaint about Gadfly: it expects data in flat columns ("long form" as opposed to "wide form"), and forces you either to reshape your data when it doesn't fit this mold, or to drop down to a lower level and manually set up layers. See for example: #89, #526, #529.

I was concerned that I'd have to invent a separate set of semantics for dealing with this, but I think I've hit upon a possible solution: reshaping should be part of the plotting pipeline. If we define Reshape plot elements that are applied to the data source before anything else happens, a lot of the ugliness goes away.

Here's a proof of concept in the "reshape" branch that implements a Reshape.stack transform.

# Plotting columns of a matrix as separate lines
M = rand(10, 10)
plot(M, x=:row, y=:value, group=:column, Reshape.stack, Geom.line)

# Plotting dataframe columns as separate lines
df = DataFrame(x=collect(0.0:0.1:0.9), y1=rand(10), y2=rand(10))
plot(df, x=:x, y=:value, color=:variable, Reshape.stack([:y1, :y2]), Geom.line)

# Reshape operations can be applied on a per-layer basis.
df = DataFrame(x=collect(0.0:0.1:0.9), y1=rand(10), y2=rand(10), z=rand(10))
plot(df, layer(x=:x, y=:z, Geom.point),
         layer(x=:x, y=:value, color=:variable, Reshape.stack([:y1, :y2]), Geom.line))

Reshape.stack is currently defined only for matrices and data frames, but it could be extended to higher-dimensional arrays or arrays of arrays. The current implementation is not efficient, but I believe it can be made so (i.e. allocating constant or at least sub-linear memory) with JuliaLang/julia#10507 and some other tricks.
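To make the matrix case concrete, here is a minimal sketch of what stacking a matrix into long form amounts to. It is not the code from the "reshape" branch: `stack_matrix` is a hypothetical name, and the column names `row`, `column` and `value` are only assumed to match the aesthetics used in the example above. Like the proof of concept, it allocates memory linear in the size of the matrix.

```julia
# Hypothetical helper, not Gadfly's Reshape.stack: flatten a matrix into
# three equal-length "long form" columns.
function stack_matrix(M::AbstractMatrix)
    nrows, ncols = size(M)
    row    = repeat(1:nrows, outer=ncols)  # row index cycles fastest
    column = repeat(1:ncols, inner=nrows)  # column index changes every nrows entries
    value  = vec(M)                        # column-major order, matching the indices above
    return (row=row, column=column, value=value)
end

M = [1 2; 3 4; 5 6]
long = stack_matrix(M)
```

Each entry of the matrix becomes one record `(row, column, value)`, which is the shape that `x=:row, y=:value, group=:column` expects.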

Feedback would be welcomed here.

Sisyphuss commented 9 years ago

It would be handy to be able to plot a Matrix directly (each column in a different colour) without first converting it to a DataFrame.

As a researcher, I frequently run simulations with different parameters, and the result is a tensor or an even higher-dimensional array. I then do various calculations along different dimensions to get several matrices for different uses. If I had to convert each matrix to an intermediate DataFrame every time I wanted to plot it, it would drive me crazy.

Moreover, DataFrame is not a built-in type in Julia, so knowing DataFrames is not a prerequisite for using Julia. Some users may just want to use Gadfly to draw pretty figures. They may be unfamiliar with DataFrames and unwilling to spend extra time learning them. So it does not make sense for Gadfly to depend heavily on the DataFrames package.

johansigfrids commented 9 years ago

This seems like a lot of interface for saying something relatively simple. Isn't plot(df, x=:x, y=:value, color=:variable, Reshape.stack([:y1, :y2]), Geom.line) just a complicated way of saying plot(df, x=:x, y=(:y1, :y2), Geom.line)?

It also requires the user to know the details of what stacking is and how to use it. Besides, if the user knows how to stack a DataFrame, it is shorter to just do it: plot(stack(df, [:y1, :y2]), x=:x, y=:value, color=:variable, Geom.line)
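For readers who have not met stacking before, the following is a rough sketch of the operation on a column table. It uses a NamedTuple of vectors rather than a DataFrame so that it needs no packages; `stack_columns` and its `id` keyword are hypothetical names, and only the output column names `variable` and `value` are taken from the examples in this thread.

```julia
# Hypothetical stand-in for stacking a column table; not the DataFrames API.
function stack_columns(table::NamedTuple, measure::Vector{Symbol}; id::Symbol)
    n = length(table[id])
    ids      = repeat(table[id], outer=length(measure))  # id column repeated once per stacked column
    variable = repeat(measure, inner=n)                  # name of the column each value came from
    value    = reduce(vcat, [table[m] for m in measure]) # stacked columns laid end to end
    return NamedTuple{(id, :variable, :value)}((ids, variable, value))
end

table = (x=[0.0, 0.1], y1=[1.0, 2.0], y2=[3.0, 4.0])
long = stack_columns(table, [:y1, :y2]; id=:x)
```

The two measurement columns collapse into a single `value` column, and `variable` records which column each value came from, which is what `color=:variable` then picks up.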

dcjones commented 9 years ago

@Sisyphuss I agree that we shouldn't force people to use DataFrames. I don't think there's a heavy dependency on DataFrames now. There's only a shortcut, plot(df, x=:foo, y=:bar), which is roughly equivalent to plot(x=df[:foo], y=df[:bar]).
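As an illustration of that shortcut, here is a small sketch of how symbol mappings could be resolved against a data source. It is not Gadfly's implementation: `resolve` and `resolve_mappings` are hypothetical names, and a NamedTuple of vectors stands in for the data frame.

```julia
# Hypothetical helpers: a Symbol is looked up as a column of `data`,
# anything else is assumed to be the data itself and passed through.
resolve(data, v) = v isa Symbol ? data[v] : v

function resolve_mappings(data; mapping...)
    return (; [k => resolve(data, v) for (k, v) in mapping]...)
end

data = (foo=[1, 2, 3], bar=[4.0, 5.0, 6.0])
resolved = resolve_mappings(data; x=:foo, y=:bar)
```

Under this reading the data source is only a lookup table for symbols, so arrays can still be passed directly and no DataFrame is required.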

@johansigfrids This is all true. I'd be fine just continuing to tell everyone to use melt or stack, but that doesn't generalize to matrices, high dimensional arrays, arrays of arrays, etc. And as evinced by @Sisyphuss, many are not keen on being forced to manually convert their data to a DataFrame.

If there's a consistent way this syntax can be made to work, it would be pretty appealing.

plot(df, x=:x, y=(:y1, :y2), Geom.line)

binarybana commented 9 years ago

For

plot(df, x=:x, y=(:y1, :y2), Geom.line)

Currently, eval_plot_mapping will store arrays of symbols as data. What if plot instead pre-filtered the mappings, looking for any whose value is an array of symbols? If one is found, we emit one layer per symbol instead of a single plot, similar to what @aviks did here manually.

Very rough pseudocode:

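function plot(data_source, elements...; mapping...)
    for (k, v) in mapping
        # A mapping such as y=[:y1, :y2] names several columns for one aesthetic
        if v isa AbstractVector{Symbol}
            # Keep every other mapping unchanged
            rest = [k2 => v2 for (k2, v2) in mapping if k2 != k]
            # Emit one layer per symbol, each binding k to a single column
            layers = [layer(elements...; rest..., k => x) for x in v]
            return plot(data_source, layers...)
        end
    end
    # normal plot here
end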
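The mapping-expansion step of that idea can be tried in isolation, without Gadfly. The sketch below is only an assumption about how the pre-filtering might look: `expand_mappings` is a hypothetical name, and it returns one NamedTuple of mappings per would-be layer instead of constructing layers.

```julia
# Hypothetical helper: if an aesthetic is bound to several symbols, return
# one set of mappings per symbol; otherwise return the mappings unchanged.
function expand_mappings(; mapping...)
    for (k, v) in mapping
        if v isa Union{Tuple{Vararg{Symbol}}, AbstractVector{Symbol}}
            # Every mapping except the multi-symbol one is shared by all layers
            rest = [k2 => v2 for (k2, v2) in mapping if k2 != k]
            return [(; rest..., k => x) for x in v]
        end
    end
    return [(; mapping...)]
end

per_layer = expand_mappings(x=:x, y=(:y1, :y2))
```

Only the first multi-symbol mapping is expanded here. Handling several at once, and choosing a distinguishing color per layer, would need a decision that the pseudocode leaves open.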

tbreloff commented 9 years ago

This is an old issue, but since it's still open and has the RFC tag, I'll chime in. The gist of this discussion is what made me start working on Plots. I wanted to add a "data processing layer" that can sit above these awesome packages and turn any kind of input into the plot that you want, without needing to understand complicated internals. I love Gadfly, but I could never actually use it for my day-to-day work because I found it much too cumbersome for building the one-off visualizations that I need. I want to toss a matrix into a plot function and let it figure out how to slice it up. I want to be able to change global settings so that I only have to type something once. Generally, I want the API to take anything I throw at it and just "know" what I want.

So a quick comparison of what you suggested above:

df = DataFrame(x=collect(0.0:0.1:0.9), y1=rand(10), y2=rand(10), z=rand(10))
plot(df, layer(x=:x, y=:z, Geom.point),
         layer(x=:x, y=:value, color=:variable, Reshape.stack([:y1, :y2]), Geom.line))

and how you do the same thing in Plots:

plot(0:.1:.9, rand(10,3); linetypes=[:scatter,:line,:line])

Under the hood, it's effectively doing the same thing, but it's abstracted away.

Many people (me included) typically want to talk in the "slang of graphics" as I've heard you call it, and I think it's valuable to have a translation layer between dataset and "grammar". This is also possibly a similar discussion to #680, in which there could be an additional aggregation layer of incoming data (R tree or similar) before it gets close to the lower-level layers/drawing. This way operations like zooming/panning or subset analysis (maybe updating a regression line based on a lasso-ed set of points) could be a little more efficient.

There are some basic data manipulations that I want to handle in the near term (grouping data, additional dimensions, etc) but I'm wondering... what are the big outstanding issues that are ripe for disruption? @timholy is doing cool stuff with Immerse, and it would be great to have the pipeline refined so that it's easy to extract analysis from intermediate steps. Thoughts?

bjarthur commented 7 years ago

closed by https://github.com/GiovineItalia/Gadfly.jl/pull/1013

GiovineItalia / Gadfly.jl

RFC: Plotting "non-standard" data #563