JuliaPlots / StatsPlots.jl

Statistical plotting recipes for Plots.jl

Simple 2d plots with custom statistics #218

Open Djoop opened 5 years ago

Djoop commented 5 years ago

As far as I understand, for 2d plots some commands compute statistics for each x-value of the data and plot them (like boxplot and violin), while others expect the user to pre-compute the statistics (like plot(x, yerror=…) from Plots, or even OHLC plots from Plots, which seem to me to be the same "kind" of plot as boxplots, so I would naturally expect the same interface). I am not aware, for instance, of any function that computes the mean and std for each value of x and then plots them (or, more generally, any other user-provided functions: one might want to plot the median with error bars showing the variance). Is this supported somehow, or is it really missing? One can pass a function to ribbon, but I don't think that is the same use case, as there is no "reduction" over x.

If it is not yet supported, I think it would be relevant to add it to StatsPlots, but I don't know what the best interface would be: should we create a new command for that, or rather add a new keyword argument to make clear whether the different y-values should be reduced or treated as independent points?

Here is a minimal example (with a different type for convenience):

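using DataFrames, Statistics, StatsPlots

# User recipe: forwards the two arguments (x, y) to the :plotstats series type.
@userplot PlotMulti
@recipe function f(h::PlotMulti)
    @assert length(h.args) == 2
    seriestype := :plotstats
    h.args
end
# Series recipe: reduces y for each unique x, then draws a line with error bars.
@recipe function f(::Type{Val{:plotstats}}, x, y, z;
                   y_statistic = median,
                   yerror_statistic = std)
    df = DataFrame(x = x, y = y)
    # Compute the "reduced" data: one row per unique value of x
    # (groupby/combine replaces the `by` function removed in DataFrames 1.0)
    r = combine(groupby(df, :x),
                :y => y_statistic => :staty,
                :y => yerror_statistic => :stdy,
                :y => length => :Nforx)
    sort!(r, :x)
    rx = r.x
    ry = r.staty
    ryerror = r.stdy
    # Line through the per-x statistic
    @series begin
        seriestype := :path
        x := rx
        y := ry
        primary := true
        ()
    end
    # Error bars around the same points
    @series begin
        seriestype := :yerror
        x := rx
        y := ry
        yerror := ryerror
        primary := false
        ()
    end
    nothing
end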

This is of course assuming there are "redundant" values for x, otherwise it becomes a different problem.

mkborregaard commented 5 years ago

You might want to check out @piever 's GroupedErrors package: https://github.com/piever/GroupedErrors.jl

piever commented 5 years ago

I also think this is a job for GroupedErrors (and in general it may be good to keep some decoupling between the plot and the data manipulation here). I'm planning to do some cleanup on GroupedErrors, as I think it uses more macros than it should, but I haven't had time for that yet, so it would be helpful to have specific feedback on the use case and syntax you'd like (I'm no expert on plotting terminology, so please explain things in detail: domain-specific acronyms like OHLC can be hard to guess for non-domain experts).

Ideally I'd prefer to not have this as a recipe as I think the computational part is substantial (computing summary statistics on grouped data), so I think the best way forward for a "GroupedErrors cleanup" is to add a simple interface to do this analysis and then call a plot command on the result (this way different plotting packages could be supported).
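
To make the "analysis first, plot afterwards" idea concrete, here is a minimal sketch using only the Statistics standard library. The name summarize_by_x and its keyword arguments stat and spread are hypothetical and only for illustration; this is not the GroupedErrors or StatsPlots API.

```julia
using Statistics  # median, std

# Hypothetical helper (an assumption for illustration, not an existing API):
# reduce y over each unique value of x with two user-provided functions.
function summarize_by_x(x, y; stat = median, spread = std)
    length(x) == length(y) || throw(ArgumentError("x and y must have the same length"))
    xs = sort(unique(x))
    # Collect the y-values belonging to each unique x, then reduce each group.
    groups = [y[x .== xi] for xi in xs]
    return (x = xs, y = map(stat, groups), err = map(spread, groups), n = map(length, groups))
end

x = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y = [1.0, 2.0, 3.0, 2.0, 4.0, 6.0, 5.0, 5.0, 5.0]
s = summarize_by_x(x, y)

# The result is plain data, so it can be saved and plotted (or restyled)
# without redoing the analysis, for example:
#   using Plots
#   plot(s.x, s.y, yerror = s.err)
```

Note that with the default spread = std, an x-value that occurs only once yields NaN, so the sketch is only meaningful when x-values are repeated.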

Djoop commented 5 years ago

OK, thanks, I was not aware of this package; it looks like it can indeed do a lot already. But is there really a justification for the use of macros and the specific syntax? I would include as much as possible as recipes to fit in the Plots.jl framework, but I guess your package is more generic and can serve different purposes. But are there other good reasons to decouple data manipulation from recipes, or is it just personal preference?

For my personal usage, I often have experiments that result in one huge table, and it is convenient to plot one column against one other, grouping according to some other columns, in one line. Most of the time I use 2d plots, but it would be convenient to do the same for color plots and contours (computing some statistic for the z variable); it should be really similar.
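
For the color plot/contour case, a rough sketch of the same reduction applied to z could look like this. The helper name summarize_on_grid and its stat keyword are hypothetical, and it assumes the data sits on a grid of repeated (x, y) values.

```julia
using Statistics  # mean

# Hypothetical helper (an assumption for illustration): reduce z over each
# (x, y) cell of the grid spanned by the unique x- and y-values.
# Cells without any data are left as NaN.
function summarize_on_grid(x, y, z; stat = mean)
    length(x) == length(y) == length(z) ||
        throw(ArgumentError("x, y and z must have the same length"))
    xs, ys = sort(unique(x)), sort(unique(y))
    # Rows follow ys and columns follow xs, the layout Plots.jl uses for
    # heatmap(x, y, Z) and contour(x, y, Z).
    Z = fill(NaN, length(ys), length(xs))
    for (j, xj) in enumerate(xs), (i, yi) in enumerate(ys)
        cell = z[(x .== xj) .& (y .== yi)]
        isempty(cell) || (Z[i, j] = stat(cell))
    end
    return (x = xs, y = ys, z = Z)
end

x = [1, 1, 1, 2, 2]
y = [10, 10, 20, 10, 10]
z = [1.0, 3.0, 5.0, 2.0, 6.0]
g = summarize_on_grid(x, y, z)

# The (x = 2, y = 20) cell has no data and stays NaN. Plotting is a separate step:
#   using Plots
#   heatmap(g.x, g.y, g.z)
```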

For OHLC, I've never used that either but just saw an example here: http://docs.juliaplots.org/latest/examples/gr/#openhighlowclose.

mkborregaard commented 5 years ago

> plot one column against one other, grouping according to some other columns, in one line.

Do you mean like @df mydf scatter(:x, :y, group = :z)?

piever commented 5 years ago

> But is there really a justification for the use of macros and the specific syntax?

No... I was simply very enthusiastic about macros at the time. I plan to rewrite it so that there is a "normal" way to access all the functionality, but I haven't gotten around to doing that yet.

> But are there other good reasons to decouple data manipulation from recipes, or is it just personal preference?

If the analysis takes time (taking summary statistics of large data), it feels wasteful to have to redo the analysis to change linewidth or markercolor (and some users may wish to save the result of the analysis). See https://github.com/JuliaPlots/StatPlots.jl/pull/30 for the initial discussion of the feature.

Djoop commented 5 years ago

> > plot one column against one other, grouping according to some other columns, in one line.
>
> Do you mean like @df mydf scatter(:x, :y, group = :z)?

Scatter can be useful, but no, I was thinking of plotting "some statistic" of a variable against (unique values of) another variable.

> If the analysis takes time (taking summary statistics of large data), it feels wasteful to have to redo the analysis to change linewidth or markercolor (and some users may wish to save the result of the analysis). See #30 for the initial discussion of the feature.

I understand that time can be an issue in some use cases; that makes sense. I'll try to read the discussion, as it will probably help me better understand the motivations. Thanks.