JuliaPlots / StatsMakie.jl

Statistical visualizations based on high performance plotting package Makie

Add public API to deal with grouped data #17

Open piever opened 5 years ago

piever commented 5 years ago

Reminder: while GoG (Grammar of Graphics) assumes data is in long, tidy format, one could probably be more flexible by allowing more ways to construct a PlottableTable: there could be a public API to build PlottableTables manually from different data structures and grouping information. An added benefit would be getting some equivalent of plot(x, [y1 y2]) from Plots for free.

pdeffebach commented 5 years ago

Note that there are two issues here. The first is something that, to my knowledge, is not possible in ggplot or any other conventional plotting package: grouping based on two non-mutually exclusive dummy variables.

Say you want to graph a histogram of income for white people and hispanic people, but many people identify as both white and hispanic.

julia> df = DataFrame(income = randn(10), white = rand(Bool, 10), hispanic = rand(Bool, 10))
10×3 DataFrame
│ Row │ income     │ white │ hispanic │
│     │ Float64    │ Bool  │ Bool     │
├─────┼────────────┼───────┼──────────┤
│ 1   │ 0.490092   │ false │ true     │
│ 2   │ 1.05979    │ true  │ true     │
│ 3   │ 0.0334069  │ false │ true     │
│ 4   │ -0.391703  │ true  │ true     │
│ 5   │ -0.587518  │ true  │ false    │
│ 6   │ -1.02922   │ false │ true     │
│ 7   │ -0.0573893 │ true  │ false    │
│ 8   │ 2.3907     │ false │ true     │
│ 9   │ 1.08107    │ false │ false    │
│ 10  │ -0.324261  │ true  │ false    │

Grammar of Graphics assumes that categories are mutually exclusive: it would only allow grouping on a single categorical variable such as ethnicity.

What I would love to be able to do is a syntax along the lines of

plot(df, :income, Color = G([:white, :hispanic]))

Here, G is a function that makes it look like I did the following:

plot1 = @linq df |>
    stack([:white, :hispanic]) |>  # long format: adds :variable and :value columns
    where(:value) |>               # keep only rows where the dummy is true
    plot(:income, Color = :variable)

Note that the above scenario only works (I think) if both :white and :hispanic are dummy variables, so presumably any such function would have to check that this is the case.
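A minimal sketch of the stack-and-filter idea, including the dummy-variable check, using only Base Julia. The helper name stack_dummies, its signature, and the output column names value and group are hypothetical; nothing here is StatsMakie or DataFrames API, and the table is assumed to be a NamedTuple of equal-length column vectors.

```julia
# Hypothetical helper, not part of StatsMakie or DataFrames.
# `table` is assumed to be a NamedTuple of equal-length column vectors.
function stack_dummies(table::NamedTuple, value_col::Symbol, dummies::Vector{Symbol})
    # Refuse any grouping column that is not a Bool dummy variable.
    for d in dummies
        eltype(table[d]) === Bool ||
            throw(ArgumentError("column $d is not a Bool dummy variable"))
    end
    values = eltype(table[value_col])[]
    groups = Symbol[]
    # One pass per dummy: keep the rows where it is true.
    # A row that is true for several dummies is emitted once per dummy.
    for d in dummies
        for (v, flag) in zip(table[value_col], table[d])
            if flag
                push!(values, v)
                push!(groups, d)
            end
        end
    end
    return (value = values, group = groups)
end

table = (income   = [0.5, 1.0, -0.4, 1.1],
         white    = [false, true, true, false],
         hispanic = [true, true, false, false])
long = stack_dummies(table, :income, [:white, :hispanic])
```

Here the second row is true for both dummies, so it shows up in both groups, and the last row is true for neither, so it is dropped. That duplication is exactly what makes the groups non-exclusive.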

Perhaps this idea could be extended all the way to the grouping APIs themselves in JuliaDB and DataFrames. As far as I know, there isn't too much preventing a GroupedDataFrame from having non-mutually exclusive groups.

cc @nalimilan because this seems like something a demographer might have desired before.

I think that plot(x, [y1, y2]) is a related issue, but it would require a very different implementation, namely a flatten and a zip in some capacity (if in the end it's all GoG-like).
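One possible reading of "a flatten and a zip", sketched in Base Julia: repeat the shared x once per series, flatten the series end to end, and attach a series index so the result is in long format. The helper name long_format and the output column names are hypothetical and not part of any existing API.

```julia
# Hypothetical helper: shared x plus several y series -> long format.
function long_format(x::AbstractVector, ys::AbstractVector{<:AbstractVector})
    all(y -> length(y) == length(x), ys) ||
        throw(DimensionMismatch("every series must have the same length as x"))
    xs = repeat(x, length(ys))                         # x repeated once per series
    yflat = reduce(vcat, ys)                           # series flattened end to end
    series = repeat(1:length(ys), inner = length(x))   # series index of each point
    return (x = xs, y = yflat, series = series)
end

t = long_format([1, 2, 3], [[10, 20, 30], [40, 50, 60]])
```

The series column can then play the role of the grouping variable, as any other categorical column would in a GoG-style call.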

mkborregaard commented 5 years ago

This type of grouping is not part of grouping APIs because it's statistically invalid, and the approach listed leads to pseudoreplication. I strongly believe that your plots should honestly portray your data and that there should be a seamless correspondence between plots and statistics.

A statistically appropriate way (which is consistent with standard grouping) is to include a third factor level for those who self-identify as more than one ethnic group.
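A minimal sketch of that recoding in Base Julia, collapsing the two dummies into one mutually exclusive label per row. The function name and the label strings are hypothetical, and the "neither" level is an added assumption to cover rows where both dummies are false.

```julia
# Hypothetical helper: one mutually exclusive label per row.
function exclusive_label(white::Bool, hispanic::Bool)
    white && hispanic && return "both"   # the extra level for multiple identities
    white && return "white"
    hispanic && return "hispanic"
    return "neither"                     # assumption: rows with no dummy set
end

white    = [false, true, true, false]
hispanic = [true, true, false, false]
ethnicity = exclusive_label.(white, hispanic)   # broadcast over both columns
```

Each row now belongs to exactly one group, so ethnicity can be used as an ordinary grouping variable without counting anyone twice.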

nalimilan commented 5 years ago

I think there are situations where it's fine to represent stats for non-exclusive subgroups. For example, this can happen if you ask a battery of yes/no questions and want to see the characteristics of people who answered "yes" to each question. In this case it's neither practical nor interesting to have a level for each combination of possible answers.

That said, I'm not familiar enough with StatsMakie to have an opinion regarding the API.