ProjectMOSAIC / ggformula

Provides a formula interface to 'ggplot2' graphics.

gf_percents() query #118

Closed nicholasjhorton closed 5 years ago

nicholasjhorton commented 5 years ago

Rebecca Andridge noted:

On Jan 4, 2019, at 3:13 PM, Andridge, Rebecca andridge.1@osu.edu wrote:

Hi,

Hopefully this is a quick question: when using the gf_percents() function to make a bar chart with 2 categorical variables, for example:

gf_percents( ~ bmicat | sex, data=fish)

what appears to get plotted are percentages of the whole sample, instead of the percentages within sex. This is unexpected behavior, at least to us, as the equivalent tally() command:

tally( ~ bmicat | sex, data=fish, format="percent")

gives the percent in each BMI category within each sex category.

Is this a bug? Is there a way to get the “clustered bar chart” showing the percentages within each group?

As a side note: we are in the process of converting our undergraduate biostatistics course from using JMP to R and are using your mosaic package – so far it seems a nice, slightly gentler, and more intuitive approach to R coding that we hope will not scare the students (many are math-phobic). Thank you!

Thanks, -Rebecca

I've added a reprex.

library(mosaic)
gf_percents( ~ substance | sex, data = HELPrct)

tally(~ substance | sex, format = "percent", data = HELPrct)
#>          sex
#> substance   female     male
#>   alcohol 33.64486 40.75145
#>   cocaine 38.31776 32.08092
#>   heroin  28.03738 27.16763

Created on 2019-01-14 by the reprex package (v0.2.1)

rpruim commented 5 years ago

I don't know if there is any easy solution for this because I'm not sure it is easy to do in ggplot2. (It is similar, but not quite identical, to the challenges presented in https://stackoverflow.com/questions/1376967/using-stat-function-and-facet-wrap-together-in-ggplot2-in-r.)

The implementation is basically this:

library(ggplot2)
ggplot(data = mosaicData::HELPrct) +
  geom_bar(aes(x = substance, y = stat(count / sum(count))), stat = "count") +
  facet_wrap( ~ sex)

Created on 2019-01-15 by the reprex package (v0.2.1)

The faceting is not involved in calculating the percents or proportions, so the counts are divided by the total for the whole sample, and I don't think it is easy (possible?) to calculate panel-specific sums of counts.
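To make the difference concrete, here is a small sketch in base R (not the ggformula internals) that applies both normalizations to the same table of counts. Dividing by the grand total mirrors `count / sum(count)` in the ggplot2 code above; dividing within each column matches what tally() reports. The object names are chosen here for illustration only.

```r
# Cross-tabulate substance (rows) by sex (columns)
counts <- table(mosaicData::HELPrct$substance, mosaicData::HELPrct$sex)

# Percent of the whole sample: all six cells together sum to 100
pct_overall <- 100 * prop.table(counts)

# Percent within sex: each column sums to 100
pct_within_sex <- 100 * prop.table(counts, margin = 2)

pct_overall
pct_within_sex
```

The second table should agree with the tally() output in the reprex above, while the first corresponds to the bar heights in the faceted plot.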

Of course, one could instead do the calculations first and then plot. But this isn't labeled nicely and is a bit clunky. The best way is probably to create a separate data frame with values and names as desired and to work from that.

library(mosaic)
df_stats(~ substance | sex, data = HELPrct, percs, format = "long") %>%
  gf_col(value ~ stat | sex)

Created on 2019-01-15 by the reprex package (v0.2.1)
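For comparison, the same "calculate first, then plot" idea can be sketched with dplyr and plain ggplot2. This is only one possible approach, not something ggformula does internally; the object name within_sex and the column name percent are choices made here.

```r
library(dplyr)
library(ggplot2)

# Count each substance within each sex, then convert to within-sex percents
within_sex <- mosaicData::HELPrct %>%
  count(sex, substance) %>%
  group_by(sex) %>%
  mutate(percent = 100 * n / sum(n)) %>%
  ungroup()

# geom_col() plots the precomputed values as-is, so faceting no longer
# affects the percentages
ggplot(within_sex) +
  geom_col(aes(x = substance, y = percent)) +
  facet_wrap(~ sex)
```

Because the percentages are computed before plotting, the level labels stay as the original substance names.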

I'll leave this open, but there may not be a particularly simple solution. Likely it would involve a complete rewrite that does the data wrangling internally -- that's very different from most of the functions in ggformula, which basically translate directly between the formula interface and native ggplot2, inheriting both desired and undesired features of ggplot2 in the process.

randridge commented 5 years ago

Thank you -- very helpful (and I should have known it would be tricky b/c I know it's not actually easy to plot directly with ggplot).

Here's a slight modification of your suggested solution that uses the tally() function:

library(mosaic)
percents <- as.data.frame(tally( ~ substance | sex, data = HELPrct, format = "percent")) %>%
  rename(Percent = Freq)
gf_col(Percent ~ substance | sex, data = percents)

The resulting data frame doesn't have the "perc_" prefix on the level labels, and since we are teaching students to use tally() to get the conditional percentages, it seems to me that students will find it logical to take output from that function, make a DF, and then plot it. The only weird thing is the "Freq" variable name, which has to be renamed to "Percent", but that is easy to do.
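A possible way to skip the rename step, assuming tally() returns a table object here (which the default "Freq" column name suggests): the table method of base R's as.data.frame() accepts a responseName argument that names the value column directly. This is a sketch of that variation, not a tested recommendation from the package authors.

```r
library(mosaic)

# responseName sets the name of the value column, replacing the default "Freq"
percents <- as.data.frame(
  tally( ~ substance | sex, data = HELPrct, format = "percent"),
  responseName = "Percent"
)

gf_col(Percent ~ substance | sex, data = percents)
```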