daattali / ggExtra

📊 Add marginal histograms to ggplot2, and more ggplot2 enhancements
http://daattali.com/shiny/ggExtra-ggMarginal-demo/

Add densigram plot type #122

Closed crew102 closed 6 years ago

crew102 commented 6 years ago

This PR closes #118. A few notes:

library(ggplot2)

ggplot(mtcars, aes(wt)) +
  geom_histogram() +
  geom_density()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


ggplot(mtcars, aes(wt)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

daattali commented 6 years ago

That is some clever code.

I understand the reason for the commit that tries to put both plots on the same scale. I had a similar issue in an old project, and I had a slightly different approach (more complex), in which I tried making sure the peak of the curve matches the tallest histogram bar. Do you think that's better, or does summing up to 1 make more sense to you? That's what my client wanted at the time, but I don't have an opinion on which one makes more sense.

I think this is the relevant code from there:

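# `p` is assumed to be an existing ggplot object whose first layer is a histogram
pb <- ggplot_build(p)
# height (count) of the tallest histogram bar
factor <- max(pb$data[[1]]$count)
# rescale the density curve so that its peak matches the tallest bar
p <- p + geom_density(aes_string(y = paste0(factor, "*..scaled..")), size = 1)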
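
For readers who want to try the peak-matching idea on their own, here is a minimal, self-contained sketch. It is not the code from the project mentioned above: the `mtcars` data, the `bins = 30` setting, and the names `tallest_bar`, `p2`, and `curve_peak` are illustrative assumptions, and it uses `after_stat()` (ggplot2 3.3.0 or later) in place of the older `..scaled..` notation.

```r
library(ggplot2)

# Histogram on the count scale; bins are fixed to avoid the stat_bin() message
p <- ggplot(mtcars, aes(wt)) + geom_histogram(bins = 30)

# Height of the tallest bar, read from the first layer of the built plot
tallest_bar <- max(ggplot_build(p)$data[[1]]$count)

# `scaled` is the density estimate rescaled to a maximum of 1, so multiplying
# it by tallest_bar puts the peak of the curve at the top of the tallest bar
p2 <- p + geom_density(aes(y = after_stat(scaled * tallest_bar)))

# Peak of the rescaled curve, read from the second layer
curve_peak <- max(ggplot_build(p2)$data[[2]]$y)
```

With this scaling the histogram keeps its count axis, while the curve is only a shape overlay: its y values no longer represent a density.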
crew102 commented 6 years ago

Yeah, so my initial inclination was to go with the approach that you mentioned (i.e., have the peak of the density curve match the tallest histogram bar). I liked the idea of the density plot in a densigram looking exactly like it would in a plain density marginal plot, which you don't get when both plot types sum to 1. However, I think that it would be technically misleading if we don't have them both sum to 1. The reason for this is that, by putting both figure types (histogram and density) on the same axis, the implication is that the axis scale is the same for both types. In other words, we are implying that the y scale for the histogram is the same as the y scale for the density plot. If we don't have them both sum to 1, then technically there are two y axis scales that exist on the plot. Does that make sense?
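
The "same scale" argument can be checked numerically. The sketch below is an illustration in base R rather than code from this PR, and the variable names are hypothetical. It assumes that "sum to 1" means the total area is 1: the bar areas of a density-scaled histogram add up to 1, and the kernel density curve integrates to approximately 1.

```r
x <- mtcars$wt

# Histogram on the density scale; `breaks = 30` is only a suggestion to hist()
h <- hist(x, breaks = 30, plot = FALSE)
# Total bar area: bar height (density) times bar width, summed over all bars
hist_area <- sum(h$density * diff(h$breaks))

# Kernel density estimate with the default bandwidth
d <- density(x)
# Area under the curve, approximated with the trapezoid rule
dens_area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
```

Both areas come out at (approximately) 1, which is what allows the two layers to share a single y axis.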

daattali commented 6 years ago

It does make sense, but I disagree that it's important to keep both types on the same scale. They're inherently very different: histograms are understood to be on an absolute scale, because people know that the bars are as high as the number of observations. With density plots, the scale doesn't really mean much to anyone. It's the shape that you look at, and the axis is practically meaningless. I also searched Google Images for "histogram and density plot", and it seems that in most of them the peaks do roughly match (although admittedly that may be because some of those "density" curves are actually meant to be smooth approximations of the histogram?).

If you still feel that the current approach is better, I'll merge.

crew102 commented 6 years ago

Agreed that the y axis scale is pretty much meaningless for density plots, but I think that the same scale should be used (meaningless as it is) for both fig types if they are to appear alongside each other.

Regarding what is standard practice, I think you'll see that most of the examples that pop up from Google Images are in fact following the approach of having both plot types sum to 1. It seems like this isn't the case b/c most of the examples are plotting a smooth variable with support across its entire range, which results in the density line being at a similar height to the histogram bars (maybe the code snippet below will help clarify what I mean by this?).

library(ggplot2)

# here the density line is shorter than the histogram bars b/c there are "gaps" in 
# the distribution where no observations occur (i.e., no bins). the default 
# smoothing param for geom_density means these gaps have to be "covered," 
# so to speak, by the density line. this pulls the line down in those areas:
ggplot(mtcars, aes(wt)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.


# here not so much, b/c the data points are distributed according to a 
# continuous/smooth functional form:
ggplot(data.frame(x = rnorm(500)), aes(x)) +
  geom_histogram(aes(y = ..density..)) +
  geom_density()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

daattali commented 6 years ago

OK - could you just update the NEWS file?