geom_binomdensity - Githubissues

DominiqueMakowski commented 3 years ago

An idea for https://github.com/easystats/see/issues/135 related to https://github.com/easystats/modelbased/issues/120

Aim: a geom for dot-densities for binomial y variables. Mostly a helper to get nice geoms without the need of much manual parametrization.

A quick draft, made simply of assembling two ggdist's geoms, works well when the two levels are equally balanced:

geom_binomdensity <- function(data, x, y, scale = 0.45, ...) {

  # Find y-axis levels
  y_levels <- levels(as.factor(data[[y]]))
  if(length(y_levels) != 2) {
    stop("The y-variable should have exactly two levels.")
  }

  # Drop NaNs
  vars <- c(x, y) # later can eventually add variables specified as color, fill, ...
  data <- na.omit(data[vars])

  # Get upper and lower datas
  df_bottom <- data[data[[y]] == y_levels[1], ]
  df_top <- data[data[[y]] == y_levels[2], ]

  # Find adjusted scale
  if(is.null(scale)) {
    scale_bottom <- nrow(df_bottom) / nrow(data) * 0.9
    scale_top <- nrow(df_top) / nrow(data) * 0.9
  } else {
    scale_bottom <- scale
    scale_top <- scale
  }

  list(
    ggdist::geom_dots(
      aes_string(x = x, y = y),
      data = df_bottom,
      scale = scale_bottom,
      side = "top",
      ...
    ),
    ggdist::geom_dots(
      aes_string(x = x, y = y),
      data = df_top,
      scale = scale_top,
      side = "bottom",
      ...
    )
  )
}

# -----------------------------
library(ggplot2)
library(ggdist)

# Default look
ggplot() +
  geom_binomdensity(data = palmerpenguins::penguins,
                    x = "body_mass_g", y = "sex",
                    # Optional args (...)
                    color = "black", fill = "red", alpha = 0.5)

^{Created on 2021-06-14 by the reprex package (v2.0.0)}

However, when the amount of observations is unbalanced, it looks a bit worse, since the fixed scale makes the size of the points vary (which is expected - but maybe not the optimal solution?):

data <- palmerpenguins::penguins
data[1:150, "sex"] <- "female"

# Fixed scale
ggplot() +
  geom_binomdensity(data = data,
                    x = "body_mass_g", y = "sex",
                    scale = 0.45,
                    color = "black", fill = "red", alpha = 0.5)

^{Created on 2021-06-14 by the reprex package (v2.0.0)}

I tried adjusting the upper and lower scale based on the relative number of rows, but it doesn't make it much better (?):

# Adjusted scale
ggplot() +
  geom_binomdensity(data = data,
                    x = "body_mass_g", y = "sex",
                    scale = NULL, # scale = default
                    color = "black", fill = "red", alpha = 0.5)

^{Created on 2021-06-14 by the reprex package (v2.0.0)}

(note that the points above are still of different sizes)

It doesn't seem to be the best way to create a geom, since it's not very flexible (it doesn't inherit parameters from the plot), but I have no idea how to do it differently (google showed me some ggproto magic, which is beyond me). Is it easy? Is it worth it?
Better scale adjustments: in my mind, I see all dots (upper & lower) having the same size, which are chosen so that the total max-height is 1 (or 0.90, to prevent potential overlapping). So that, while consistent in terms of size, the actual height of the densities are reflective of the underlying proportion of observations. But is something like that even possible?

@mjskay @bwiernik

mjskay commented 3 years ago

Ah interesting. The problem is that the dynamic binwidth must be determined when the grob created by the geom object is drawn, because it is only at that point that you have access to viewport information. If you're not familiar with ggplot internals, basically geoms create a bunch of grobs ("graphics objects"), which are objects that {grid} uses to draw stuff. When the geoms/grobs are created they don't know anything about viewport dimensions; this happens at a later step when the plot is actually rendered (this is why you can resize plots and whatnot even after creating them). It is only at this drawing step that you can do things that require information about viewport dimensions.

To deal with this in ggdist, a single geom_dots layer creates a single grob representing all the dotplots it is going to draw so that when it is drawn, it can use the plot dimensions to dynamically select a binwidth that works across all dotplots it draws. The snag you are hitting is that because you have to create two geom_dots objects (since they don't have the same value of side, and possibly also not scale?), this also creates two grobs and so the binwidth is selected separately for each dotplot.

Some possible solutions:

I make it so that you can vary both scale and side within a single geom_dots object. This is in theory possible but might be a bit of a hairy rewrite of some ggdist internals, since these things in ggdist are actually managed by a meta-geom that is a bit complicated and powers several other geoms. In principle I'm not opposed, but I'm not sure it's an easy instant fix.
I expose the dynamic bin selection algorithm from ggdist so that you can easily construct your own custom grob + geom to make this chart. This would also be a bit hairy (and would require careful thought about what that API could look like). It would be probably less hairy than (1) for me but more work for you (since you'll have to make a grob and a geom).
I come up with a way to link two grobs together so that they calculate a shared binwidth dynamically at draw time. This is something I've already thought about a bit for linking binwidths across facets (currently not possible since different facets have different grobs on them). A basic sketch would be to use a reference class object to share info across grobs so that they can consider each others' data when dynamically picking bin widths. However, I haven't tried this and I'm not even sure it's possible.

DominiqueMakowski commented 3 years ago

The less hairy option for you would be the best afaic :)

From my understanding, option 2) sounds like a way of addressing the two issues I mentioned at the same time, i.e., so that geom_binomdensity is actually a real geom with all its advantages (inheritance of arguments, etc.) and the scale-thing is fixed.

It would indeed require us to understand a bit more how to use these grobs, but that's something that I foresee we'll have to do anyway (#135).

would require careful thought about what that API could look like

I suppose it's not as simple as putting the code in a separate bin_parameters() 😅?

(thanks a lot for your help)

mjskay commented 3 years ago

Happy to help! This gives me a good reason to factor out binning stuff and clean it up a bit, which I've been meaning to do anyway.

I suppose it's not as simple as putting the code in a separate bin_parameters() 😅?

Heh :). Sadly no, especially if I want to make some kind of commitment to maintaining a public API. Currently the bin detection code is tightly coupled with other dotplot stuff in the makeContent.dots_grob function, which is more or less what you'll have to make a custom variant of.

While that function looks like a beast, it's really just three main steps: (1) determine bin size, (2) calculate data positions using that bin size, (3) create the grobs. Most likely I will end up with a function for (1) and a function for (2); (3) is pretty straightforward on its own. Hopefully you will then be able to use those two functions to make a custom grob easily.

bwiernik commented 3 years ago

Could this be done more simply by adding an option to do a normal dotplot density but to run to flip the orientation?

mjskay commented 3 years ago

Could this be done more simply by adding an option to do a normal dotplot density but to run to flip the orientation?

Not sure what you mean. Unless I'm misunderstanding that's essentially what the existing version does. Can you elaborate?

bwiernik commented 3 years ago

So right now I can do:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(ggplot2)
library(ggdist)
RankCorr_u_tau %>%
  filter(i %in% 1:2) %>%
  ggplot() +
  aes(x = u_tau, y = factor(i), fill = stat(x > 6)) +
  stat_dotsinterval(quantiles = 100)

^{Created on 2021-06-14 by the reprex package (v2.0.0)}

I think stat_dots is calculating the two distributions of dots at the same time, then plotting them at their respective y axis heights. What I'm wondering is if an option to flip the orientation of one of the distributions so that it stacks down from the y height, rather than up, would be possible. This would still entail the same size computation, etc. I think.

bwiernik commented 3 years ago

Otherwise @DominiqueMakowski I think two dotplots like shown above with a sigmoid line would be the best obvious solution.

bwiernik commented 3 years ago

@mjskay So something like a direction argument, with default to "positive" and taking a vector of "positive" and "negative" values.

mjskay commented 3 years ago

Yeah, that was my suggestion number 1:

I make it so that you can vary both scale and side within a single geom_dots object. This is in theory possible but might be a bit of a hairy rewrite of some ggdist internals, since these things in ggdist are actually managed by a meta-geom that is a bit complicated and powers several other geoms. In principle I'm not opposed, but I'm not sure it's an easy instant fix.

direction would basically be the side parameter, which currently exists but can't be varied within-geom. Because side interacts with a bunch of stuff (like geom positions --- dodging and jittering and whatnot) it might be a bit hairy to allow it to vary. I suppose I could add yet another parameter just for this use case, but it feels a little hacky given the redundancy between the two parameters.

bwiernik commented 3 years ago

Got it. Sorry I didn't follow before. I don't think a second parameter makes sense. How hairy do you think allowing side to vary a factor would be ?

mjskay commented 3 years ago

no worries, I probably wasn't explaining the internals very clearly :).

How hairy do you think allowing side to vary a factor would be ?

It would be... interesting. Currently side is used in a bunch of places since it determines several things, including how slabs are scaled and what the interpretation of the justification parameter is. You can get a sense of this by searching for side in geom_slabinterval.R.

On the one hand it could be cool to be able to do this, and in theory I think it's possible, but I don't have a clear idea atm of the best way to re-arrange the internals to make it happen.

mjskay commented 3 years ago

Aside: I have basically factored out most of the dynamic binning stuff into find_dotplot_binwidth() (for dynamically finding the binwidth) and bin_dots() (for actually getting the bins). The result is that the implementation of the grob itself is now much simpler: see here

The result is if you do want to go route 2 (a custom grob and geom) you should be able to base it on these two functions much more easily.

mjskay commented 3 years ago

@bwiernik I think you've managed to nerd snipe me because I keep thinking about how to make side/scale vary within a geom. So I'm probably gonna give that a try soon.

bwiernik commented 3 years ago

Squidward rubbing his hands together saying "My plan is falling into place

mjskay commented 3 years ago

Okay, if you install the aes-side branch of ggdist from github (devtools::install_github("mjskay/ggdist@aes-side")) this should work now. I had to make side, scale, and justifcation into aesthetics that can vary across the geom.

something like this should work:

p = plot(modelbased::estimate_relation(glm(sex ~ body_mass_g, data = palmerpenguins::penguins, family = "binomial")))

prop = as.vector(prop.table(xtabs(~ sex, palmerpenguins::penguins)))

p +
  ggdist::geom_dots(
    aes(
      x = body_mass_g, 
      y = sex,
      side = ifelse(sex == "male", "bottom", "top"),
      scale = prop[sex] * 0.9,
      justification = ifelse(sex == "male", 1, 0)
    ),
    data = palmerpenguins::penguins,
    na.rm = TRUE,
    color = "black",
    fill = "gray50",
    alpha = 0.5
  ) +
  theme_light()

DominiqueMakowski commented 3 years ago

Wow @mjskay that's blazing fast and awesome 🤩 We'll try it out asap!

DominiqueMakowski commented 3 years ago

I think I installed correctly the branch, but I have this error:

library(ggplot2)
library(ggdist)

prop <- as.vector(prop.table(xtabs(~ sex, palmerpenguins::penguins)))

# Default look
ggplot() +
  ggdist::geom_dots(
    aes(
      x = body_mass_g,
      y = sex,
      side = ifelse(sex == "male", "bottom", "top"),
      scale = prop[sex] * 0.9,
      justification = ifelse(sex == "male", 1, 0)
    ),
    data = palmerpenguins::penguins,
    na.rm = TRUE,
    color = "black",
    fill = "gray50",
    alpha = 0.5
  )
#> Error in if (!all(d[[a]] == d[[a]][[1]])) {: missing value where TRUE/FALSE needed

^{Created on 2021-06-16 by the reprex package (v2.0.0)}

The prop object returns a unnamed vector, so I'm not sure how does it gets the "sex" info from it then?

mjskay commented 3 years ago

ah sorry that should be:

prop = prop.table(xtabs(~ sex, palmerpenguins::penguins))

p +
  ggdist::geom_dots(
    aes(
      x = body_mass_g, 
      y = sex,
      side = ifelse(sex == "male", "bottom", "top"),
      scale = as.vector(prop[sex] * 0.9),
      justification = ifelse(sex == "male", 1, 0)
    ),
    data = palmerpenguins::penguins,
    na.rm = TRUE,
    color = "black",
    fill = "gray50",
    alpha = 0.5
  ) +
  theme_light()

DominiqueMakowski commented 3 years ago

Same error 🤔

library(ggplot2)
library(ggdist)

data <- palmerpenguins::penguins

prop <- prop.table(xtabs(~ sex, data))

ggplot() +
  ggdist::geom_dots(
    aes(
      x = body_mass_g,
      y = sex,
      side = ifelse(sex == "male", "bottom", "top"),
      scale = as.vector(prop[sex] * 0.9),
      justification = ifelse(sex == "male", 1, 0)
    ),
    data = data,
    na.rm = TRUE
  )
#> Error in if (!all(d[[a]] == d[[a]][[1]])) {: missing value where TRUE/FALSE needed

^{Created on 2021-06-16 by the reprex package (v2.0.0)}

mjskay commented 3 years ago

huh weird. Could you run this again to be sure:

devtools::install_github("mjskay/ggdist@aes-side")

and if that doesn't work send along the output of traceback() and sessionInfo()? Thanks!

mjskay commented 3 years ago

nevermind your code is giving me an error here too, let me check it out

mjskay commented 3 years ago

Ah, the problem was the presence of some NAs in sex meant that prop[sex] yields NA for some rows, which was not being properly handled. I have fixed that now, so if you update to the latest version (still on the aes-side branch) you should see this output:

library(ggplot2)
library(ggdist)

data <- palmerpenguins::penguins

prop <- prop.table(xtabs(~ sex, data))

ggplot() +
  ggdist::geom_dots(
    aes(
      x = body_mass_g,
      y = sex,
      side = ifelse(sex == "male", "bottom", "top"),
      scale = as.vector(prop[sex] * 0.9),
      justification = ifelse(sex == "male", 1, 0)
    ),
    data = data,
    na.rm = TRUE
  )

Which is the correct output for that input (since a slab can't be drawn with scale = NA since its size would be unknown). Simple fix is to drop the NAs prior to plotting:

library(ggplot2)
library(ggdist)

data <- palmerpenguins::penguins[!is.na(palmerpenguins::penguins$sex),]

prop <- prop.table(xtabs(~ sex, data))

ggplot() +
  ggdist::geom_dots(
    aes(
      x = body_mass_g,
      y = sex,
      side = ifelse(sex == "male", "bottom", "top"),
      scale = as.vector(prop[sex] * 0.9),
      justification = ifelse(sex == "male", 1, 0)
    ),
    data = data,
    na.rm = TRUE
  )

DominiqueMakowski commented 3 years ago

It works!

So I put it together with the necessary preprocessing into a little convenience function:

library(ggplot2)
library(ggdist)

geom_binomdensity <- function(data, x, y, ...) {

  # Find y-axis levels
  y_levels <- levels(as.factor(data[[y]]))
  if(length(y_levels) != 2) {
    stop("The y-variable should have exactly two levels.")
  }

  # Drop NaNs
  vars <- c(x, y) # later can eventually add variables specified as color, fill, ...
  data <- na.omit(data[vars])

  # Other parameters
  data$.side <- ifelse(data[[y]] == y_levels[1], "top", "bottom")
  data$.justification <- ifelse(data[[y]] == y_levels[1], 0, 1)
  prop <- prop.table(xtabs(paste("~", y), data))
  data$.scale <- as.vector(prop[data[[y]]] * 0.9)

  # ggdist geom
  ggdist::geom_dots(
      ggplot2::aes_string(
        x = x,
        y = y,
        side = ".side",
        justification = ".justification",
        scale = ".scale"
      ),
      data = data,
      na.rm = TRUE,
      ...
    )
}

It seems to work okay in most cases, although you would expect the bottom density to be "taller" in the bottom example (when there is a lot more females than males). Here it seems like the unbalancing leads to a decrease of the points size.

# Default case
data <- palmerpenguins::penguins

ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex")

# Unbalanced
data[1:150, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex")

# More unblanaced
data[1:250, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex")

^{Created on 2021-06-17 by the reprex package (v2.0.0)}

Another improvement would be to turn that convenience function into a real geom so that it can inherit from the data and aesthetics. Any quick suggestion on how to do that (if it's easy to do)?

mjskay commented 3 years ago

Ah, I think I see the problem. Here's a modified version of your geom that adds two things: (1) some lines indicating where the dotplots should be scaled to be less than (to make sure scaling is working) and (2) a couple of other methods of choosing the scale (I explain below).

geom_binomdensity <- function(data, x, y, scale_method = c("density","prop","sqrt_prop"), ...) {

  # Find y-axis levels
  y_levels <- levels(as.factor(data[[y]]))
  if(length(y_levels) != 2) {
    stop("The y-variable should have exactly two levels.")
  }

  # Drop NaNs
  vars <- c(x, y) # later can eventually add variables specified as color, fill, ...
  data <- na.omit(data[vars])

  # Other parameters
  data$.side <- ifelse(data[[y]] == y_levels[1], "top", "bottom")
  data$.justification <- ifelse(data[[y]] == y_levels[1], 0, 1)
  prop <- switch(scale_method,
    density = {
      density_height <- sapply(split(data, data[[y]]), function(df) {
        max(density(df[[x]], na.rm = TRUE)$y) * nrow(df)
      })
      density_height / sum(density_height)
    },
    prop = {
      prop.table(xtabs(paste("~", y), data))
    },
    sqrt_prop = {
      sqrt_prop <- sqrt(prop.table(xtabs(paste("~", y), data)))
      sqrt_prop / sum(sqrt_prop)
    }
  )
  # NOTE: I added as.character() here because I realized without it it may
  # not select the correct value from prop if data[[y]] is a factor (since it
  # would select based on the numeric version of the factor, not its level name)
  data$.scale <- as.vector(prop[as.character(data[[y]])] * 0.9)

  # ggdist geom
  list(
    ggdist::geom_dots(
      ggplot2::aes_string(
        x = x,
        y = y,
        side = ".side",
        justification = ".justification",
        scale = ".scale"
      ),
      data = data,
      na.rm = TRUE,
      ...
    ),
    geom_hline(yintercept = c(prop[[1]], - prop[[2]]) * .9 + c(1,2))
  )
}

The problem is that we shouldn't expect the heights of the dotplots to be proportional to group size, we should expect the areas to be. So for your example the dotplots are being correctly scaled, it's just that the proportions based on raw group size aren't quite right:

data <- palmerpenguins::penguins
data[1:300, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "prop")

My first thought was to instead scale by the relative heights of their densities, which improved things but not perfectly:

data <- palmerpenguins::penguins
data[1:300, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "density")

I also tried using a histogram as a density estimator with similar results.

So then I tried just using a dumb area-based heuristic of the square root of the proportion instead of the raw proportion:

data <- palmerpenguins::penguins
data[1:300, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "sqrt_prop")

That worked surprisingly well! Also seems to do okay if I change the proportion:

data <- palmerpenguins::penguins
data[1:200, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "sqrt_prop")

data <- palmerpenguins::penguins
data[1:100, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "sqrt_prop")

No idea how well this will work on other datasets. Ultimately the problem is that the real quantity we care about is the height of the tallest bin, which we don't know until draw time :). So I'd say easiest thing is to use a heuristic that seems to do well most of the time (maybe sqrt of proportion) and let the user set the scaling proportion manually if needed.

Re: the geom thing, you will probably want to create a geom that calls down to setup_params, setup_data, and draw_panel methods from ggdist::GeomDots. You can see an example of this kind of geom in ggplot2::geom_pointrange, which delegates most of its work to GeomLinerange (and GeomPoint): https://github.com/tidyverse/ggplot2/blob/master/R/geom-pointrange.r

You may also want to check out https://cran.r-project.org/web/packages/ggplot2/vignettes/extending-ggplot2.html

bwiernik commented 3 years ago

Square root of the proportion makes sense from a geometric perspective. I suggest we go with that.

DominiqueMakowski commented 3 years ago

Sorry for the delay on that, but I finally found some time to double-check and it works like a charm. Thanks a ton again @mjskay ☺️

For now I've added ggdist to Suggest but we'll probably end up importing it as it'll be useful in other of our functions.

library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(see)

data <- iris[1:100, ]

ggplot() +
  geom_binomdensity(data, x = "Sepal.Length", y = "Species")

# Different scales
data[1:70, "Species"] <- "setosa" # Create unbalanced proportions

ggplot() +
  geom_binomdensity(data, x = "Sepal.Length", y = "Species", scale = "auto")

ggplot() +
  geom_binomdensity(data, x = "Sepal.Length", y = "Species", scale = "density")

ggplot() +
  geom_binomdensity(data, x = "Sepal.Length", y = "Species", scale = "proportion")

ggplot() +
  geom_binomdensity(data, x = "Sepal.Length", y = "Species",
                    scale = list("setosa" = 0.4, "versicolor" = 0.6))

^{Created on 2021-07-11 by the reprex package (v2.0.0)}

The next step is to consolidate that (and the other "geoms" from see) and make it a real geom that takes an actual aesthetics call instead of strings... but this is a bit outside of my range; @bwiernik and @IndrajeetPatil I think this is more your jazz :)

mjskay commented 3 years ago

Lovely! Glad to be of assistance :).

bwiernik commented 3 years ago

@IndrajeetPatil I've never built a geom before. Can you take care of this?

IndrajeetPatil commented 3 years ago

Okay, I can take the first stab at it.

mjskay commented 3 years ago

FYI the version of ggdist supporting this (3.0) is now on CRAN. It includes a minor improvement compared to the examples above in that you should only need to set the side aesthetic (justification should be set correctly automatically based on side if you omit it).

easystats / see

geom_binomdensity #147