Closed DominiqueMakowski closed 3 years ago
Ah interesting. The problem is that the dynamic binwidth must be determined when the grob created by the geom object is drawn, because it is only at that point that you have access to viewport information. If you're not familiar with ggplot internals, basically geoms create a bunch of grobs ("graphics objects"), which are objects that {grid} uses to draw stuff. When the geoms/grobs are created they don't know anything about viewport dimensions; this happens at a later step when the plot is actually rendered (this is why you can resize plots and whatnot even after creating them). It is only at this drawing step that you can do things that require information about viewport dimensions.
To deal with this in ggdist, a single geom_dots layer creates a single grob representing all the dotplots it is going to draw so that when it is drawn, it can use the plot dimensions to dynamically select a binwidth that works across all dotplots it draws. The snag you are hitting is that because you have to create two geom_dots objects (since they don't have the same value of side
, and possibly also not scale
?), this also creates two grobs and so the binwidth is selected separately for each dotplot.
Some possible solutions:
scale
and side
within a single geom_dots object. This is in theory possible but might be a bit of a hairy rewrite of some ggdist internals, since these things in ggdist are actually managed by a meta-geom that is a bit complicated and powers several other geoms. In principle I'm not opposed, but I'm not sure it's an easy instant fix.The less hairy option for you would be the best afaic :)
From my understanding, option 2) sounds like a way of addressing the two issues I mentioned at the same time, i.e., so that geom_binomdensity
is actually a real geom with all its advantages (inheritance of arguments, etc.) and the scale-thing is fixed.
It would indeed require us to understand a bit more how to use these grobs, but that's something that I foresee we'll have to do anyway (#135).
would require careful thought about what that API could look like
I suppose it's not as simple as putting the code in a separate bin_parameters()
π
?
(thanks a lot for your help)
Happy to help! This gives me a good reason to factor out binning stuff and clean it up a bit, which I've been meaning to do anyway.
I suppose it's not as simple as putting the code in a separate
bin_parameters()
π ?
Heh :). Sadly no, especially if I want to make some kind of commitment to maintaining a public API. Currently the bin detection code is tightly coupled with other dotplot stuff in the makeContent.dots_grob function, which is more or less what you'll have to make a custom variant of.
While that function looks like a beast, it's really just three main steps: (1) determine bin size, (2) calculate data positions using that bin size, (3) create the grobs. Most likely I will end up with a function for (1) and a function for (2); (3) is pretty straightforward on its own. Hopefully you will then be able to use those two functions to make a custom grob easily.
Could this be done more simply by adding an option to do a normal dotplot density but to run to flip the orientation?
Could this be done more simply by adding an option to do a normal dotplot density but to run to flip the orientation?
Not sure what you mean. Unless I'm misunderstanding that's essentially what the existing version does. Can you elaborate?
So right now I can do:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
library(ggdist)
RankCorr_u_tau %>%
filter(i %in% 1:2) %>%
ggplot() +
aes(x = u_tau, y = factor(i), fill = stat(x > 6)) +
stat_dotsinterval(quantiles = 100)
Created on 2021-06-14 by the reprex package (v2.0.0)
I think stat_dots
is calculating the two distributions of dots at the same time, then plotting them at their respective y axis heights. What I'm wondering is if an option to flip the orientation of one of the distributions so that it stacks down from the y height, rather than up, would be possible. This would still entail the same size computation, etc. I think.
Otherwise @DominiqueMakowski I think two dotplots like shown above with a sigmoid line would be the best obvious solution.
@mjskay So something like a direction
argument, with default to "positive"
and taking a vector of "positive"
and "negative"
values.
Yeah, that was my suggestion number 1:
- I make it so that you can vary both scale and side within a single geom_dots object. This is in theory possible but might be a bit of a hairy rewrite of some ggdist internals, since these things in ggdist are actually managed by a meta-geom that is a bit complicated and powers several other geoms. In principle I'm not opposed, but I'm not sure it's an easy instant fix.
direction
would basically be the side
parameter, which currently exists but can't be varied within-geom. Because side
interacts with a bunch of stuff (like geom positions --- dodging and jittering and whatnot) it might be a bit hairy to allow it to vary. I suppose I could add yet another parameter just for this use case, but it feels a little hacky given the redundancy between the two parameters.
Got it. Sorry I didn't follow before. I don't think a second parameter makes sense. How hairy do you think allowing side to vary a factor would be ?
no worries, I probably wasn't explaining the internals very clearly :).
How hairy do you think allowing side to vary a factor would be ?
It would be... interesting. Currently side is used in a bunch of places since it determines several things, including how slabs are scaled and what the interpretation of the justification
parameter is. You can get a sense of this by searching for side
in geom_slabinterval.R.
On the one hand it could be cool to be able to do this, and in theory I think it's possible, but I don't have a clear idea atm of the best way to re-arrange the internals to make it happen.
Aside: I have basically factored out most of the dynamic binning stuff into find_dotplot_binwidth()
(for dynamically finding the binwidth) and bin_dots()
(for actually getting the bins). The result is that the implementation of the grob itself is now much simpler: see here
The result is if you do want to go route 2 (a custom grob and geom) you should be able to base it on these two functions much more easily.
@bwiernik I think you've managed to nerd snipe me because I keep thinking about how to make side/scale vary within a geom. So I'm probably gonna give that a try soon.
Okay, if you install the aes-side branch of ggdist from github (devtools::install_github("mjskay/ggdist@aes-side")
) this should work now. I had to make side
, scale
, and justifcation
into aesthetics that can vary across the geom.
something like this should work:
p = plot(modelbased::estimate_relation(glm(sex ~ body_mass_g, data = palmerpenguins::penguins, family = "binomial")))
prop = as.vector(prop.table(xtabs(~ sex, palmerpenguins::penguins)))
p +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = prop[sex] * 0.9,
justification = ifelse(sex == "male", 1, 0)
),
data = palmerpenguins::penguins,
na.rm = TRUE,
color = "black",
fill = "gray50",
alpha = 0.5
) +
theme_light()
Wow @mjskay that's blazing fast and awesome π€© We'll try it out asap!
I think I installed correctly the branch, but I have this error:
library(ggplot2)
library(ggdist)
prop <- as.vector(prop.table(xtabs(~ sex, palmerpenguins::penguins)))
# Default look
ggplot() +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = prop[sex] * 0.9,
justification = ifelse(sex == "male", 1, 0)
),
data = palmerpenguins::penguins,
na.rm = TRUE,
color = "black",
fill = "gray50",
alpha = 0.5
)
#> Error in if (!all(d[[a]] == d[[a]][[1]])) {: missing value where TRUE/FALSE needed
Created on 2021-06-16 by the reprex package (v2.0.0)
The prop
object returns a unnamed vector, so I'm not sure how does it gets the "sex" info from it then?
ah sorry that should be:
prop = prop.table(xtabs(~ sex, palmerpenguins::penguins))
p +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = as.vector(prop[sex] * 0.9),
justification = ifelse(sex == "male", 1, 0)
),
data = palmerpenguins::penguins,
na.rm = TRUE,
color = "black",
fill = "gray50",
alpha = 0.5
) +
theme_light()
Same error π€
library(ggplot2)
library(ggdist)
data <- palmerpenguins::penguins
prop <- prop.table(xtabs(~ sex, data))
ggplot() +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = as.vector(prop[sex] * 0.9),
justification = ifelse(sex == "male", 1, 0)
),
data = data,
na.rm = TRUE
)
#> Error in if (!all(d[[a]] == d[[a]][[1]])) {: missing value where TRUE/FALSE needed
Created on 2021-06-16 by the reprex package (v2.0.0)
huh weird. Could you run this again to be sure:
devtools::install_github("mjskay/ggdist@aes-side")
and if that doesn't work send along the output of traceback()
and sessionInfo()
? Thanks!
nevermind your code is giving me an error here too, let me check it out
Ah, the problem was the presence of some NA
s in sex
meant that prop[sex]
yields NA
for some rows, which was not being properly handled. I have fixed that now, so if you update to the latest version (still on the aes-side
branch) you should see this output:
library(ggplot2)
library(ggdist)
data <- palmerpenguins::penguins
prop <- prop.table(xtabs(~ sex, data))
ggplot() +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = as.vector(prop[sex] * 0.9),
justification = ifelse(sex == "male", 1, 0)
),
data = data,
na.rm = TRUE
)
Which is the correct output for that input (since a slab can't be drawn with scale = NA
since its size would be unknown). Simple fix is to drop the NA
s prior to plotting:
library(ggplot2)
library(ggdist)
data <- palmerpenguins::penguins[!is.na(palmerpenguins::penguins$sex),]
prop <- prop.table(xtabs(~ sex, data))
ggplot() +
ggdist::geom_dots(
aes(
x = body_mass_g,
y = sex,
side = ifelse(sex == "male", "bottom", "top"),
scale = as.vector(prop[sex] * 0.9),
justification = ifelse(sex == "male", 1, 0)
),
data = data,
na.rm = TRUE
)
It works!
So I put it together with the necessary preprocessing into a little convenience function:
library(ggplot2)
library(ggdist)
geom_binomdensity <- function(data, x, y, ...) {
# Find y-axis levels
y_levels <- levels(as.factor(data[[y]]))
if(length(y_levels) != 2) {
stop("The y-variable should have exactly two levels.")
}
# Drop NaNs
vars <- c(x, y) # later can eventually add variables specified as color, fill, ...
data <- na.omit(data[vars])
# Other parameters
data$.side <- ifelse(data[[y]] == y_levels[1], "top", "bottom")
data$.justification <- ifelse(data[[y]] == y_levels[1], 0, 1)
prop <- prop.table(xtabs(paste("~", y), data))
data$.scale <- as.vector(prop[data[[y]]] * 0.9)
# ggdist geom
ggdist::geom_dots(
ggplot2::aes_string(
x = x,
y = y,
side = ".side",
justification = ".justification",
scale = ".scale"
),
data = data,
na.rm = TRUE,
...
)
}
It seems to work okay in most cases, although you would expect the bottom density to be "taller" in the bottom example (when there is a lot more females than males). Here it seems like the unbalancing leads to a decrease of the points size.
# Default case
data <- palmerpenguins::penguins
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex")
# Unbalanced
data[1:150, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex")
# More unblanaced
data[1:250, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex")
Created on 2021-06-17 by the reprex package (v2.0.0)
Another improvement would be to turn that convenience function into a real geom so that it can inherit from the data and aesthetics. Any quick suggestion on how to do that (if it's easy to do)?
Ah, I think I see the problem. Here's a modified version of your geom that adds two things: (1) some lines indicating where the dotplots should be scaled to be less than (to make sure scaling is working) and (2) a couple of other methods of choosing the scale (I explain below).
geom_binomdensity <- function(data, x, y, scale_method = c("density","prop","sqrt_prop"), ...) {
# Find y-axis levels
y_levels <- levels(as.factor(data[[y]]))
if(length(y_levels) != 2) {
stop("The y-variable should have exactly two levels.")
}
# Drop NaNs
vars <- c(x, y) # later can eventually add variables specified as color, fill, ...
data <- na.omit(data[vars])
# Other parameters
data$.side <- ifelse(data[[y]] == y_levels[1], "top", "bottom")
data$.justification <- ifelse(data[[y]] == y_levels[1], 0, 1)
prop <- switch(scale_method,
density = {
density_height <- sapply(split(data, data[[y]]), function(df) {
max(density(df[[x]], na.rm = TRUE)$y) * nrow(df)
})
density_height / sum(density_height)
},
prop = {
prop.table(xtabs(paste("~", y), data))
},
sqrt_prop = {
sqrt_prop <- sqrt(prop.table(xtabs(paste("~", y), data)))
sqrt_prop / sum(sqrt_prop)
}
)
# NOTE: I added as.character() here because I realized without it it may
# not select the correct value from prop if data[[y]] is a factor (since it
# would select based on the numeric version of the factor, not its level name)
data$.scale <- as.vector(prop[as.character(data[[y]])] * 0.9)
# ggdist geom
list(
ggdist::geom_dots(
ggplot2::aes_string(
x = x,
y = y,
side = ".side",
justification = ".justification",
scale = ".scale"
),
data = data,
na.rm = TRUE,
...
),
geom_hline(yintercept = c(prop[[1]], - prop[[2]]) * .9 + c(1,2))
)
}
The problem is that we shouldn't expect the heights of the dotplots to be proportional to group size, we should expect the areas to be. So for your example the dotplots are being correctly scaled, it's just that the proportions based on raw group size aren't quite right:
data <- palmerpenguins::penguins
data[1:300, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "prop")
My first thought was to instead scale by the relative heights of their densities, which improved things but not perfectly:
data <- palmerpenguins::penguins
data[1:300, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "density")
I also tried using a histogram as a density estimator with similar results.
So then I tried just using a dumb area-based heuristic of the square root of the proportion instead of the raw proportion:
data <- palmerpenguins::penguins
data[1:300, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "sqrt_prop")
That worked surprisingly well! Also seems to do okay if I change the proportion:
data <- palmerpenguins::penguins
data[1:200, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "sqrt_prop")
data <- palmerpenguins::penguins
data[1:100, "sex"] <- "female"
ggplot() + geom_binomdensity(data, x = "body_mass_g", y = "sex", scale_method = "sqrt_prop")
No idea how well this will work on other datasets. Ultimately the problem is that the real quantity we care about is the height of the tallest bin, which we don't know until draw time :). So I'd say easiest thing is to use a heuristic that seems to do well most of the time (maybe sqrt of proportion) and let the user set the scaling proportion manually if needed.
Re: the geom thing, you will probably want to create a geom that calls down to setup_params, setup_data, and draw_panel methods from ggdist::GeomDots. You can see an example of this kind of geom in ggplot2::geom_pointrange, which delegates most of its work to GeomLinerange (and GeomPoint): https://github.com/tidyverse/ggplot2/blob/master/R/geom-pointrange.r
You may also want to check out https://cran.r-project.org/web/packages/ggplot2/vignettes/extending-ggplot2.html
Square root of the proportion makes sense from a geometric perspective. I suggest we go with that.
Sorry for the delay on that, but I finally found some time to double-check and it works like a charm. Thanks a ton again @mjskay βΊοΈ
For now I've added ggdist to Suggest but we'll probably end up importing it as it'll be useful in other of our functions.
library(ggplot2)
#> Warning: package 'ggplot2' was built under R version 4.0.5
library(see)
data <- iris[1:100, ]
ggplot() +
geom_binomdensity(data, x = "Sepal.Length", y = "Species")
# Different scales
data[1:70, "Species"] <- "setosa" # Create unbalanced proportions
ggplot() +
geom_binomdensity(data, x = "Sepal.Length", y = "Species", scale = "auto")
ggplot() +
geom_binomdensity(data, x = "Sepal.Length", y = "Species", scale = "density")
ggplot() +
geom_binomdensity(data, x = "Sepal.Length", y = "Species", scale = "proportion")
ggplot() +
geom_binomdensity(data, x = "Sepal.Length", y = "Species",
scale = list("setosa" = 0.4, "versicolor" = 0.6))
Created on 2021-07-11 by the reprex package (v2.0.0)
The next step is to consolidate that (and the other "geoms" from see) and make it a real geom that takes an actual aesthetics call instead of strings... but this is a bit outside of my range; @bwiernik and @IndrajeetPatil I think this is more your jazz :)
Lovely! Glad to be of assistance :).
@IndrajeetPatil I've never built a geom before. Can you take care of this?
Okay, I can take the first stab at it.
FYI the version of ggdist supporting this (3.0) is now on CRAN. It includes a minor improvement compared to the examples above in that you should only need to set the side
aesthetic (justification
should be set correctly automatically based on side
if you omit it).
An idea for https://github.com/easystats/see/issues/135 related to https://github.com/easystats/modelbased/issues/120
Aim: a geom for dot-densities for binomial y variables. Mostly a helper to get nice geoms without the need of much manual parametrization.
A quick draft, made simply of assembling two ggdist's geoms, works well when the two levels are equally balanced:
Created on 2021-06-14 by the reprex package (v2.0.0)
However, when the amount of observations is unbalanced, it looks a bit worse, since the fixed scale makes the size of the points vary (which is expected - but maybe not the optimal solution?):
Created on 2021-06-14 by the reprex package (v2.0.0)
I tried adjusting the upper and lower scale based on the relative number of rows, but it doesn't make it much better (?):
Created on 2021-06-14 by the reprex package (v2.0.0)
(note that the points above are still of different sizes)
ggproto
magic, which is beyond me). Is it easy? Is it worth it?@mjskay @bwiernik