Open bwiernik opened 2 years ago
One standard check for bernoulli models is to do a binned binomial proportions plot (this is, loosely, a graphical equivalent of the Hosmer-Lemeshow test, and is implemented in arm::binnedplot()
Yep, that's implemented in performance as binned_residuals()
ooops, I missed your comment about binned_residuals()
. D'oh!
No worries! Thanks for your comment!
To be fair, though, it is kind of difficult to know the existence of this visual check for two reasons:
check_model()
output, but we don't include it there (as you mention in your original issue):library(performance)
model <- glm(vs ~ wt + mpg, data = mtcars, family = "binomial")
x <- check_model(model)
names(x)
#> [1] "VIF" "QQ" "HOMOGENEITY" "OUTLIERS" "INFLUENTIAL"
print
method, instead of plot
, but using plot
actually works and returns incorrect plot (cf. https://github.com/easystats/see/issues/119)print(binned_residuals(model))
#> Warning: Probably bad model fit. Only about 50% of the residuals are inside the error bounds.
plot(binned_residuals(model))
#> Warning: Probably bad model fit. Only about 50% of the residuals are inside the error bounds.
Created on 2021-12-09 by the reprex package (v2.0.1)
Should we incorporate the plot into check_model?
Yes, like I said in the first post, I think we need to change most of the plots in check_model() for GLMs.
I think we also might want to add a check_predictions() plot to check_model() for all models, perhaps dropping the normal density plot?
This isn't the place, but I may have some Opinions about binned_model
(I would like a smoothed line added, binomial CIs on the points; I would rather have binomial CIs on the individual points and a confidence interval on the smooth rather than have a dichotomizing region that indicates 'good' and 'bad' points ...). I see that this is copying what arm::binnedplot
does though ...
I like that. Have an example?
Not yet but could do one tomorrow.
That would be great!
Would it be easier to pull in https://github.com/florianhartig/DHARMa as its very flexible in terms of model type?
https://gist.github.com/bbolker/945e684dc148239364ab0f99c9a1beed
I realized that the residuals from binomials are not binomial (so constructing binomial CIs isn't useful); however I think that a binned plot using mean_cl_boot()
to generate CIs is probably fine. See examples and decide what you think is useful or not ...
I think it would be good to put check_predictions()
(pp_check) into the this plot array. For linear models, we have two normality checking plots. I think it would be good to swap out the density plot for a posterior predictions plot. Thoughts?
For binomial models, the constant variance plot should be omitted.
Which panel do you exactly mean? I think internally, the names are different from plot titles. I assume you mean the plot in the top right panel?
Yes, the top right one.
Current state:
library(performance)
model <- glm(vs ~ wt + mpg, data = mtcars, family = "binomial")
check_model(model)
Created on 2022-02-25 by the reprex package (v2.0.1)
I think we also might want to add a check_predictions() plot to check_model() for all models, perhaps dropping the normal density plot?
Only the predictions, not the range, right? (i.e. not what would be returned when check_predictions(model, check_range = TRUE)
, because that would be too much for such little space?)
@strengejacke TODO: use same colours for binned residuals as in other plots (e.g. blue/red like in VIF)
Only the predictions, not the range, right?
Yeah I agree.
I would say put predictions first, followed by linearity, homogeneity, outliers, normality, then multicollinearity
I really don't like multicollinearity checks at all, but can understand it would controversial to remove this plot from the array 😜
What about using DHARMa and it’s quartile residuals approach? The qqunif plot that results is very intuitive if you are used to qqplots.
What about using DHARMa and it’s quartile residuals approach? The qqunif plot that results is very intuitive if you are used to qqplots.
Do you mean for binomial models?
Yes? Or any glm. Also meant quantile not quartile. Stupid autocorrect. Ha!
library(performance)
model <- glm(vs ~ wt + mpg, data = mtcars, family = "binomial")
check_model(model)
m <- lm(mpg ~ wt + cyl + gear + disp, data = mtcars)
check_model(m)
Created on 2022-02-25 by the reprex package (v2.0.1)
@bwiernik should we change the labs for the pp-check panel? First, y and yrep don't sound clear to me. Second, if colours are changed, lines are probably no longer blue and grey. And there's no y and x axis title, leading to empty spaces (which is, however, aligned with the other plots).
Yeah, I can clarify the labels to match the other plots style
For non-bernoulli models, we should include a plot for checking overdispersion.
We don't have such a plot right now...
(comment moved to https://github.com/easystats/performance/issues/382)
We don't have such a plot right now...
I've seen 3 types of plots for diagnosing over dispersion. One plots conditional mean against conditional variance directly:
Another uses the typical scale-location plot. For this, we could just change the title of the homogeneity plot when it is used for a relevant GLM.
The other is DHARMa, which uses simulations to make qq plots and predicted vs residual plots:
Thoughts?
Thoughts on the above 3 plots? @strengejacke @DominiqueMakowski @IndrajeetPatil @mattansb @vincentarelbundock
For me, the first looks most intuitive
I like both the first as well as the QQ plot. The residual v. predicted with 3 lines has never really made sense to me.
You didn't ask me, but:
var ~ mean
), NB1 (var = c*mean, c>1
), NB2 (var = mu*(1+mu/k), k>0
), Gamma/log-normal (var = c*mean^2
), ... ? It would be even better if it were on a log-log scale so we could compare the power law of the mean-variance scaling to various reference lines (i.e. var prop to mean, prop to mean^2, etc.). (This discussion reminds me of the Cullen and Frey skew-kurtosis graph illustrated here (fitdistrplus::descdist()
), which I find interesting but not actually very useful ...
The qqplots can be useful for overdispersion because that shape with fat tails is a characteristic pattern of overdispersion. That said, I'm really not a fan of qqplots in general, and I agree that other plots like posterior predictive densities and fitted-residuals plots are probably more useful
I also like the first one (:
I don't think that (most people) can diagnose dispersion by looking a a qq-plot, but maybe that's just me (:
@bwiernik How can we create the first plot variance versus expected)? We may include this in the forthcoming update.
tl;dr I think we should go with the second plot below. What do you think?
Okay, so I looked into what SAS was doing with that first plot. Super disappointing. It's actually just plotting the model-predicted mean agains the model-predicted variance (so for a Poisson model, always a straight line). So there's no actual observed data there, only predictions. I wouldn't even call it a diagnostic plot -- you need to have already fit a model with over dispersion before it will confirm that, yes, there was over dispersion that needed to handled.
So I played around and came up with two ideas
In the top plot, we show the standardized residuals. These should follow a normal distribution, so most points should be within ±2 and almost no points should be outside ±4. There is a pretty big number larger that +2.
The second plot takes the squared residuals and fits a geom_smooth()
to them. This gives us the observed average squared residual (i.e., variance) for each Predicted value. It then shows the comparison against the model-predicted variance. In this plot, we see that the observed variance is much higher than expected at a bunch of ranges.
I think the second plot is probably better. It doesn't have individual points, but it is still based on the actual observed residuals, and I think it more clearly indicates the violation of the assumption.
What do you think?
Here is an example of the above data with a model better handling the dispersion (negative binomial, zero-inflated, with a variable dispersion model):
@bbolker Either of the above require computation of Pearson residuals and/or model-expected conditional variance. Do you think that a function could be added for the conditional variance in glmmTMB for the various count families, including with zero-inflation and variable dispersion (e.g., for nbinom2, V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob)
)?
2nd plot looks great, and is easy to understand even for non-experts. Can you share some code, so I can implement it in performance (and see), or do you want to do that?
I'll through some code up
docvisit <- readr::read_table(file.path("docvisit.txt"))
docvisit
library(glmmTMB)
mp <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
data = docvisit,
family = poisson()
)
mnb <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
data = docvisit,
family = nbinom2()
)
mzip <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
ziformula = ~ age,
data = docvisit,
family = poisson()
)
mzinb <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
ziformula = ~ age,
data = docvisit,
family = nbinom2()
)
mzinbd <- glmmTMB(
doctorco ~ sex + illness + income + hscore + age,
ziformula = ~ sex + illness + income + hscore + age,
dispformula = ~ sex + illness + income + hscore + age,
data = docvisit,
family = nbinom2()
)
ep <- modelbased::estimate_expectation(mp) |>
dplyr::mutate(
Residuals = -1 * Residuals,
Res2 = Residuals^2,
Disp = 0,
Prob = 0,
V = Predicted,
PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
StdRes = residuals(mp, type = "pearson")
)
enb <- modelbased::estimate_expectation(mnb) |>
dplyr::mutate(
Residuals = -1 * Residuals,
Res2 = Residuals^2,
Disp = predict(mnb, type = "disp"),
Prob = 0,
V = Predicted * (1 + Predicted / Disp),
PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
StdRes = residuals(mnb, type = "pearson")
)
ezip <- modelbased::estimate_expectation(mzip) |>
dplyr::mutate(
Residuals = -1 * Residuals,
Res2 = Residuals^2,
Disp = 0
Prob = predict(mzip, type = "zprob"),
V = Predicted * (1 - Prob) * (1 + Predicted * Prob),
PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
StdRes = Residuals / sqrt(V)
)
ezinb <- modelbased::estimate_expectation(mzinb) |>
dplyr::mutate(
Residuals = -1 * Residuals,
Res2 = Residuals^2,
Disp = predict(mzinb, type = "disp"),
Prob = predict(mzinb, type = "zprob"),
V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob),
PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
StdRes = Residuals / sqrt(V)
)
ezinbd <- modelbased::estimate_expectation(mzinbd) |>
dplyr::mutate(
Residuals = -1 * Residuals,
Res2 = Residuals^2,
Disp = predict(mzinbd, type = "disp"),
Prob = predict(mzinbd, type = "zprob"),
V = Predicted * (1 + Predicted / Disp) * (1 - Prob) * (1 + Predicted * (1 + Predicted / Disp) * Prob),
PredV = predict(lm(Res2 ~ poly(Predicted, 2))),
StdRes = Residuals / sqrt(V)
)
p1 <- ggplot(ezinbd) + aes(x = Predicted) +
geom_point(aes(y = StdRes)) +
geom_hline(
yintercept = c(-2, 2, -4, 4),
linetype = c("solid", "solid", "dashed", "dashed"),
color = c(rep(see::social_colors("green"), 2), rep(see::social_colors("blue"), 2))
) +
labs(
title = "Overdispersion and zero-inflation",
subtitle = "Most points should be within ±2 (green), few points outside ±4 (blue)",
x = "Predicted mean",
y = "Standardized resiuduals"
) +
see::theme_modern(plot.title.space = 2)
p2 <- ggplot(ezinbd) + aes(x = Predicted) +
geom_smooth(aes(y = V), color = see::social_colors("blue"), se = FALSE) +
geom_smooth(aes(y = Res2), color = see::social_colors("green")) +
labs(
title = "Overdispersion and zero-inflation",
subtitle = "Observed residual variance (green) should follow predicted residual variance (blue)",
x = "Predicted mean",
y = "Residual variance"
) +
see::theme_modern(plot.title.space = 2)
p1 / p2
ok, first model types are supported (missing will be added later). Some tweaking necessary, e.g. correct colors:
library(glmmTMB)
library(performance)
data("Salamanders")
m <- glmmTMB(
count ~ mined + spp + (1 | site),
family = poisson,
data = Salamanders
)
check_model(m)
#> `check_outliers()` does not yet support models of class glmmTMB.
Created on 2022-03-16 by the reprex package (v2.0.1)
Awesome. Can you link to where this is implemented?
library(readr)
docvisit <- read_table2("C:/Users/DL/Downloads/docvisit.txt")
#> Warning: `read_table2()` was deprecated in readr 2.0.0.
#> Please use `read_table()` instead.
#>
#> -- Column specification --------------------------------------------------------
#> cols(
#> sex = col_double(),
#> age = col_double(),
#> agesq = col_double(),
#> income = col_double(),
#> levyplus = col_double(),
#> freepoor = col_double(),
#> freerepa = col_double(),
#> illness = col_double(),
#> actdays = col_double(),
#> hscore = col_double(),
#> chcond1 = col_double(),
#> chcond2 = col_double(),
#> doctorco = col_double(),
#> nondocco = col_double(),
#> hospadmi = col_double(),
#> hospad = col_double(),
#> medicine = col_double(),
#> prescrib = col_double(),
#> nonpresc = col_double()
#> )
library(glmmTMB)
#> Warning: package 'glmmTMB' was built under R version 4.1.3
library(performance)
mp <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
data = docvisit,
family = poisson()
)
plot(check_overdispersion(mp))
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
mnb <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
data = docvisit,
family = nbinom2()
)
plot(check_overdispersion(mnb))
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
mzip <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
ziformula = ~ age,
data = docvisit,
family = poisson()
)
plot(check_overdispersion(mzip))
#> Warning: Cannot test for overdispersion, because pearson residuals are not implemented
#> for models with zero-inflation or variable dispersion.
#> Only the visual inspection using 'plot(check_overdispersion(model))' is
#> possible.
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
mzinb <- glmmTMB(
doctorco ~ sex + illness + income + hscore,
ziformula = ~ age,
data = docvisit,
family = nbinom2()
)
plot(check_overdispersion(mzinb))
#> Warning: Cannot test for overdispersion, because pearson residuals are not implemented
#> for models with zero-inflation or variable dispersion.
#> Only the visual inspection using 'plot(check_overdispersion(model))' is
#> possible.
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
mzinbd <- glmmTMB(
doctorco ~ sex + illness + income + hscore + age,
ziformula = ~ sex + illness + income + hscore + age,
dispformula = ~ sex + illness + income + hscore + age,
data = docvisit,
family = nbinom2()
)
plot(check_overdispersion(mzinbd))
#> Warning: Cannot test for overdispersion, because pearson residuals are not implemented
#> for models with zero-inflation or variable dispersion.
#> Only the visual inspection using 'plot(check_overdispersion(model))' is
#> possible.
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
#> `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Created on 2022-03-16 by the reprex package (v2.0.1)
Awesome. Can you link to where this is implemented?
This works for check_model()
, not yet for plot()
(because it requires the forthcoming update from see on CRAN). I'll commit the latest changes to working PR branches, so you can reproduce the plots above.
Ok, you need the latest insight from GitHub/r-universe, performance from this PR (https://github.com/easystats/performance/pull/404), and latest see from GitHub / r-universe. Then, the above examples work. If you install performance from master instead of PR, "only" the check_model()
including overdispersion plots work.
In future versions, we'll be able to have both plot-types for overdispersion...
library(glmmTMB)
library(performance)
data("Salamanders")
m <- glmmTMB(
count ~ mined + spp + (1 | site),
family = poisson,
data = Salamanders
)
check_model(m)
check_model(m, plot_type = 2)
This design is somewhat a bit at odds with our traditional opinionated API, rather than having different plot_types
, I'd just pick one version which we think it's the best and stick with it (until the next improvement)... or, if they show different stuff, have the two overdispersion plots side by side in the corner (subplots)?
Yeah, I think for now we stick to plot 1, the default. I just implemented that option, which is undocumented for now, bnut we could make it in a similar way as with normality plots, where we have 3 different types: https://easystats.github.io/see/articles/performance.html#check-for-normal-distributed-residuals
The current selection of plots returned by
check_model()
for GLMs aren't ideal in a few ways.1. They are missing a linearity check (fitted vs residuals). For binomial models, this should be a called tobinned_residuals()
. For other families, the standard check is fine.2. For binomial models, the constant variance plot should be omitted.4. For non-bernoulli models, we should include a plot for checking overdispersion.For the latter few points, the DHARMa package provides an easy-to-interpret approach for checking distributional assumptions from qq plots and problems with fitted vs residual plots using quantile residuals. We might consider soft-importing DHARMa or re-implementing those approaches. https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html https://github.com/florianhartig/DHARMa/issues/33