Open nahorp opened 4 years ago
Sorry for the delay @nahorp in replying to this.
Let me look into that!
If I understand correctly, this 'bias adjustment' argument only affect GLM(M), i.e., non-linear (mixed) models?
library(lme4)
library(emmeans)
data <- iris
data$Petal.Length_factor <- as.factor(ifelse(data$Petal.Length < 4.2, "A", "B"))
model <- lme4::lmer(Sepal.Width ~ Species + (1 | Petal.Length_factor), data = data)
emmeans::emmeans(model, "Species", bias.adj = FALSE)
#> Species emmean SE df lower.CL upper.CL
#> setosa 3.60 0.191 1.11 1.676 5.52
#> versicolor 2.73 0.184 1.00 0.387 5.07
#> virginica 2.80 0.191 1.11 0.878 4.73
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
emmeans::emmeans(model, "Species", bias.adj = TRUE)
#> Species emmean SE df lower.CL upper.CL
#> setosa 3.60 0.191 1.11 1.676 5.52
#> versicolor 2.73 0.184 1.00 0.387 5.07
#> virginica 2.80 0.191 1.11 0.878 4.73
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
Created on 2020-05-21 by the reprex package (v0.3.0)
However, it changes for general models when the predictions are made on the response scale.
library(emmeans)
data <- iris
data$Petal.Length_factor <- as.factor(ifelse(data$Petal.Length < 4.2, "A", "B"))
model <- glm(Petal.Length_factor ~ Species, data = data, family="binomial")
emmeans::emmeans(model, "Species", type = "response", bias.adj = FALSE)
#> Species prob SE df asymp.LCL asymp.UCL
#> setosa 0.00 0.00000293 Inf 0.000000 1.000000
#> versicolor 0.62 0.06864401 Inf 0.479635 0.742805
#> virginica 1.00 0.00000293 Inf 0.000000 1.000000
#>
#> Confidence level used: 0.95
#> Intervals are back-transformed from the logit scale
emmeans::emmeans(model, "Species", type = "response", bias.adj = TRUE)
#> Species prob SE df asymp.LCL asymp.UCL
#> setosa 0.000000 0.0000036 Inf 0.000000 1.00000
#> versicolor 0.607228 0.0622312 Inf 0.481932 0.72185
#> virginica 1.000000 0.0000036 Inf 0.000000 1.00000
#>
#> Confidence level used: 0.95
#> Intervals are back-transformed from the logit scale
#> Bias adjustment applied based on sigma = 0.67212
Created on 2020-05-21 by the reprex package (v0.3.0)
As modelbased
by default returns the means/contrasts on the response scale, my naive question is, should we bias-adjust by default? 🤔 Pinging also @mattansb and @strengejacke
No idea... I've never used this option, and not sure what it really means...
I must say the explanation is quite obscure (truth I was hoping for your light 😁)
I'll look into it the next days, no time today.
If I understand correctly, this 'bias adjustment' argument only affect GLM(M), i.e., non-linear (mixed) models?
That's one aspect of it @DominiqueMakowski. So, the inference I made from reading those links is,
In an ordinary GLM, no bias adjustment is needed, nor is it appropriate, because the link function is just used to define a nonlinear relationship between the actual response mean η and the linear predictor.
Note that, in a generalized linear mixed model, including generalized estimating equations and such, there are additive random components involved, and then bias adjustment becomes appropriate
I think, this is the code to modify the link-inverse function to apply bias-adjustment:
function (link, sigma)
{
if (is.null(sigma))
stop("Must specify 'sigma' to obtain bias-adjusted back transformations",
call. = FALSE)
link$inv = link$linkinv
link$der = link$mu.eta
link$sigma22 = sigma^2/2
link$der2 = function(eta) with(link, 1000 * (der(eta + 5e-04) -
der(eta - 5e-04)))
link$linkinv = function(eta) with(link, inv(eta) + sigma22 *
der2(eta))
link$mu.eta = function(eta) with(link, der(eta) + 1000 *
sigma22 * (der2(eta + 5e-04) - der2(eta - 5e-04)))
link
}
Taking the mixed models example, I'm not sure whether it is more appropriate to rely on insight::get_variance()
?
library(emmeans)
#> Warning: package 'emmeans' was built under R version 3.6.3
require(lme4)
#> Loading required package: lme4
#> Warning: package 'lme4' was built under R version 3.6.3
#> Loading required package: Matrix
cbpp <- transform(cbpp, unit = 1:nrow(cbpp))
cbpp.glmer <- glmer(cbind(incidence, size - incidence) ~ period +
(1 | herd) + (1 | unit),
family = binomial, data = cbpp)
emm <- emmeans(cbpp.glmer, "period")
summary(emm, type = "response")
#> period prob SE df asymp.LCL asymp.UCL
#> 1 0.1824 0.0442 Inf 0.1109 0.2852
#> 2 0.0614 0.0230 Inf 0.0290 0.1252
#> 3 0.0558 0.0220 Inf 0.0254 0.1182
#> 4 0.0334 0.0172 Inf 0.0120 0.0894
#>
#> Confidence level used: 0.95
#> Intervals are back-transformed from the logit scale
total_sd <- sqrt(sqrt(0.89107^2 + 0.18396^2))
total_sd2 <- sqrt(insight::get_variance_residual(cbpp.glmer))
summary(emm, type = "response", bias.adjust = TRUE, sigma = total_sd)
#> period prob SE df asymp.LCL asymp.UCL
#> 1 0.2255 0.0464 Inf 0.1458 0.325
#> 2 0.0844 0.0299 Inf 0.0411 0.163
#> 3 0.0771 0.0289 Inf 0.0361 0.154
#> 4 0.0470 0.0235 Inf 0.0172 0.120
#>
#> Confidence level used: 0.95
#> Intervals are back-transformed from the logit scale
#> Bias adjustment applied based on sigma = 0.95387
summary(emm, type = "response", bias.adjust = TRUE, sigma = total_sd2)
#> period prob SE df asymp.LCL asymp.UCL
#> 1 0.3758 0.0538 Inf 0.2675 0.464
#> 2 0.1647 0.0538 Inf 0.0833 0.293
#> 3 0.1513 0.0528 Inf 0.0733 0.281
#> 4 0.0948 0.0456 Inf 0.0356 0.226
#>
#> Confidence level used: 0.95
#> Intervals are back-transformed from the logit scale
#> Bias adjustment applied based on sigma = 2.0209
Created on 2020-05-22 by the reprex package (v0.3.0)
Just to clarify, so this bias correction is only useful for non-linear mixed models right?
And also, I'm still having a hard time understanding what does it do intuitively
Just to clarify, so this bias correction is only useful for non-linear mixed models right?
No, if I understood right, you need it for mixed models in general, and for linear models with transformed response variable. It is about the residual variance. Predictions are affected when the response is transformed, because residual variance then is different from the same models without transformation (say, log) of the response. This is quite clear. However, it sems that simple back-transformation of the predicted values (say, exp) is not exact, because the residual error in the predictions of the log-transformed response cannot be easily "re-calculated" when back-transforming the predicted values.
In short: computation of estimated marginal means takes residual variance into account, but you cannot just "exp" the error term when back-transforming log-response, resulting in "biased" back-transformed estimated marginal means. This requires "bias adjustment".
For mixed models, we have residual variance on different levels (because multilevel and so, you know...), and that's why we need some bias correction here as well if we want to take the random effects uncertainty into account for predictions. You can see the differences in this vignette when you look at the results from type = "fe"
and type = "re"
. In ggeffects, only the uncertainty intervals are adjusted, because I did not know how to adjust the estimates (predicted values). Now reading this emmeans vignette shows how to adjust estimates as well, though it looks more like a "hack" than a "trustworthy" approach (due to the multiplication with 1000)...
all in all, it sounds like a good thing, so do you think we should set bias.adj
to TRUE by default? or should the fact that it's not the default in emmeans be a hint and rise caution?
sigma()
is not well defined for "most" mixed models, e.g. models with more complex random effects structures (like in the emmeans-vignette-example), thus I would suggest using insight::get_variance()
to get the residual variance. This, however, may take some time for larger models.
Therefor, I would set bias adjustment to FALSE by default, and add a message like 'consider bias adjustment' whenever we have a transformed response or a mixed model.
okay, and should we (re)suggest to Russel to consider using get_variance()?
No, I already did (see https://cran.r-project.org/web/packages/emmeans/vignettes/predictions.html#sd-estimate). He was open to the suggestion, but I think he is hesitant because he's not certain what get_variance()
internally does... I share this concern, because even I myself am not certain 😬 The initial code was written by Ben Bolker, and I asked him if I could use this in my package, and I only added functionality for more packages. The main computation of variances (except for zero-inflated negative binomial and zero-inflated poisson models) was all written by Ben.
@strengejacke your explanation is excellent! Thanks!
Some references, not sure if these are useful https://www.jstor.org/stable/2669622?seq=1 https://www.semanticscholar.org/paper/Correction-of-Retransformation-Bias-in-Nonlinear-on-Liu-Mc/69a23b65647adfa635e18d4fafd2145ba51f2f5b
Any changes in the status of this?
no, not yet.
If it is any help, I'm not sure 'bias' is the optimal description here. In general:
mean(Y) != ginv(mean(g(y)))
as a specific example:
mean(Y) != exp(mean(log(Y)))
this is not really 'bias' per se --- exp(mean(log(Y)))
is a meaningful and accurate quantity, it is just not the arithmetic mean of Y.
Anytime there are covariates and even without covariates in mixed models because of the random effects, predictions are always conditional, unless you go out of your way to get population average estimates marginalising over the conditioning factors, see for example: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-015-0046-6
See also the example of Ben between gee and glmer (i.e. between population averaged and subject specific) here: https://github.com/strengejacke/ggeffects/issues/215
The problem of averaging on the link scale being the default in most packages has led me to implement my own routines for predicting representative marginal responses from (especially logistic regression) models. My typical example/problem with the default behavior of averaging on the link scale is best described with a pratctical example
Assume we have a dataset with two groups of the same size, group A has 50-50 chance to experience an outcome, group B has a near 0 chance to experience that. A back of the envelope computation for the results in the group gives~25% of that outcome. However if we predict the marginal mean response for that group whihle averaging on link scale we may get much much lower values, because on the link scale 0 maps to -Infinity and 50% maps to 0. so the mean will be -Infinity ("arbitrarily small"). Applying the link to that will give me a perdiction of 0% outcome in the combined population, which is obviously false.
Here is an example (overlaid with proportions of the binned observed data, see #120 )
test <- tibble(
cat = factor(sample(c('A','B'),400,replace=TRUE)),
con = exp(rnorm(400,0,1)),
)%>%
mutate(
lp = 0.4*con-if_else(cat=='A',10,0),
pr = pmax(0.02,plogis(lp))
)%>%
mutate(
resp=map_int(pr,~rbernoulli(1,p=.x))
)
mod <- glm(resp~con+cat,data=test,family='binomial')
plot_logreg(mod,terms=c('con [n=40]'),transform_then_average=TRUE)+
geom_line(data=tibble(ggeffects::ggemmeans(mod,terms='con [n=40]')),aes(col='ggeffects',x=x,y=predicted,ymin=conf.low,ymax=conf.high))
One solution is of course to separate the predictions, but sometimes we may want to see the effect of another predictor on the total population marginalzed over the two groups.
I am wondering if "bias adjustment" adresses that. I think the "regrid" function in emmeans was supposed to adress this, but I found it difficult to make that work consistently.
I don't have much to add to the discussion but was led here from the other thread, and wanted to react to:
all in all, it sounds like a good thing, so do you think we should set bias.adj to TRUE by default? or should the fact that it's not the default in emmeans be a hint and rise caution?
I would definitely not make something the default unless we are 100% sure it is correct and we understand it very well.
I agree.
From Russ Lenth's docs:
In an ordinary GLM, no bias adjustment is needed, nor is it appropriate, because the link function is just used to define a nonlinear relationship between the actual response mean η and the linear predictor. That is, the back-transformed parameter is already the mean. [...] Note that, in a generalized linear mixed model, including generalized estimating equations and such, there are additive random components involved, and then bias adjustment becomes appropriate.
So it is only appropriate with transformed outcomes or with generalized mixed models (but not with linear mixed or with genialized fixed only).
For linear mixed is required as well, I'd say, because you want to adjust for RE variance as well. See my answer here: https://github.com/easystats/modelbased/issues/57#issuecomment-632588471
Accounting for the random effect's uncertainty in a linear model does not affect the prediction (only it's uncertainty). And so that is not a bias adjustment.
library(lme4)
#> Loading required package: Matrix
library(emmeans)
#> Warning: package 'emmeans' was built under R version 4.1.3
m_mixed <- lmer(Reaction ~ Days + (1|Subject),
data = sleepstudy)
emmeans(m_mixed, ~ Days, cov.red = range)
#> Days emmean SE df lower.CL upper.CL
#> 0 251 9.75 22.8 231 272
#> 9 346 9.75 22.8 325 366
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
# Same predictions (also CIs...)
Xrtra_disp <- sqrt(insight::get_variance(m_mixed)[["var.intercept"]])
emmeans(m_mixed, ~ Days, cov.red = range, bias.adjust = TRUE, sigma = Xrtra_disp)
#> Days emmean SE df lower.CL upper.CL
#> 0 251 9.75 22.8 231 272
#> 9 346 9.75 22.8 325 366
#>
#> Degrees-of-freedom method: kenward-roger
#> Confidence level used: 0.95
Created on 2022-06-25 by the reprex package (v2.0.1)
In a generalized linear mixed model, you do need to bias adjust in order to account for the bias from the fact that the mean random effect is not the population random effect (inv.link(mean(eta)) != E[mu]):
library(lme4)
#> Loading required package: Matrix
library(emmeans)
#> Warning: package 'emmeans' was built under R version 4.1.3
m_fixed <- glm(cbind(incidence, size - incidence) ~ period,
family = binomial, data = cbpp)
cbpp <- transform(cbpp, unit = 1:nrow(cbpp))
m_mixed <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd) + (1|unit),
family = binomial, data = cbpp)
emmeans(m_mixed, ~ period, regrid = "response")
#> period prob SE df asymp.LCL asymp.UCL
#> 1 0.1824 0.0442 Inf 0.095664 0.2691
#> 2 0.0614 0.0230 Inf 0.016308 0.1065
#> 3 0.0558 0.0220 Inf 0.012628 0.0989
#> 4 0.0334 0.0172 Inf -0.000374 0.0671
#>
#> Confidence level used: 0.95
Xrtra_disp <- sqrt(sum(insight::get_variance(m_mixed)[["var.intercept"]]))
emmeans(m_mixed, ~ period, regrid = "response", bias.adjust = TRUE, sigma = Xrtra_disp)
#> period prob SE df asymp.LCL asymp.UCL
#> 1 0.2216 0.0462 Inf 0.131094 0.3121
#> 2 0.0823 0.0292 Inf 0.025024 0.1397
#> 3 0.0751 0.0282 Inf 0.019779 0.1305
#> 4 0.0458 0.0230 Inf 0.000822 0.0908
#>
#> Confidence level used: 0.95
# These (true unbiased predictions) are closer to the bias adjusted predictions
emmeans(m_fixed, ~ period, regrid = "response")
#> period prob SE df asymp.LCL asymp.UCL
#> 1 0.2194 0.0248 Inf 0.1708 0.2681
#> 2 0.0802 0.0187 Inf 0.0436 0.1167
#> 3 0.0711 0.0183 Inf 0.0352 0.1069
#> 4 0.0452 0.0167 Inf 0.0125 0.0779
#>
#> Confidence level used: 0.95
Created on 2022-06-25 by the reprex package (v2.0.1)
Maybe the answer I contributed to this question in the CrossValidated forum will address some of these concerns -- or maybe not...
Hello,
I came across bias adjustment for frequentist models involving response transformations and/or random effects while reading the detailed vignette in the emmeans package and wondered if the same applied to Bayesian models as well. Indeed, emmeans have a small section on Bayesian models.
It would be great if these could be ported over to the relevant functions (
estimate_means
andestimate_contrasts
) here. Going a step further, it would be even better if thesigma
argument could be calculated automatically for common model objects from popular Bayesian R packages (such asrstanarm
,brms
,BayesFactor
)Thank you