easystats / modelbased

:chart_with_upwards_trend: Estimate effects, contrasts and means based on statistical models
https://easystats.github.io/modelbased/
GNU General Public License v3.0

'Bias adjusted' centrality and uncertainty measures for models with response transformations and/or random effects #57

Open nahorp opened 4 years ago

nahorp commented 4 years ago

Hello,

I came across bias adjustment for frequentist models involving response transformations and/or random effects while reading the detailed vignette in the emmeans package and wondered if the same applied to Bayesian models as well. Indeed, emmeans has a small section on Bayesian models.

It would be great if these could be ported over to the relevant functions (estimate_means and estimate_contrasts) here. Going a step further, it would be even better if the sigma argument could be calculated automatically for common model objects from popular Bayesian R packages (such as rstanarm, brms, or BayesFactor).
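
For instance, I imagine something along these lines (just a sketch: bmod stands for some already-fitted brms mixed model with a factor grp, and some_sigma is the quantity I would love to see computed automatically):

# Sketch only: `bmod` is an assumed, already-fitted brms (mixed) model with a factor `grp`;
# `some_sigma` is the relevant SD that would ideally be computed automatically
emm <- emmeans::emmeans(bmod, "grp")
summary(emm, type = "response", bias.adjust = TRUE, sigma = some_sigma)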

Thank you

DominiqueMakowski commented 4 years ago

Sorry for the delay in replying to this, @nahorp.

Let me look into that!

DominiqueMakowski commented 4 years ago

If I understand correctly, this 'bias adjustment' argument only affects GLM(M)s, i.e., non-linear (mixed) models?

LMER: same

      library(lme4)
      library(emmeans)

      data <- iris
      data$Petal.Length_factor <- as.factor(ifelse(data$Petal.Length < 4.2, "A", "B"))

      model <- lme4::lmer(Sepal.Width ~ Species + (1 | Petal.Length_factor), data = data)
      emmeans::emmeans(model, "Species", bias.adj = FALSE)
#>  Species    emmean    SE   df lower.CL upper.CL
#>  setosa       3.60 0.191 1.11    1.676     5.52
#>  versicolor   2.73 0.184 1.00    0.387     5.07
#>  virginica    2.80 0.191 1.11    0.878     4.73
#> 
#> Degrees-of-freedom method: kenward-roger 
#> Confidence level used: 0.95
      emmeans::emmeans(model, "Species", bias.adj = TRUE)
#>  Species    emmean    SE   df lower.CL upper.CL
#>  setosa       3.60 0.191 1.11    1.676     5.52
#>  versicolor   2.73 0.184 1.00    0.387     5.07
#>  virginica    2.80 0.191 1.11    0.878     4.73
#> 
#> Degrees-of-freedom method: kenward-roger 
#> Confidence level used: 0.95

Created on 2020-05-21 by the reprex package (v0.3.0)

However, it makes a difference for generalized models when the predictions are made on the response scale.

GLM

    library(emmeans)

    data <- iris
    data$Petal.Length_factor <- as.factor(ifelse(data$Petal.Length < 4.2, "A", "B"))

    model <- glm(Petal.Length_factor ~ Species, data = data, family="binomial")

    emmeans::emmeans(model, "Species", type = "response", bias.adj = FALSE)
#>  Species    prob         SE  df asymp.LCL asymp.UCL
#>  setosa     0.00 0.00000293 Inf  0.000000  1.000000
#>  versicolor 0.62 0.06864401 Inf  0.479635  0.742805
#>  virginica  1.00 0.00000293 Inf  0.000000  1.000000
#> 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the logit scale
    emmeans::emmeans(model, "Species", type = "response", bias.adj = TRUE)
#>  Species        prob        SE  df asymp.LCL asymp.UCL
#>  setosa     0.000000 0.0000036 Inf  0.000000   1.00000
#>  versicolor 0.607228 0.0622312 Inf  0.481932   0.72185
#>  virginica  1.000000 0.0000036 Inf  0.000000   1.00000
#> 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the logit scale 
#> Bias adjustment applied based on sigma = 0.67212

Created on 2020-05-21 by the reprex package (v0.3.0)

As modelbased by default returns the means/contrasts on the response scale, my naive question is, should we bias-adjust by default? 🤔 Pinging also @mattansb and @strengejacke

mattansb commented 4 years ago

No idea... I've never used this option, and not sure what it really means...

DominiqueMakowski commented 4 years ago

I must say the explanation is quite obscure (truth be told, I was hoping you could shed some light 😁)

strengejacke commented 4 years ago

I'll look into it in the next few days, no time today.

nahorp commented 4 years ago

If I understand correctly, this 'bias adjustment' argument only affects GLM(M)s, i.e., non-linear (mixed) models?

That's one aspect of it @DominiqueMakowski. So, the inference I made from reading those links is,

  1. Anything with 'random' effects (i.e., mixed models) is affected (i.e., LMMs and GLMMs)
  2. Anything where the DV is transformed (i.e., response transformation) outside of the use of the link function is affected.
  3. Thus, LMs and GLMs without response transformations are the only models not affected.

strengejacke commented 4 years ago

In an ordinary GLM, no bias adjustment is needed, nor is it appropriate, because the link function is just used to define a nonlinear relationship between the actual response mean η and the linear predictor.

Note that, in a generalized linear mixed model, including generalized estimating equations and such, there are additive random components involved, and then bias adjustment becomes appropriate.

strengejacke commented 4 years ago

I think this is the code that modifies the link-inverse function to apply the bias adjustment:

function (link, sigma) 
{
  if (is.null(sigma)) 
    stop("Must specify 'sigma' to obtain bias-adjusted back transformations", 
      call. = FALSE)
  link$inv = link$linkinv
  link$der = link$mu.eta
  link$sigma22 = sigma^2/2
  link$der2 = function(eta) with(link, 1000 * (der(eta + 5e-04) - 
    der(eta - 5e-04)))
  link$linkinv = function(eta) with(link, inv(eta) + sigma22 * 
    der2(eta))
  link$mu.eta = function(eta) with(link, der(eta) + 1000 * 
    sigma22 * (der2(eta + 5e-04) - der2(eta - 5e-04)))
  link
}
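
If I read this right, it is just a second-order Taylor correction of the inverse link: instead of back-transforming with linkinv(eta), it uses linkinv(eta) + sigma^2/2 * linkinv''(eta), with the second derivative taken numerically with step 5e-4 (hence the factor 1000 = 1/(2 * 5e-4)). A minimal sketch of applying the same adjustment by hand for a logit link (sigma and eta are made-up values for illustration):

link  <- make.link("logit")
sigma <- 0.67
eta   <- 0.5  # a made-up value on the link (logit) scale

# numerical second derivative of the inverse link, step = 5e-4
# (1000 = 1 / (2 * 5e-4), matching the code above)
d2_inv <- 1000 * (link$mu.eta(eta + 5e-4) - link$mu.eta(eta - 5e-4))

link$linkinv(eta)                         # naive back-transformation, ~0.622
link$linkinv(eta) + sigma^2 / 2 * d2_inv  # bias-adjusted back-transformation, ~0.610
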
strengejacke commented 4 years ago

Taking the mixed-models example, I'm not sure whether it would be more appropriate to rely on insight::get_variance().

library(emmeans)
#> Warning: package 'emmeans' was built under R version 3.6.3
require(lme4)
#> Loading required package: lme4
#> Warning: package 'lme4' was built under R version 3.6.3
#> Loading required package: Matrix
cbpp <- transform(cbpp, unit = 1:nrow(cbpp))
cbpp.glmer <- glmer(cbind(incidence, size - incidence) ~ period + 
                      (1 | herd) +  (1 | unit),
                    family = binomial, data = cbpp)

emm <- emmeans(cbpp.glmer, "period")
summary(emm, type = "response")
#>  period   prob     SE  df asymp.LCL asymp.UCL
#>  1      0.1824 0.0442 Inf    0.1109    0.2852
#>  2      0.0614 0.0230 Inf    0.0290    0.1252
#>  3      0.0558 0.0220 Inf    0.0254    0.1182
#>  4      0.0334 0.0172 Inf    0.0120    0.0894
#> 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the logit scale

total_sd <- sqrt(sqrt(0.89107^2 + 0.18396^2))
total_sd2 <- sqrt(insight::get_variance_residual(cbpp.glmer))

summary(emm, type = "response", bias.adjust = TRUE, sigma = total_sd)
#>  period   prob     SE  df asymp.LCL asymp.UCL
#>  1      0.2255 0.0464 Inf    0.1458     0.325
#>  2      0.0844 0.0299 Inf    0.0411     0.163
#>  3      0.0771 0.0289 Inf    0.0361     0.154
#>  4      0.0470 0.0235 Inf    0.0172     0.120
#> 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the logit scale 
#> Bias adjustment applied based on sigma = 0.95387
summary(emm, type = "response", bias.adjust = TRUE, sigma = total_sd2)
#>  period   prob     SE  df asymp.LCL asymp.UCL
#>  1      0.3758 0.0538 Inf    0.2675     0.464
#>  2      0.1647 0.0538 Inf    0.0833     0.293
#>  3      0.1513 0.0528 Inf    0.0733     0.281
#>  4      0.0948 0.0456 Inf    0.0356     0.226
#> 
#> Confidence level used: 0.95 
#> Intervals are back-transformed from the logit scale 
#> Bias adjustment applied based on sigma = 2.0209

Created on 2020-05-22 by the reprex package (v0.3.0)

DominiqueMakowski commented 4 years ago

Just to clarify, so this bias correction is only useful for non-linear mixed models, right?

And also, I'm still having a hard time understanding what it does, intuitively.

strengejacke commented 4 years ago

Just to clarify, so this bias correction is only useful for non-linear mixed models, right?

No, if I understood right, you need it for mixed models in general, and for linear models with a transformed response variable. It is about the residual variance. Predictions are affected when the response is transformed, because the residual variance then differs from that of the same model without the transformation (say, log) of the response. This is quite clear. However, it seems that simple back-transformation of the predicted values (say, exp) is not exact, because the residual error in the predictions of the log-transformed response cannot be easily "re-calculated" when back-transforming the predicted values.

In short: computation of estimated marginal means takes residual variance into account, but you cannot just "exp" the error term when back-transforming log-response, resulting in "biased" back-transformed estimated marginal means. This requires "bias adjustment".

For mixed models, we have residual variance on different levels (because multilevel and so, you know...), and that's why we need some bias correction here as well if we want to take the random effects uncertainty into account for predictions. You can see the differences in this vignette when you look at the results from type = "fe" and type = "re". In ggeffects, only the uncertainty intervals are adjusted, because I did not know how to adjust the estimates (predicted values). Now reading this emmeans vignette shows how to adjust estimates as well, though it looks more like a "hack" than a "trustworthy" approach (due to the multiplication with 1000)...
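
To make the log-response point concrete with a quick sketch (made-up lognormal data, not one of the models discussed here): the model estimates the mean on the log scale, simply exponentiating that underestimates the mean on the response scale, and adding sigma^2/2 on the log scale before exponentiating is essentially the correction that the bias adjustment approximates.

set.seed(123)
y <- rlnorm(1e6, meanlog = 1, sdlog = 0.8)  # so log(y) ~ N(1, 0.8^2)

mean(y)                               # arithmetic mean on the response scale, ~3.74
exp(mean(log(y)))                     # naive back-transformation of the log-scale mean, ~2.72
exp(mean(log(y)) + sd(log(y))^2 / 2)  # "bias-adjusted" back-transformation, ~3.74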

DominiqueMakowski commented 4 years ago

All in all, it sounds like a good thing, so do you think we should set bias.adj to TRUE by default? Or should the fact that it's not the default in emmeans be a hint and raise caution?

strengejacke commented 4 years ago

sigma() is not well defined for "most" mixed models, e.g., models with more complex random-effects structures (like in the emmeans vignette example); thus I would suggest using insight::get_variance() to get the residual variance. This, however, may take some time for larger models.

Therefore, I would set bias adjustment to FALSE by default, and add a message like 'consider bias adjustment' whenever we have a transformed response or a mixed model.
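
Just to sketch what that could look like (the helper name and message wording below are made up; only insight::find_terms(), insight::is_mixed_model() and insight::get_variance_residual() are existing functions):

# Hypothetical helper -- not part of modelbased; name and wording are invented
suggest_bias_adjustment <- function(model) {
  resp <- insight::find_terms(model)$response
  # crude check for a transformed response, e.g. "log(y)" or "sqrt(y)"
  transformed <- any(grepl("(", resp, fixed = TRUE))
  if (transformed || insight::is_mixed_model(model)) {
    message(
      "Transformed response and/or random effects detected. ",
      "Consider bias adjustment, e.g. with ",
      "sigma = sqrt(insight::get_variance_residual(model))."
    )
  }
  invisible(NULL)
}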

DominiqueMakowski commented 4 years ago

Okay, and should we (re)suggest to Russell that he consider using get_variance()?

strengejacke commented 4 years ago

No, I already did (see https://cran.r-project.org/web/packages/emmeans/vignettes/predictions.html#sd-estimate). He was open to the suggestion, but I think he is hesitant because he's not certain what get_variance() internally does... I share this concern, because even I myself am not certain 😬. The initial code was written by Ben Bolker, and I asked him if I could use it in my package; I only added functionality for more packages. The main computation of variances (except for zero-inflated negative binomial and zero-inflated Poisson models) was all written by Ben.

mattansb commented 4 years ago

@strengejacke your explanation is excellent! Thanks!

strengejacke commented 4 years ago

Some references, not sure if these are useful: https://www.jstor.org/stable/2669622?seq=1 and https://www.semanticscholar.org/paper/Correction-of-Retransformation-Bias-in-Nonlinear-on-Liu-Mc/69a23b65647adfa635e18d4fafd2145ba51f2f5b

DominiqueMakowski commented 3 years ago

Any changes in the status of this?

strengejacke commented 3 years ago

no, not yet.

JWiley commented 2 years ago

If it is any help, I'm not sure 'bias' is the optimal description here. In general, mean(Y) != ginv(mean(g(Y))); as a specific example, mean(Y) != exp(mean(log(Y))). This is not really 'bias' per se --- exp(mean(log(Y))) is a meaningful and accurate quantity (the geometric mean of Y), it is just not the arithmetic mean of Y.

Anytime there are covariates (and, even without covariates, in mixed models because of the random effects), predictions are always conditional, unless you go out of your way to get population-average estimates by marginalising over the conditioning factors; see for example: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-015-0046-6
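
To illustrate the conditional-vs-marginal point with a toy simulation (arbitrary numbers, not tied to any model in this thread): for a logistic model with a normal random intercept, the prediction for the "average" cluster (random effect = 0) is not the population-average probability, and the latter is what the sigma-based bias adjustment tries to approximate.

set.seed(1)
eta_fixed <- qlogis(0.10)    # linear predictor for the "average" cluster
tau <- 1.2                   # SD of the (normal) random intercept
b <- rnorm(1e6, 0, tau)      # simulated cluster effects

plogis(eta_fixed)            # conditional / subject-specific prediction: 0.10
mean(plogis(eta_fixed + b))  # marginal / population-average prediction: ~0.14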

strengejacke commented 2 years ago

See also Ben's example comparing gee and glmer (i.e., between population-averaged and subject-specific) here: https://github.com/strengejacke/ggeffects/issues/215

sven-stodtmann commented 2 years ago

The problem of averaging on the link scale being the default in most packages has led me to implement my own routines for predicting representative marginal responses from (especially logistic regression) models. My problem with the default behavior of averaging on the link scale is best described with a practical example.

Assume we have a dataset with two groups of the same size: group A has a 50-50 chance of experiencing an outcome, group B has a near-zero chance of experiencing it. A back-of-the-envelope computation for the combined population gives ~25% experiencing that outcome. However, if we predict the marginal mean response while averaging on the link scale, we may get much, much lower values, because on the link scale 0 maps to -Infinity and 50% maps to 0, so the mean will be -Infinity ("arbitrarily small"). Applying the inverse link to that gives a prediction of 0% for the outcome in the combined population, which is obviously false.
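
In raw numbers, the back-of-the-envelope version looks roughly like this (using 0.1% instead of exactly 0 for group B so that the logit stays finite):

p <- c(A = 0.5, B = 0.001)

mean(p)                  # averaging on the response scale: ~0.25
plogis(mean(qlogis(p)))  # averaging on the link scale, then back-transforming: ~0.03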

Here is an example (overlaid with proportions of the binned observed data, see #120):

library(tidyverse)   # tibble, dplyr, purrr, ggplot2
library(ggeffects)

set.seed(1)  # seed added here only so the example is reproducible

test <- tibble(
    cat = factor(sample(c('A', 'B'), 400, replace = TRUE)),
    con = exp(rnorm(400, 0, 1))
  ) %>%
  mutate(
    lp = 0.4 * con - if_else(cat == 'A', 10, 0),
    pr = pmax(0.02, plogis(lp))
  ) %>%
  mutate(
    resp = map_int(pr, ~ rbernoulli(1, p = .x))
  )

mod <- glm(resp ~ con + cat, data = test, family = 'binomial')

# plot_logreg() is my own plotting routine (averaging on the response scale), not from a package
plot_logreg(mod, terms = c('con [n=40]'), transform_then_average = TRUE) +
  geom_line(data = tibble(ggeffects::ggemmeans(mod, terms = 'con [n=40]')),
            aes(col = 'ggeffects', x = x, y = predicted, ymin = conf.low, ymax = conf.high))

[image: plot comparing predictions from plot_logreg() and ggeffects::ggemmeans(), overlaid on binned observed proportions]

One solution is of course to separate the predictions, but sometimes we may want to see the effect of another predictor on the total population, marginalized over the two groups.

I am wondering if "bias adjustment" addresses that. I think the "regrid" function in emmeans was supposed to address this, but I found it difficult to make it work consistently.

vincentarelbundock commented 2 years ago

I don't have much to add to the discussion but was led here from the other thread, and wanted to react to:

All in all, it sounds like a good thing, so do you think we should set bias.adj to TRUE by default? Or should the fact that it's not the default in emmeans be a hint and raise caution?

I would definitely not make something the default unless we are 100% sure it is correct and we understand it very well.

mattansb commented 2 years ago

I agree.

From Russ Lenth's docs:

In an ordinary GLM, no bias adjustment is needed, nor is it appropriate, because the link function is just used to define a nonlinear relationship between the actual response mean η and the linear predictor. That is, the back-transformed parameter is already the mean. [...] Note that, in a generalized linear mixed model, including generalized estimating equations and such, there are additive random components involved, and then bias adjustment becomes appropriate.

So it is only appropriate with transformed outcomes or with generalized mixed models (but not with linear mixed models or with generalized fixed-effects-only models).

strengejacke commented 2 years ago

For linear mixed models it is required as well, I'd say, because you want to adjust for the RE variance as well. See my answer here: https://github.com/easystats/modelbased/issues/57#issuecomment-632588471

mattansb commented 2 years ago

Accounting for the random effect's uncertainty in a linear model does not affect the prediction (only its uncertainty). And so that is not a bias adjustment.

library(lme4)
#> Loading required package: Matrix
library(emmeans)
#> Warning: package 'emmeans' was built under R version 4.1.3

m_mixed <- lmer(Reaction ~ Days + (1|Subject),
                 data = sleepstudy)

emmeans(m_mixed, ~ Days, cov.red = range)
#>  Days emmean   SE   df lower.CL upper.CL
#>     0    251 9.75 22.8      231      272
#>     9    346 9.75 22.8      325      366
#> 
#> Degrees-of-freedom method: kenward-roger 
#> Confidence level used: 0.95

# Same predictions (also CIs...)
Xrtra_disp <- sqrt(insight::get_variance(m_mixed)[["var.intercept"]])
emmeans(m_mixed, ~ Days, cov.red = range, bias.adjust = TRUE, sigma = Xrtra_disp)
#>  Days emmean   SE   df lower.CL upper.CL
#>     0    251 9.75 22.8      231      272
#>     9    346 9.75 22.8      325      366
#> 
#> Degrees-of-freedom method: kenward-roger 
#> Confidence level used: 0.95

Created on 2022-06-25 by the reprex package (v2.0.1)

In a generalized linear mixed model, you do need to bias-adjust in order to account for the fact that the inverse link evaluated at the mean of the random effects is not the population mean on the response scale (inv.link(mean(eta)) != E[mu]):

library(lme4)
#> Loading required package: Matrix
library(emmeans)
#> Warning: package 'emmeans' was built under R version 4.1.3

m_fixed <- glm(cbind(incidence, size - incidence) ~ period,
               family = binomial, data = cbpp)

cbpp <- transform(cbpp, unit = 1:nrow(cbpp))
m_mixed <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd) + (1|unit),
                 family = binomial, data = cbpp)

emmeans(m_mixed, ~ period, regrid = "response")
#>  period   prob     SE  df asymp.LCL asymp.UCL
#>  1      0.1824 0.0442 Inf  0.095664    0.2691
#>  2      0.0614 0.0230 Inf  0.016308    0.1065
#>  3      0.0558 0.0220 Inf  0.012628    0.0989
#>  4      0.0334 0.0172 Inf -0.000374    0.0671
#> 
#> Confidence level used: 0.95

Xrtra_disp <- sqrt(sum(insight::get_variance(m_mixed)[["var.intercept"]]))
emmeans(m_mixed, ~ period, regrid = "response", bias.adjust = TRUE, sigma = Xrtra_disp)
#>  period   prob     SE  df asymp.LCL asymp.UCL
#>  1      0.2216 0.0462 Inf  0.131094    0.3121
#>  2      0.0823 0.0292 Inf  0.025024    0.1397
#>  3      0.0751 0.0282 Inf  0.019779    0.1305
#>  4      0.0458 0.0230 Inf  0.000822    0.0908
#> 
#> Confidence level used: 0.95

# These (true unbiased predictions) are closer to the bias adjusted predictions
emmeans(m_fixed, ~ period, regrid = "response")
#>  period   prob     SE  df asymp.LCL asymp.UCL
#>  1      0.2194 0.0248 Inf    0.1708    0.2681
#>  2      0.0802 0.0187 Inf    0.0436    0.1167
#>  3      0.0711 0.0183 Inf    0.0352    0.1069
#>  4      0.0452 0.0167 Inf    0.0125    0.0779
#> 
#> Confidence level used: 0.95

Created on 2022-06-25 by the reprex package (v2.0.1)

rvlenth commented 2 years ago

Maybe the answer I contributed to this question in the CrossValidated forum will address some of these concerns -- or maybe not...