JWiley / brmsmargins

Average marginal estimates for Bayesian models fit with the brms package
https://joshuawiley.com/brmsmargins
Other
20 stars 2 forks source link

Question about difference between #9

Open isabellaghement opened 2 years ago

isabellaghement commented 2 years ago

Dear Joshua,

Thank you very much for developing the brmsmargins package. I am trying to understand a paragraph included in the help file for the brmsmargins() function and would appreciate any clarifications you are able to provide. The paragraph in question is as follows:

"Use the fitted() function to generate predictions based on this modified dataset. If effects is set to “fixedonly” (meaning only generate predictions using fixed effects) or to “includeRE” (meaning generate predictions using fixed and random effects), then predictions are generated entirely using the fitted() function and are, typically back transformed to the response scale. For mixed effects models with fixed and random effects where effects is set to “integrateoutRE”, then fitted() is only used to generate predictions using the fixed effects on the linear scale. For each prediction generated, the random effects are integrated out by drawing k random samples from the model assumed random effect(s) distribution. These are added to the fixed effects predictions, back transformed, and then averaged over all k random samples to perform numerical Monte Carlo integration."

Question 1

Can you confirm whether the last part of the above paragraph, which starts with "For each prediction generated, the random effects are integrated out", applies to each of the settings effects = "fixedonly”, effects = "includeRE" and effects =“integrateoutRE”? Or does it apply only to the effects =“integrateoutRE” setting?

Question 2

Can you confirm whether both effects = "fixedonly” and effects =“integrateoutRE” ignore the random effects (i.e., they set them to 0), with the difference between them being that effects = "fixedonly” returns the results on the natural response scale while effects =“integrateoutRE” on the link scale? (For example, for a mixed effects binary regression model, effects = "fixedonly” returns fitted probabilities whereas effects =“integrateoutRE” returns fitted log odds.)

Question 3

If both effects = "fixedonly” and effects =“integrateoutRE” set the random effects in the fitted part of the model to 0, I do not understand why we would want to "integrate out the random effects" for these two settings. The only setting for which it would make sense to "integrate out the random effects" would be effects = "includeRE" (since that is the only setting where the random effects would NOT be set to 0; unless, perhaps, there are multiple random effects in the model and only some of those would be set to 0?). I don't see a re_formula as an argument to brmsmargins() which would be used to indicate which random effects are set to 0 and which are not, so I am not sure how one would even specify which random effects are set to 0 and which are kept in the model?

Question 4

When would one want to use effects = "includeRE" versus effects =“integrateoutRE”?

Question 5

Getting back to the example of a mixed effects binary logistic regression model (with just a random intercept u for simplicity and 2 predictors, X1 and X2), if I needed to compute something like Prob(Y = 1|X1, X2, u) for X1 and X2 set at 2 relevant values x1 and x2 and u averaged across the observed random intercept values, would there be a way to get this out of brmsmargins? If I then needed to compute Prob(Y = 1|X1, X2, u) but averaged across random realizations of u from the assumed distribution of the random intercepts - say, Normal(0, sigma_u^2) with sigma_u^2 estimated from the data - would there be a way to compute this via brmsmargins? What does the latter estimate? What does the former estimate? Why not use the former instead of the latter (e.g., because we can get a better approximation if we use more random effects than what was observed in the data)?

JWiley commented 2 years ago

Answer 1.

The random effects are only integrated out when effect = "integrateoutRE".

Answer 2.

Both back transformed to the response scale.

Answer 3.

They don't both set the REs to 0, see Answer 2. All REs are integrated out when effects = "integrateoutRE" and none are when effects ="fixedonly"

Answer 4

Answer 5

To get Prob(Y = 1|X1, X2, u) where you have averaged across random realizations of u from the assumed distribution (btw this = integrating over u) use: effects = "integrateoutRE"

I don't have a particularly convenient way to get that estimate for only the "observed" u random intercept values, but you could generate all the predictions using effects = "includeRE" and then average yourself. Incidentally, u is never really observed. If you want estimates of the average for the population on response scale, you want to average over the assumed distribution.

Extra

Not an answer to any specific question, but perhaps relevant here: my only real interest case in writing brmsmargins was/is integrating out REs. Generally I am interested in an estimate and uncertainty around the population average marginal effect. IMHO the marginal effect for an average person is a fine but odd quantity. Many tools exist for single level models. What I found largely lacking was the ability to generate population average predictions and pop average marginal effects from mixed effects models with non linear transforms / links. So that's really what I'm trying to solve here. effects = "fixedonly" and effects = "includeRE" are really only there for testing and because I mean with everything else written why not include those.

isabellaghement commented 2 years ago

Dear Joshua,

Thank you so very much for your clear and informative answers. They are really helpful!

I still have some difficulties understanding your Answer 4. Maybe part of it is because I have bits of information floating through my brain about “broad” and “narrow” inference in the context of random effects models and I am trying to connect your answer to that.

To quote the SAS help website, “in the narrow inference space, conclusions are drawn about the particular values of the random effects selected in the study.”

I guess one could interpret this statement in several ways: 1. We can draw conclusion about a specific random effect (e.g., class # 1 in your example with 20 classes); 2. We can draw conclusions about all random effects in the study (e.g., what is the average effect of some predictor X on the scale of interest across the 20 classes included in the study).

To make this concrete, if our study includes 20 classes (each with multiple student) and Y = student score on a test (measured once for each student; continuous) and X = study time (measured once for each student), then 1. would allow us to determine the effect of X on Y in class # 1. In contrast, 2. would allow us to determine the average effect of X on Y across just the 20 classes included in the study.

What is not clear to me is whether brmsmargins() with type = “effectsRE” computes the effect of X on Y described in 1. or in 2.? Or does it compute the average value of Y (given X) across all the students in class # 1 (say) for 1.? Or does it compute the average value of Y (given X) across all the students in each of the 20 classes and then averages that value across the 20 classes? As you can see, I am really lost in the weeds here (and I haven’t even mentioned the possibility that it may predict the value of Y for a given X for a randomly selected student in class # 1., etc.)

For “broad space inferences”, the 20 classes in your example would be viewed as being representative of a larger universe of classes about which we wish to make inferences. I think here it is clearer to me that we could average a quantity - whether it is an effect on the log odds scale or a probability in the context of a binary mixed effects model - across the entire distribution of random effects. Then that would allow us to compute things like a “marginal coefficient in the broad inference sense” and a “marginal effect in the broad inference sense”, with the “broad sense” taken to mean “on average across all the schools represented by the 20 schools included in our study”.

In my mind, I don’t see any reason why we wouldn’t be able to restrict this averaging to just the distribution of “predicted” random effects in our study (e.g., restrict the averaging to just the 20 schools, when Y = whether or not the student passed the test). This would then allow us to get “marginal coefficients in a narrow sense” and “marginal effects in a narrow sense”, with the “narrow sense” taken to mean “on average across just the 20 schools in our study”.

Am I totally off base here to suggest marginal effects (and marginal coefficients) can be defined either in a “narrow sense” or in a “broad sense”? I have not seen any explicit discussion of this in the literature but in practice it would be helpful to draw such a distinction (if the distinction is warranted).

As a practitioner, I would like the ability to compute both of these types of marginal effects/coefficients (but just don’t have the programming skills to pull it off on my own, unfortunately).

I looked into the marginaleffects package and I can’t tell whether it only does type = “includeRE” (which is my current understanding based on its vignette which compares marginaleffects with brmsmargins). Since I don’t yet fully understand how “includeRE” works in brmsmargins, I struggle to understand what marginaleffects does with the same setting.

Ultimately, I would love it if I could find a way to compute marginal effects and marginal probabilities in a model fitted with the ordbetareg() function in the ordbetareg package. This function is built on top of brms and fits a model to zero-and-one inflated proportions; it fits a single model component at a time, which makes it easier to work with. But it fits effects on a latent scale which is not intuitive, so marginal effects are a must.

Thank you very much for the Extra bonus. That mirrors exactly what I feel except that I don’t have the computational tools to pull off the most sensible quantities from my model (which is frustrating). This is further compounded by not always understanding what else can be computed instead given the currently available R options.

I would very much appreciate any insights/clarifications you can throw my way. These things have been niggling at me for a long time and each time I understand one more nuance, I can start obsessing about some other elusive aspect of statistical practice. 😁

With huge gratitude,

Isabella

vincentarelbundock commented 2 years ago

Thanks again @JWiley for taking so much time to write this package and answer our newbie questions.

I’m still trying to grok the differences between values of the effect argument when computing AMEs. If I understand correctly:

  1. fixedonly makes predictions using re_formula = NA, takes differences, and averages.
  2. includeRE makes predictions using re_formula = NULL, takes differences, and averages.
  3. integrateoutRE draws REs from the population distribution, makes predictions, takes differences, and averages.

Below, I replicate the results produced by brmsmargins exactly under options fixedonly and includeRE. My question:

Do you think it is possible to achieve something like integrateoutRE by differencing and averaging the output of this call?

posterior_epred(model, allow_new_levels=TRUE, sample_new_levels="gaussian")

These two arguments do something which, to me, sounds quite similar to what you describe here:

“sample new levels from the (multivariate) normal distribution implied by the group-level standard deviations and correlations. This options may be useful for conducting Bayesian power analysis or predicting new levels in situations where relatively few levels where observed in the old_data.” [brms docs]

Below, I show an example with raw brms code that matches brmsmargins(effects="integrateoutRE") results very closely (but not exactly).

Simulate data and fit model

library(brms)
library(collapse)
library(data.table)
library(brmsmargins)

h <- .0001

d <- withr::with_seed(
  seed = 12345, code = {
    nGroups <- 100
    nObs <- 20
    theta.location <- matrix(rnorm(nGroups * 2), nrow = nGroups, ncol = 2)
    theta.location[, 1] <- theta.location[, 1] - mean(theta.location[, 1])
    theta.location[, 2] <- theta.location[, 2] - mean(theta.location[, 2])
    theta.location[, 1] <- theta.location[, 1] / sd(theta.location[, 1])
    theta.location[, 2] <- theta.location[, 2] / sd(theta.location[, 2])
    theta.location <- theta.location %*% chol(matrix(c(1.5, -.25, -.25, .5^2), 2))
    theta.location[, 1] <- theta.location[, 1] - 2.5
    theta.location[, 2] <- theta.location[, 2] + 1
    d <- data.table(
      x = rep(rep(0:1, each = nObs / 2), times = nGroups))
    d[, ID := rep(seq_len(nGroups), each = nObs)]

    for (i in seq_len(nGroups)) {
      d[ID == i, y := rbinom(
        n = nObs,
        size = 1,
        prob = plogis(theta.location[i, 1] + theta.location[i, 2] * x))
        ]
    }
    copy(d)
  })

mlogit <- brms::brm(
  y ~ 1 + x + (1 + x | ID), family = "bernoulli",
  data = d, seed = 1234,
  silent = 2, refresh = 0,
  chains = 4L, cores = 4L, backend = "cmdstanr")

3 slightly different predict() functions

The trick in predict_integrateoutRE is that we replace the ID variable by unobserved levels and call posterior_epred with the allow_new_levels=TRUE argument. This makes predictions with REs drawn from the population distribution.

predict_fixedonly <- function(newdata) {
    brms::posterior_epred(
        mlogit,
        newdata = newdata,
        re_formula = NA)
}

predict_includeRE <- function(newdata) {
    brms::posterior_epred(
        mlogit,
        newdata = newdata)
}

predict_integrateoutRE <- function(newdata) {
    # replace known IDs by unknown IDs to draw from the population distribution
    oldID <- unique(newdata$ID)
    newID <- sample(1e6:2e6, length(oldID))
    newdata$ID <- newID[match(newdata$ID, oldID)]
    brms::posterior_epred(
        mlogit,
        allow_new_levels = TRUE,
        sample_new_levels = "gaussian",
        newdata = newdata)
}

Average marginal effects

This function calculates the median of the posterior distribution of average marginal effects. FUN returns a matrix of predictions. Then, we use the collapse package to take differences between predictions and to normalize by dividing by h. Finally, we take the median of AME draws.

ame <- function(newdata, FUN) {
    p <- FUN(newdata)
    p <- t(p)
    # use {collapse} to get group-by column-wise (yhat_{x+h} - yhat_{x}) / h
    mfx <- collapse::BY(
        p,
        g = nd[, c("ID", "x_baseline_01")],
        FUN = \(x) diff(x) / h)
    # average marginal effect
    ame <- colMeans(mfx)
    # median of posterior distribution of AME
    median(ame)
}

nd <- expand.grid(x = c(0, h, 1, 1 + h), ID = 1:100)
nd$x_baseline_01 <- nd$x >= 1

fixedonly: Exact replication

ame(nd, predict_fixedonly)
#> [1] 0.1044999

brmsmargins(
  mlogit,
  add = data.frame(x = c(0, h)),
  contrasts = cbind("AME x" = c(-1 / h, 1 / h)),
  effects = "fixedonly", k = 100L, seed = 1234)$ContrastSummary$Mdn
#> [1] 0.1044999

includeRE: Exact replication

ame(nd, predict_includeRE)
#> [1] 0.1116809

brmsmargins(
  mlogit,
  add = data.frame(x = c(0, h)),
  contrasts = cbind("AME x" = c(-1 / h, 1 / h)),
  effects = "includeRE", k = 100L, seed = 1234)$ContrastSummary$Mdn
#> [1] 0.1116809

integrateoutRE: Close replication

Here, the replication is not exact, perhaps because random draws from the gaussian are not the same.

ame(nd, predict_integrateoutRE)
#> [1] 0.1110594

brmsmargins(
  mlogit,
  add = data.frame(x = c(0, h)),
  contrasts = cbind("AME x" = c(-1 / h, 1 / h)),
  effects = "integrateoutRE", k = 100L, seed = 1234)$ContrastSummary$Mdn
#> [1] 0.1112742
JWiley commented 2 years ago

@isabellaghement

I do not think it is off base or unreasonable to consider narrow or broad inference. I have not given the narrow sense a lot of consideration as that has not been a focus in my own research.

I would think you could do the narrow inference in marginaleffects by using the REs in predictions and specifying, for the input dataset, one specific actually observed ID for the REs. So rather than average across many IDs, only use one.

JWiley commented 2 years ago

@vincentarelbundock Intriguing. I had not seen the sample new levels. Yes that sounds on target. Agreed you'd not have identical replication due to random sampling. Can you control the number of levels? the k = 100L is specifying 100 draws from the assumed RE distribution. As that goes up, you ought to converge on the theoretical integral.

How do the credible intervals look?

vincentarelbundock commented 2 years ago

Yep, it’s easy to increase the number of draws, and the CI look basically the same. However, posterior_epred is very slow, so your implementation is still a lot faster.

Here are three ultra-minimal examples.

library(brms)
library(brmsmargins)

mod <- brm(am ~ vs + (1 | gear), data = mtcars, family = "bernoulli", backend = "cmdstanr")

fixedonly

brmsmargins(
  object = mod,
  at = data.frame(vs = c(0, 1)),
  CI = .95, CIType = "ETI",
  contrasts = cbind("AME vs" = c(-1, 1)),
  effects = "fixedonly")$ContrastSummary |>
  data.frame()
#>            M        Mdn        LL        UL PercentROPE PercentMID   CI CIType
#> 1 -0.4413672 -0.4098166 -0.999893 0.1445257          NA         NA 0.95    ETI
#>   ROPE  MID  Label
#> 1 <NA> <NA> AME vs

nd <- expand.grid(vs = 0:1, gear = 1)
nd$H <- ifelse(nd$vs == 0, -1, 1)
p <- brms::posterior_epred(
  mod,
  re_formula = NA,
  newdata = nd)
quantile(nd$H %*% t(p), probs = c(.5, .025, .975))
#>        50%       2.5%      97.5% 
#> -0.4098166 -0.9998930  0.1445257

includeRE

brmsmargins(
  object = mod,
  at = data.frame(vs = c(0, 1)),
  CI = .95, CIType = "ETI",
  contrasts = cbind("AME vs" = c(-1, 1)),
  effects = "includeRE")$ContrastSummary |>
  data.frame()
#>            M        Mdn         LL         UL PercentROPE PercentMID   CI
#> 1 -0.1321468 -0.1385877 -0.3117927 0.09439569          NA         NA 0.95
#>   CIType ROPE  MID  Label
#> 1    ETI <NA> <NA> AME vs

nd <- rbind(
  transform(mtcars, vs = 0),
  transform(mtcars, vs = 1))
K <- nrow(mtcars)
nd$H <- ifelse(nd$vs == 0, -1 / K, 1 / K)
p <- brms::posterior_epred(
  mod,
  newdata = nd)
quantile(nd$H %*% t(p), probs = c(.5, .025, .975))
#>         50%        2.5%       97.5% 
#> -0.13858772 -0.31179267  0.09439569

integrateoutRE

K <- 1000

brmsmargins(
  k = K,
  object = mod,
  at = data.frame(vs = c(0, 1)),
  CI = .95, CIType = "ETI",
  contrasts = cbind("AME vs" = c(-1, 1)),
  effects = "integrateoutRE")$ContrastSummary |>
  data.frame()
#>            M        Mdn         LL        UL PercentROPE PercentMID   CI CIType
#> 1 -0.2168032 -0.1879975 -0.6459883 0.0722646          NA         NA 0.95    ETI
#>   ROPE  MID  Label
#> 1 <NA> <NA> AME vs

nd <- expand.grid(
  vs = 0:1,
  gear = sample(1e6:2e6, K))
nd$H <- ifelse(nd$vs == 0, -1 / K, 1 / K)
p <- brms::posterior_epred(
  mod,
  allow_new_levels = TRUE,
  sample_new_levels = "gaussian",
  newdata = nd)
quantile(nd$H %*% t(p), probs = c(.5, .025, .975))
#>         50%        2.5%       97.5% 
#> -0.18651311 -0.63625547  0.07199052
vincentarelbundock commented 2 years ago

Note on speed: it is the sample_new_levels="gaussian" option which slows things down. Setting this argument to "uncertainty" or "old_levels" (probably fine with many clusters) is very fast.

Edit: slow brms code is probably here: https://github.com/paul-buerkner/brms/blob/941057220d34e22db4ac9bd786cef557cedd1c1a/R/prepare_predictions.R#L1040

Edit 2: And in particular these two lines are very slow, where rmulti_normal calls chol() repeatedly:

vincentarelbundock commented 2 years ago

So sorry for spamming your Github (I should find a different venue)!! But here’s a comparison of the 3 different kinds of “new levels”. In this particular example, “gaussian” comes very close to matching your results, but the other ones are not too far off and are somewhat faster. Using tictoc to give a sense of speed.

library(brms)
library(brmsmargins)
library(tictoc)

dat <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/plm/EmplUK.csv")
dat$x <- as.numeric(dat$output > median(dat$output))
dat$y <- as.numeric(dat$emp > median(dat$emp))

mod <- brm(y ~ x + (1 | firm), data = dat, backend = "cmdstanr", family = "bernoulli")
K <- 1000

tic()
bm <- brmsmargins(
  k = K,
  object = mod,
  at = data.frame(x = c(0, 1)),
  CI = .95, CIType = "ETI",
  contrasts = cbind("AME x" = c(-1, 1)),
  effects = "integrateoutRE")
bm$ContrastSummary |> data.frame()
#>            M        Mdn         LL        UL PercentROPE PercentMID   CI CIType
#> 1 0.09660978 0.09525802 0.05968685 0.1405216          NA         NA 0.95    ETI
#>   ROPE  MID Label
#> 1 <NA> <NA> AME x
toc()
#> 36.21 sec elapsed

manual <- function(method = "old_levels") {
  nd <- expand.grid(
    x = 0:1,
    firm = sample(1e6:2e6, K))
  nd$H <- ifelse(nd$x == 0, -1 / K, 1 / K)
  p <- posterior_epred(
    mod,
    allow_new_levels = TRUE,
    sample_new_levels = method,
    newdata = nd)
  quantile(nd$H %*% t(p), probs = c(.5, .025, .975))
}

tic()
manual("gaussian")
#>        50%       2.5%      97.5% 
#> 0.09539001 0.05998184 0.13916398
toc()
#> 30.687 sec elapsed

tic()
manual("old_levels")
#>        50%       2.5%      97.5% 
#> 0.10122629 0.08244516 0.12020882
toc()
#> 0.924 sec elapsed

tic()
manual("uncertainty")
#>        50%       2.5%      97.5% 
#> 0.10255900 0.08044943 0.12789991
toc()
#> 2.822 sec elapsed
JWiley commented 2 years ago

@vincentarelbundock Very cool. I've no problems with these discussions continuing here on GitHub. This is looking great. I haven't looked enough through the rest of the brms framework, but indeed the sampling from the RE distribution seems equivalent. Presumably, this would make it easy to support multivariate outcomes as well which I know has been a request of several people?

From what I understand of the "old_levels" approach, I'm not sure this has any guarantee to converge as the number of levels in the RE gets large.

Generally because there are also many posterior draws getting averaged over, I doubt that a very high K is needed, so outside of large data problems, I don't think the longer time for the method = "gaussian" is an issue and comes with certainty around bias & coverage that I think make it strongly preferable if the quantity of interest pertains to the average marginal effect in the sampled population.

I assume you'll add support now in marginaleffects?

vincentarelbundock commented 2 years ago

this would make it easy to support multivariate outcomes as well which I know has been a request of several people?

Yes, that would seem to be the main benefit for you. Should be mostly plug-and-play. Cool!

From what I understand of the “old_levels” approach, I’m not sure this has any guarantee to converge as the number of levels in the RE gets large.

Very good to know, thanks!

Generally because there are also many posterior draws getting averaged over, I doubt that a very high K is needed, so outside of large data problems, I don’t think the longer time for the method = “gaussian” is an issue and comes with certainty around bias & coverage that I think make it strongly preferable if the quantity of interest pertains to the average marginal effect in the sampled population.

That makes intuitive sense. It might eventually be fun to do some more serious benchmarking. I can’t replicate it now, but I ran into a case earlier where the posterior_epred approach was much, much slower than brmsmargins. Probably worth keeping an eye on…

I assume you’ll add support now in marginaleffects?

Yes, the dev version from Github currently has a prototype. I have not tested at all, but it generates output:

library(brms)
library(marginaleffects)

dat <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/plm/EmplUK.csv")
dat$x <- as.numeric(dat$output > median(dat$output))
dat$y <- round(dat$emp)
dat$y <- ifelse(dat$y > 5, 5, dat$y)

mod <- brm(
  y ~ x + (1 | firm),
  data = dat,
  family = "categorical")
comparisons(
  mod,
  newdata = datagrid(firm = -100:-1),
  allow_new_levels = TRUE,
  sample_new_levels = "gaussian") |>
  summary()
#>   Group Term Contrast   Effect     2.5 %   97.5 %
#> 1     0    x    1 - 0 -0.03499 -0.078327 -0.01163
#> 2     1    x    1 - 0 -0.08354 -0.136475 -0.04241
#> 3     2    x    1 - 0 -0.02185 -0.063338  0.01209
#> 4     3    x    1 - 0  0.04863  0.018110  0.09265
#> 5     4    x    1 - 0  0.03570  0.007809  0.08047
#> 6     5    x    1 - 0  0.05366  0.018884  0.11054
#> 
#> Model type:  brmsfit 
#> Prediction type:  response
JWiley commented 2 years ago

@vincentarelbundock one more question on your implementation --- how are you handling AMEs for continuous variables? For 100 new levels to be sampled, the entire raw dataset would get expanded 100x?

vincentarelbundock commented 2 years ago

Users can do whatever they want/need.

By default, the datagrid() function holds all the variables which are not explicitly specified at their means or modes. In the code above, we had:

newdata = datagrid(firm = -100:-1)

This means that we make predictions only for a dataset with 100 rows (one per RE level) and all covariates at their means or modes. The result is a data frame with 500 rows: 100 rows per unit with 5 response levels.

You can use the datagridcf() function ("cf" for "counterfactual") to replicate the full dataset 100x:

newdata = datagridcf(firm = -100:-1)

You can also specify for what kind of "hypothetical" individual(s) you want to compute marginaleffects:

newdata = datagrid(x = c(10, 100), z = "dingus", firm = -100:-1)

Then, the summary() function will aggregate by response level. You can also take averages by groups of predictors by adding the by argument in the main comparisons() call.

Does that answer the question?