Open isabellaghement opened 2 years ago
Answer 1.
The random effects are only integrated out when effect = "integrateoutRE"
.
Answer 2.
effects = "fixedonly"
will set all REs = 0. So this gives you estimates at the means, if you will for an 'average' personBoth back transformed to the response scale.
Answer 3.
They don't both set the REs to 0, see Answer 2. All REs are integrated out when effects = "integrateoutRE"
and none are when effects ="fixedonly"
Answer 4
effects = "includeRE"
when you want the REs included for predictions. This is appropriate if you want to make inference about specific groups. So for example students are measured from 20 classrooms. REs are by classroom ID. If you want estimates for those 20 individual classrooms, then use effects = "includeRE"
.Answer 5
To get Prob(Y = 1|X1, X2, u) where you have averaged across random realizations of u from the assumed distribution (btw this = integrating over u) use:
effects = "integrateoutRE"
I don't have a particularly convenient way to get that estimate for only the "observed" u random intercept values, but you could generate all the predictions using effects = "includeRE"
and then average yourself. Incidentally, u is never really observed. If you want estimates of the average for the population on response scale, you want to average over the assumed distribution.
Extra
Not an answer to any specific question, but perhaps relevant here: my only real interest case in writing brmsmargins
was/is integrating out REs. Generally I am interested in an estimate and uncertainty around the population average marginal effect.
IMHO the marginal effect for an average person is a fine but odd quantity. Many tools exist for single level models. What I found largely lacking was the ability to generate population average predictions and pop average marginal effects from mixed effects models with non linear transforms / links. So that's really what I'm trying to solve here. effects = "fixedonly"
and effects = "includeRE"
are really only there for testing and because I mean with everything else written why not include those.
Dear Joshua,
Thank you so very much for your clear and informative answers. They are really helpful!
I still have some difficulties understanding your Answer 4. Maybe part of it is because I have bits of information floating through my brain about “broad” and “narrow” inference in the context of random effects models and I am trying to connect your answer to that.
To quote the SAS help website, “in the narrow inference space, conclusions are drawn about the particular values of the random effects selected in the study.”
I guess one could interpret this statement in several ways: 1. We can draw conclusion about a specific random effect (e.g., class # 1 in your example with 20 classes); 2. We can draw conclusions about all random effects in the study (e.g., what is the average effect of some predictor X on the scale of interest across the 20 classes included in the study).
To make this concrete, if our study includes 20 classes (each with multiple student) and Y = student score on a test (measured once for each student; continuous) and X = study time (measured once for each student), then 1. would allow us to determine the effect of X on Y in class # 1. In contrast, 2. would allow us to determine the average effect of X on Y across just the 20 classes included in the study.
What is not clear to me is whether brmsmargins() with type = “effectsRE” computes the effect of X on Y described in 1. or in 2.? Or does it compute the average value of Y (given X) across all the students in class # 1 (say) for 1.? Or does it compute the average value of Y (given X) across all the students in each of the 20 classes and then averages that value across the 20 classes? As you can see, I am really lost in the weeds here (and I haven’t even mentioned the possibility that it may predict the value of Y for a given X for a randomly selected student in class # 1., etc.)
For “broad space inferences”, the 20 classes in your example would be viewed as being representative of a larger universe of classes about which we wish to make inferences. I think here it is clearer to me that we could average a quantity - whether it is an effect on the log odds scale or a probability in the context of a binary mixed effects model - across the entire distribution of random effects. Then that would allow us to compute things like a “marginal coefficient in the broad inference sense” and a “marginal effect in the broad inference sense”, with the “broad sense” taken to mean “on average across all the schools represented by the 20 schools included in our study”.
In my mind, I don’t see any reason why we wouldn’t be able to restrict this averaging to just the distribution of “predicted” random effects in our study (e.g., restrict the averaging to just the 20 schools, when Y = whether or not the student passed the test). This would then allow us to get “marginal coefficients in a narrow sense” and “marginal effects in a narrow sense”, with the “narrow sense” taken to mean “on average across just the 20 schools in our study”.
Am I totally off base here to suggest marginal effects (and marginal coefficients) can be defined either in a “narrow sense” or in a “broad sense”? I have not seen any explicit discussion of this in the literature but in practice it would be helpful to draw such a distinction (if the distinction is warranted).
As a practitioner, I would like the ability to compute both of these types of marginal effects/coefficients (but just don’t have the programming skills to pull it off on my own, unfortunately).
I looked into the marginaleffects package and I can’t tell whether it only does type = “includeRE” (which is my current understanding based on its vignette which compares marginaleffects with brmsmargins). Since I don’t yet fully understand how “includeRE” works in brmsmargins, I struggle to understand what marginaleffects does with the same setting.
Ultimately, I would love it if I could find a way to compute marginal effects and marginal probabilities in a model fitted with the ordbetareg() function in the ordbetareg package. This function is built on top of brms and fits a model to zero-and-one inflated proportions; it fits a single model component at a time, which makes it easier to work with. But it fits effects on a latent scale which is not intuitive, so marginal effects are a must.
Thank you very much for the Extra bonus. That mirrors exactly what I feel except that I don’t have the computational tools to pull off the most sensible quantities from my model (which is frustrating). This is further compounded by not always understanding what else can be computed instead given the currently available R options.
I would very much appreciate any insights/clarifications you can throw my way. These things have been niggling at me for a long time and each time I understand one more nuance, I can start obsessing about some other elusive aspect of statistical practice. 😁
With huge gratitude,
Isabella
Thanks again @JWiley for taking so much time to write this package and answer our newbie questions.
I’m still trying to grok the differences between values of the effect
argument when computing AMEs. If I understand correctly:
fixedonly
makes predictions using re_formula = NA
, takes differences, and averages.includeRE
makes predictions using re_formula = NULL
, takes differences, and averages.integrateoutRE
draws REs from the population distribution, makes predictions, takes differences, and averages.Below, I replicate the results produced by brmsmargins
exactly under options fixedonly
and includeRE
. My question:
Do you think it is possible to achieve something like integrateoutRE
by differencing and averaging the output of this call?
posterior_epred(model, allow_new_levels=TRUE, sample_new_levels="gaussian")
These two arguments do something which, to me, sounds quite similar to what you describe here:
“sample new levels from the (multivariate) normal distribution implied by the group-level standard deviations and correlations. This options may be useful for conducting Bayesian power analysis or predicting new levels in situations where relatively few levels where observed in the old_data.” [brms docs]
Below, I show an example with raw brms
code that matches brmsmargins(effects="integrateoutRE")
results very closely (but not exactly).
library(brms)
library(collapse)
library(data.table)
library(brmsmargins)
h <- .0001
d <- withr::with_seed(
seed = 12345, code = {
nGroups <- 100
nObs <- 20
theta.location <- matrix(rnorm(nGroups * 2), nrow = nGroups, ncol = 2)
theta.location[, 1] <- theta.location[, 1] - mean(theta.location[, 1])
theta.location[, 2] <- theta.location[, 2] - mean(theta.location[, 2])
theta.location[, 1] <- theta.location[, 1] / sd(theta.location[, 1])
theta.location[, 2] <- theta.location[, 2] / sd(theta.location[, 2])
theta.location <- theta.location %*% chol(matrix(c(1.5, -.25, -.25, .5^2), 2))
theta.location[, 1] <- theta.location[, 1] - 2.5
theta.location[, 2] <- theta.location[, 2] + 1
d <- data.table(
x = rep(rep(0:1, each = nObs / 2), times = nGroups))
d[, ID := rep(seq_len(nGroups), each = nObs)]
for (i in seq_len(nGroups)) {
d[ID == i, y := rbinom(
n = nObs,
size = 1,
prob = plogis(theta.location[i, 1] + theta.location[i, 2] * x))
]
}
copy(d)
})
mlogit <- brms::brm(
y ~ 1 + x + (1 + x | ID), family = "bernoulli",
data = d, seed = 1234,
silent = 2, refresh = 0,
chains = 4L, cores = 4L, backend = "cmdstanr")
predict()
functionsThe trick in predict_integrateoutRE
is that we replace the ID
variable by unobserved levels and call posterior_epred
with the allow_new_levels=TRUE
argument. This makes predictions with REs drawn from the population distribution.
predict_fixedonly <- function(newdata) {
brms::posterior_epred(
mlogit,
newdata = newdata,
re_formula = NA)
}
predict_includeRE <- function(newdata) {
brms::posterior_epred(
mlogit,
newdata = newdata)
}
predict_integrateoutRE <- function(newdata) {
# replace known IDs by unknown IDs to draw from the population distribution
oldID <- unique(newdata$ID)
newID <- sample(1e6:2e6, length(oldID))
newdata$ID <- newID[match(newdata$ID, oldID)]
brms::posterior_epred(
mlogit,
allow_new_levels = TRUE,
sample_new_levels = "gaussian",
newdata = newdata)
}
This function calculates the median of the posterior distribution of average marginal effects. FUN
returns a matrix of predictions. Then, we use the collapse
package to take differences between predictions and to normalize by dividing by h
. Finally, we take the median of AME draws.
ame <- function(newdata, FUN) {
p <- FUN(newdata)
p <- t(p)
# use {collapse} to get group-by column-wise (yhat_{x+h} - yhat_{x}) / h
mfx <- collapse::BY(
p,
g = nd[, c("ID", "x_baseline_01")],
FUN = \(x) diff(x) / h)
# average marginal effect
ame <- colMeans(mfx)
# median of posterior distribution of AME
median(ame)
}
nd <- expand.grid(x = c(0, h, 1, 1 + h), ID = 1:100)
nd$x_baseline_01 <- nd$x >= 1
fixedonly
: Exact replicationame(nd, predict_fixedonly)
#> [1] 0.1044999
brmsmargins(
mlogit,
add = data.frame(x = c(0, h)),
contrasts = cbind("AME x" = c(-1 / h, 1 / h)),
effects = "fixedonly", k = 100L, seed = 1234)$ContrastSummary$Mdn
#> [1] 0.1044999
includeRE
: Exact replicationame(nd, predict_includeRE)
#> [1] 0.1116809
brmsmargins(
mlogit,
add = data.frame(x = c(0, h)),
contrasts = cbind("AME x" = c(-1 / h, 1 / h)),
effects = "includeRE", k = 100L, seed = 1234)$ContrastSummary$Mdn
#> [1] 0.1116809
integrateoutRE
: Close replicationHere, the replication is not exact, perhaps because random draws from the gaussian are not the same.
ame(nd, predict_integrateoutRE)
#> [1] 0.1110594
brmsmargins(
mlogit,
add = data.frame(x = c(0, h)),
contrasts = cbind("AME x" = c(-1 / h, 1 / h)),
effects = "integrateoutRE", k = 100L, seed = 1234)$ContrastSummary$Mdn
#> [1] 0.1112742
@isabellaghement
I do not think it is off base or unreasonable to consider narrow or broad inference. I have not given the narrow sense a lot of consideration as that has not been a focus in my own research.
I would think you could do the narrow inference in marginaleffects
by using the REs in predictions and specifying, for the input dataset, one specific actually observed ID for the REs. So rather than average across many IDs, only use one.
@vincentarelbundock Intriguing. I had not seen the sample new levels. Yes that sounds on target.
Agreed you'd not have identical replication due to random sampling. Can you control the number of levels? the k = 100L
is specifying 100 draws from the assumed RE distribution. As that goes up, you ought to converge on the theoretical integral.
How do the credible intervals look?
Yep, it’s easy to increase the number of draws, and the CI look basically the same. However, posterior_epred
is very slow, so your implementation is still a lot faster.
Here are three ultra-minimal examples.
library(brms)
library(brmsmargins)
mod <- brm(am ~ vs + (1 | gear), data = mtcars, family = "bernoulli", backend = "cmdstanr")
brmsmargins(
object = mod,
at = data.frame(vs = c(0, 1)),
CI = .95, CIType = "ETI",
contrasts = cbind("AME vs" = c(-1, 1)),
effects = "fixedonly")$ContrastSummary |>
data.frame()
#> M Mdn LL UL PercentROPE PercentMID CI CIType
#> 1 -0.4413672 -0.4098166 -0.999893 0.1445257 NA NA 0.95 ETI
#> ROPE MID Label
#> 1 <NA> <NA> AME vs
nd <- expand.grid(vs = 0:1, gear = 1)
nd$H <- ifelse(nd$vs == 0, -1, 1)
p <- brms::posterior_epred(
mod,
re_formula = NA,
newdata = nd)
quantile(nd$H %*% t(p), probs = c(.5, .025, .975))
#> 50% 2.5% 97.5%
#> -0.4098166 -0.9998930 0.1445257
brmsmargins(
object = mod,
at = data.frame(vs = c(0, 1)),
CI = .95, CIType = "ETI",
contrasts = cbind("AME vs" = c(-1, 1)),
effects = "includeRE")$ContrastSummary |>
data.frame()
#> M Mdn LL UL PercentROPE PercentMID CI
#> 1 -0.1321468 -0.1385877 -0.3117927 0.09439569 NA NA 0.95
#> CIType ROPE MID Label
#> 1 ETI <NA> <NA> AME vs
nd <- rbind(
transform(mtcars, vs = 0),
transform(mtcars, vs = 1))
K <- nrow(mtcars)
nd$H <- ifelse(nd$vs == 0, -1 / K, 1 / K)
p <- brms::posterior_epred(
mod,
newdata = nd)
quantile(nd$H %*% t(p), probs = c(.5, .025, .975))
#> 50% 2.5% 97.5%
#> -0.13858772 -0.31179267 0.09439569
K <- 1000
brmsmargins(
k = K,
object = mod,
at = data.frame(vs = c(0, 1)),
CI = .95, CIType = "ETI",
contrasts = cbind("AME vs" = c(-1, 1)),
effects = "integrateoutRE")$ContrastSummary |>
data.frame()
#> M Mdn LL UL PercentROPE PercentMID CI CIType
#> 1 -0.2168032 -0.1879975 -0.6459883 0.0722646 NA NA 0.95 ETI
#> ROPE MID Label
#> 1 <NA> <NA> AME vs
nd <- expand.grid(
vs = 0:1,
gear = sample(1e6:2e6, K))
nd$H <- ifelse(nd$vs == 0, -1 / K, 1 / K)
p <- brms::posterior_epred(
mod,
allow_new_levels = TRUE,
sample_new_levels = "gaussian",
newdata = nd)
quantile(nd$H %*% t(p), probs = c(.5, .025, .975))
#> 50% 2.5% 97.5%
#> -0.18651311 -0.63625547 0.07199052
Note on speed: it is the sample_new_levels="gaussian"
option which slows things down. Setting this argument to "uncertainty"
or "old_levels"
(probably fine with many clusters) is very fast.
Edit: slow brms
code is probably here: https://github.com/paul-buerkner/brms/blob/941057220d34e22db4ac9bd786cef557cedd1c1a/R/prepare_predictions.R#L1040
Edit 2: And in particular these two lines are very slow, where rmulti_normal
calls chol()
repeatedly:
So sorry for spamming your Github (I should find a different venue)!! But here’s a comparison of the 3 different kinds of “new levels”. In this particular example, “gaussian” comes very close to matching your results, but the other ones are not too far off and are somewhat faster. Using tictoc
to give a sense of speed.
library(brms)
library(brmsmargins)
library(tictoc)
dat <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/plm/EmplUK.csv")
dat$x <- as.numeric(dat$output > median(dat$output))
dat$y <- as.numeric(dat$emp > median(dat$emp))
mod <- brm(y ~ x + (1 | firm), data = dat, backend = "cmdstanr", family = "bernoulli")
K <- 1000
tic()
bm <- brmsmargins(
k = K,
object = mod,
at = data.frame(x = c(0, 1)),
CI = .95, CIType = "ETI",
contrasts = cbind("AME x" = c(-1, 1)),
effects = "integrateoutRE")
bm$ContrastSummary |> data.frame()
#> M Mdn LL UL PercentROPE PercentMID CI CIType
#> 1 0.09660978 0.09525802 0.05968685 0.1405216 NA NA 0.95 ETI
#> ROPE MID Label
#> 1 <NA> <NA> AME x
toc()
#> 36.21 sec elapsed
manual <- function(method = "old_levels") {
nd <- expand.grid(
x = 0:1,
firm = sample(1e6:2e6, K))
nd$H <- ifelse(nd$x == 0, -1 / K, 1 / K)
p <- posterior_epred(
mod,
allow_new_levels = TRUE,
sample_new_levels = method,
newdata = nd)
quantile(nd$H %*% t(p), probs = c(.5, .025, .975))
}
tic()
manual("gaussian")
#> 50% 2.5% 97.5%
#> 0.09539001 0.05998184 0.13916398
toc()
#> 30.687 sec elapsed
tic()
manual("old_levels")
#> 50% 2.5% 97.5%
#> 0.10122629 0.08244516 0.12020882
toc()
#> 0.924 sec elapsed
tic()
manual("uncertainty")
#> 50% 2.5% 97.5%
#> 0.10255900 0.08044943 0.12789991
toc()
#> 2.822 sec elapsed
@vincentarelbundock Very cool. I've no problems with these discussions continuing here on GitHub.
This is looking great. I haven't looked enough through the rest of the brms
framework, but indeed the sampling from the RE distribution seems equivalent. Presumably, this would make it easy to support multivariate outcomes as well which I know has been a request of several people?
From what I understand of the "old_levels"
approach, I'm not sure this has any guarantee to converge as the number of levels in the RE gets large.
Generally because there are also many posterior draws getting averaged over, I doubt that a very high K
is needed, so outside of large data problems, I don't think the longer time for the method = "gaussian"
is an issue and comes with certainty around bias & coverage that I think make it strongly preferable if the quantity of interest pertains to the average marginal effect in the sampled population.
I assume you'll add support now in marginaleffects
?
this would make it easy to support multivariate outcomes as well which I know has been a request of several people?
Yes, that would seem to be the main benefit for you. Should be mostly plug-and-play. Cool!
From what I understand of the “old_levels” approach, I’m not sure this has any guarantee to converge as the number of levels in the RE gets large.
Very good to know, thanks!
Generally because there are also many posterior draws getting averaged over, I doubt that a very high K is needed, so outside of large data problems, I don’t think the longer time for the method = “gaussian” is an issue and comes with certainty around bias & coverage that I think make it strongly preferable if the quantity of interest pertains to the average marginal effect in the sampled population.
That makes intuitive sense. It might eventually be fun to do some more serious benchmarking. I can’t replicate it now, but I ran into a case earlier where the posterior_epred
approach was much, much slower than brmsmargins
. Probably worth keeping an eye on…
I assume you’ll add support now in marginaleffects?
Yes, the dev version from Github currently has a prototype. I have not tested at all, but it generates output:
library(brms)
library(marginaleffects)
dat <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/plm/EmplUK.csv")
dat$x <- as.numeric(dat$output > median(dat$output))
dat$y <- round(dat$emp)
dat$y <- ifelse(dat$y > 5, 5, dat$y)
mod <- brm(
y ~ x + (1 | firm),
data = dat,
family = "categorical")
comparisons(
mod,
newdata = datagrid(firm = -100:-1),
allow_new_levels = TRUE,
sample_new_levels = "gaussian") |>
summary()
#> Group Term Contrast Effect 2.5 % 97.5 %
#> 1 0 x 1 - 0 -0.03499 -0.078327 -0.01163
#> 2 1 x 1 - 0 -0.08354 -0.136475 -0.04241
#> 3 2 x 1 - 0 -0.02185 -0.063338 0.01209
#> 4 3 x 1 - 0 0.04863 0.018110 0.09265
#> 5 4 x 1 - 0 0.03570 0.007809 0.08047
#> 6 5 x 1 - 0 0.05366 0.018884 0.11054
#>
#> Model type: brmsfit
#> Prediction type: response
@vincentarelbundock one more question on your implementation --- how are you handling AMEs for continuous variables? For 100 new levels to be sampled, the entire raw dataset would get expanded 100x?
Users can do whatever they want/need.
By default, the datagrid()
function holds all the variables which are not explicitly specified at their means or modes. In the code above, we had:
newdata = datagrid(firm = -100:-1)
This means that we make predictions only for a dataset with 100 rows (one per RE level) and all covariates at their means or modes. The result is a data frame with 500 rows: 100 rows per unit with 5 response levels.
You can use the datagridcf()
function ("cf" for "counterfactual") to replicate the full dataset 100x:
newdata = datagridcf(firm = -100:-1)
You can also specify for what kind of "hypothetical" individual(s) you want to compute marginaleffects:
newdata = datagrid(x = c(10, 100), z = "dingus", firm = -100:-1)
Then, the summary()
function will aggregate by response level. You can also take averages by groups of predictors by adding the by
argument in the main comparisons()
call.
Does that answer the question?
Dear Joshua,
Thank you very much for developing the brmsmargins package. I am trying to understand a paragraph included in the help file for the brmsmargins() function and would appreciate any clarifications you are able to provide. The paragraph in question is as follows:
"Use the fitted() function to generate predictions based on this modified dataset. If effects is set to “fixedonly” (meaning only generate predictions using fixed effects) or to “includeRE” (meaning generate predictions using fixed and random effects), then predictions are generated entirely using the fitted() function and are, typically back transformed to the response scale. For mixed effects models with fixed and random effects where effects is set to “integrateoutRE”, then fitted() is only used to generate predictions using the fixed effects on the linear scale. For each prediction generated, the random effects are integrated out by drawing k random samples from the model assumed random effect(s) distribution. These are added to the fixed effects predictions, back transformed, and then averaged over all k random samples to perform numerical Monte Carlo integration."
Question 1
Can you confirm whether the last part of the above paragraph, which starts with "For each prediction generated, the random effects are integrated out", applies to each of the settings effects = "fixedonly”, effects = "includeRE" and effects =“integrateoutRE”? Or does it apply only to the effects =“integrateoutRE” setting?
Question 2
Can you confirm whether both effects = "fixedonly” and effects =“integrateoutRE” ignore the random effects (i.e., they set them to 0), with the difference between them being that effects = "fixedonly” returns the results on the natural response scale while effects =“integrateoutRE” on the link scale? (For example, for a mixed effects binary regression model, effects = "fixedonly” returns fitted probabilities whereas effects =“integrateoutRE” returns fitted log odds.)
Question 3
If both effects = "fixedonly” and effects =“integrateoutRE” set the random effects in the fitted part of the model to 0, I do not understand why we would want to "integrate out the random effects" for these two settings. The only setting for which it would make sense to "integrate out the random effects" would be effects = "includeRE" (since that is the only setting where the random effects would NOT be set to 0; unless, perhaps, there are multiple random effects in the model and only some of those would be set to 0?). I don't see a re_formula as an argument to brmsmargins() which would be used to indicate which random effects are set to 0 and which are not, so I am not sure how one would even specify which random effects are set to 0 and which are kept in the model?
Question 4
When would one want to use effects = "includeRE" versus effects =“integrateoutRE”?
Question 5
Getting back to the example of a mixed effects binary logistic regression model (with just a random intercept u for simplicity and 2 predictors, X1 and X2), if I needed to compute something like Prob(Y = 1|X1, X2, u) for X1 and X2 set at 2 relevant values x1 and x2 and u averaged across the observed random intercept values, would there be a way to get this out of brmsmargins? If I then needed to compute Prob(Y = 1|X1, X2, u) but averaged across random realizations of u from the assumed distribution of the random intercepts - say, Normal(0, sigma_u^2) with sigma_u^2 estimated from the data - would there be a way to compute this via brmsmargins? What does the latter estimate? What does the former estimate? Why not use the former instead of the latter (e.g., because we can get a better approximation if we use more random effects than what was observed in the data)?