Closed vincentarelbundock closed 7 months ago
All modified and coverable lines are covered by tests :white_check_mark:
Comparison is base (
9e2bb40
) 35.10% compared to head (3e666dc
) 34.00%.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Thanks Vincent, I've been finally experimenting a bit with the marginaleffects API, so I think I'm in a better position now to push do the overhaul that modelbased needs to depend on the latest marginaleffects
Sounds fantastic!
Don't hesitate to ping me if you need input, review, or code snippets.
Hi @vincentarelbundock I'm giving this a shot,
I'd like to make marginaleffects::predictions() work with insight::get_datagrid(), could you let me know if it's possible to, in the following example, somehow specify to predictions()
that I want to marginalize over the continuous predictor Sepal.Width. I thought that providing a newdata grid with "missing" values would work, but no, so tried adding the by
argument so specify, but it doesn't seem to do what I have in mind.
model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)
grid <- insight::get_datagrid(model, at="Species")
grid$Sepal.Width <- NULL
grid
#> Species
#> 1 setosa
#> 2 versicolor
#> 3 virginica
marginaleffects::predictions(model,
newdata=grid,
by="Species")
Created on 2024-02-11 with reprex v2.0.2
tldr; what is the best way to obtain marginal means at species levels marginalized over Sepal.width?
What do you mean specifically by "marginalized over Sepal.Width"?
predictions()
always requires a “full” grid, with all the variables used in the model. You should really think about it as enhanced version of the Base R function predict()
. That’s why I used the same argument name newdata
.
In addition, you get the by
argument which specifies the categorical variables with respect to which we are averaging.
For example, you can marginalize over a balanced grid like this:
library(marginaleffects)
model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)
grid <- expand.grid(
Sepal.Width = unique(iris$Sepal.Width),
Species = unique(iris$Species)
)
predictions(model,
newdata = grid,
by = "Species")
#
# Species Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
# setosa 1.44 0.0638 22.6 <0.001 374.6 1.32 1.57
# versicolor 4.62 0.0930 49.7 <0.001 Inf 4.44 4.80
# virginica 5.71 0.0668 85.5 <0.001 Inf 5.58 5.84
#
# Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
# Type: response
Or you can marginalized over the empirical distribution of the actual data like the following (this is equivalent to the “cells” weights in emmeans
):
predictions(model,
newdata = iris,
by = "Species")
#
# Species Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
# setosa 1.46 0.0545 26.8 <0.001 524.4 1.36 1.57
# versicolor 4.26 0.0545 78.2 <0.001 Inf 4.15 4.37
# virginica 5.55 0.0545 101.9 <0.001 Inf 5.45 5.66
#
# Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
# Type: response
Say I want to reproduce the emmeans
results:
model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)
emmeans::emmeans(model, "Species")
#> NOTE: Results may be misleading due to involvement in interactions
#> Species emmean SE df lower.CL upper.CL
#> setosa 1.43 0.0766 144 1.28 1.58
#> versicolor 4.50 0.0742 144 4.35 4.65
#> virginica 5.61 0.0563 144 5.50 5.72
#>
#> Confidence level used: 0.95
Is it correct that the "optimal" way to do it in marginaleffects is the first option below:
marginaleffects::predictions(model,
newdata=marginaleffects::datagrid(grid_type = "balanced"),
variables="Species",
by="Species")
#>
#> Species Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
#> setosa 1.43 0.0766 18.7 <0.001 256.7 1.28 1.58
#> versicolor 4.50 0.0742 60.6 <0.001 Inf 4.36 4.65
#> virginica 5.61 0.0563 99.6 <0.001 Inf 5.50 5.72
#>
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
#> Type: response
marginaleffects::predictions(model,
by="Species")
#>
#> Species Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
#> setosa 1.46 0.0545 26.8 <0.001 524.4 1.36 1.57
#> versicolor 4.26 0.0545 78.2 <0.001 Inf 4.15 4.37
#> virginica 5.55 0.0545 101.9 <0.001 Inf 5.45 5.66
#>
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
#> Type: response
Created on 2024-02-11 with reprex v2.0.2
I would intuitively expect the second to give the same results...
Oh I see, after reading your comment I think that basically marginaleffects requires to explicitly state what to marginalize over.
If I understand, there are roughly 3 types of "means" you could get from the model:
model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)
# Marginalized over range
marginaleffects::predictions(model,
newdata = insight::get_datagrid(model),
by = "Species")
#>
#> Species Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
#> setosa 1.47 0.0548 26.7 <0.001 521.2 1.36 1.57
#> versicolor 4.17 0.0574 72.7 <0.001 Inf 4.06 4.29
#> virginica 5.52 0.0549 100.6 <0.001 Inf 5.42 5.63
#>
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
#> Type: response
# Marginalize over empirical distribution
marginaleffects::predictions(model,
newdata = insight::get_data(model),
by = "Species")
#>
#> Species Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
#> setosa 1.46 0.0545 26.8 <0.001 524.4 1.36 1.57
#> versicolor 4.26 0.0545 78.2 <0.001 Inf 4.15 4.37
#> virginica 5.55 0.0545 101.9 <0.001 Inf 5.45 5.66
#>
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high
#> Type: response
# Predictions at fixed mean
marginaleffects::predictions(model,
newdata = insight::get_datagrid(model, at="Species"))
#>
#> Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 % Species Sepal.Width
#> 1.43 0.0766 18.7 <0.001 256.7 1.28 1.58 setosa 3.06
#> 4.50 0.0742 60.6 <0.001 Inf 4.36 4.65 versicolor 3.06
#> 5.61 0.0563 99.6 <0.001 Inf 5.50 5.72 virginica 3.06
#>
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, Species, Sepal.Width, Petal.Length
#> Type: response
Created on 2024-02-11 with reprex v2.0.2
Am I on the right track? Sorry for the dumb question I'm trying to wrap my head around the newest API
Yes, that's exactly right!
And if you're interested, here's a notebook where I show how to get equivalent results to emmeans
with "re-gridding" and transformations:
Thanks for your help @vincentarelbundock, I think I managed to find a fix for now to have estimate_means() working (I pushed directly on the main so I'll close this but thanks!)
I am preparing for release 1.0.0 of
marginaleffects
and deprecating functions that will not be part of the “final / stable” user interface.The
marginal_means
function now raises a deprecation warning and will be removed eventually.The recommended strategy is now to use:
predictions()
with thedatagrid()
function to create a balanced grid, and theby
argument to specify the variables for marginalization.IMHO, this is more explicit and in line with the rest of the
marginaleffects
function design: