easystats / modelbased

:chart_with_upwards_trend: Estimate effects, contrasts and means based on statistical models
https://easystats.github.io/modelbased/
GNU General Public License v3.0
234 stars 19 forks source link

`marginaleffects::marginal_means` deprecation #246

Closed vincentarelbundock closed 7 months ago

vincentarelbundock commented 8 months ago

I am preparing for release 1.0.0 of marginaleffects and deprecating functions that will not be part of the “final / stable” user interface.

The marginal_means function now raises a deprecation warning and will be removed eventually.

The recommended strategy is now to use: predictions() with the datagrid() function to create a balanced grid, and the by argument to specify the variables for marginalization.

IMHO, this is more explicit and in line with the rest of the marginaleffects function design:

library(emmeans)
library(marginaleffects)

pigs.lm <- lm(log(conc) ~ source + factor(percent), data = pigs)

emm.src <- emmeans(pigs.lm, specs = "source")
summary(emm.src, infer = TRUE, null = log(35))
#  source emmean     SE df lower.CL upper.CL null t.ratio p.value
#  fish     3.39 0.0367 23     3.32     3.47 3.56  -4.385  0.0002
#  soy      3.67 0.0374 23     3.59     3.74 3.56   2.988  0.0066
#  skim     3.80 0.0394 23     3.72     3.88 3.56   6.130  <.0001
# 
# Results are averaged over the levels of: percent 
# Results are given on the log (not the response) scale. 
# Confidence level used: 0.95

predictions(pigs.lm,                          # predictions using the `pigs.lm` model
  newdata = datagrid(grid_type = "balanced"), # on a balanced grid
  by = "source",                              # averaged across the `source` variable
  hypothesis = log(35),                       # null hypothesis
  df = 23)
# 
#  source Estimate Std. Error     t Pr(>|t|)    S 2.5 % 97.5 % Df
#    fish     3.39     0.0367 -4.39  < 0.001 12.2  3.32   3.47 23
#    soy      3.67     0.0374  2.99  0.00657  7.3  3.59   3.74 23
#    skim     3.80     0.0394  6.13  < 0.001 18.4  3.72   3.88 23
# 
# Columns: source, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, df 
# Type:  response
codecov[bot] commented 8 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (9e2bb40) 35.10% compared to head (3e666dc) 34.00%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #246 +/- ## ========================================== - Coverage 35.10% 34.00% -1.11% ========================================== Files 25 24 -1 Lines 1168 1147 -21 ========================================== - Hits 410 390 -20 + Misses 758 757 -1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

DominiqueMakowski commented 8 months ago

Thanks Vincent, I've been finally experimenting a bit with the marginaleffects API, so I think I'm in a better position now to push do the overhaul that modelbased needs to depend on the latest marginaleffects

vincentarelbundock commented 8 months ago

Sounds fantastic!

Don't hesitate to ping me if you need input, review, or code snippets.

DominiqueMakowski commented 7 months ago

Hi @vincentarelbundock I'm giving this a shot,

I'd like to make marginaleffects::predictions() work with insight::get_datagrid(), could you let me know if it's possible to, in the following example, somehow specify to predictions() that I want to marginalize over the continuous predictor Sepal.Width. I thought that providing a newdata grid with "missing" values would work, but no, so tried adding the by argument so specify, but it doesn't seem to do what I have in mind.

model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)
grid <- insight::get_datagrid(model, at="Species")
grid$Sepal.Width <- NULL
grid
#>      Species
#> 1     setosa
#> 2 versicolor
#> 3  virginica

marginaleffects::predictions(model,
                                        newdata=grid,
                                        by="Species")

Created on 2024-02-11 with reprex v2.0.2

tldr; what is the best way to obtain marginal means at species levels marginalized over Sepal.width?

vincentarelbundock commented 7 months ago

What do you mean specifically by "marginalized over Sepal.Width"?

vincentarelbundock commented 7 months ago

predictions() always requires a “full” grid, with all the variables used in the model. You should really think about it as enhanced version of the Base R function predict(). That’s why I used the same argument name newdata.

In addition, you get the by argument which specifies the categorical variables with respect to which we are averaging.

For example, you can marginalize over a balanced grid like this:

library(marginaleffects)

model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)

grid <- expand.grid(
  Sepal.Width = unique(iris$Sepal.Width),
  Species = unique(iris$Species)
)

predictions(model,
  newdata = grid,
  by = "Species")
# 
#     Species Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
#  setosa         1.44     0.0638 22.6   <0.001 374.6  1.32   1.57
#  versicolor     4.62     0.0930 49.7   <0.001   Inf  4.44   4.80
#  virginica      5.71     0.0668 85.5   <0.001   Inf  5.58   5.84
# 
# Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
# Type:  response

Or you can marginalized over the empirical distribution of the actual data like the following (this is equivalent to the “cells” weights in emmeans):

predictions(model,
  newdata = iris,
  by = "Species")
# 
#     Species Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
#  setosa         1.46     0.0545  26.8   <0.001 524.4  1.36   1.57
#  versicolor     4.26     0.0545  78.2   <0.001   Inf  4.15   4.37
#  virginica      5.55     0.0545 101.9   <0.001   Inf  5.45   5.66
# 
# Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
# Type:  response
DominiqueMakowski commented 7 months ago

Say I want to reproduce the emmeans results:

model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)
emmeans::emmeans(model, "Species")
#> NOTE: Results may be misleading due to involvement in interactions
#>  Species    emmean     SE  df lower.CL upper.CL
#>  setosa       1.43 0.0766 144     1.28     1.58
#>  versicolor   4.50 0.0742 144     4.35     4.65
#>  virginica    5.61 0.0563 144     5.50     5.72
#> 
#> Confidence level used: 0.95

Is it correct that the "optimal" way to do it in marginaleffects is the first option below:

marginaleffects::predictions(model,
                             newdata=marginaleffects::datagrid(grid_type = "balanced"),
                             variables="Species",
                             by="Species")
#> 
#>     Species Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
#>  setosa         1.43     0.0766 18.7   <0.001 256.7  1.28   1.58
#>  versicolor     4.50     0.0742 60.6   <0.001   Inf  4.36   4.65
#>  virginica      5.61     0.0563 99.6   <0.001   Inf  5.50   5.72
#> 
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

marginaleffects::predictions(model,
                             by="Species")
#> 
#>     Species Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
#>  setosa         1.46     0.0545  26.8   <0.001 524.4  1.36   1.57
#>  versicolor     4.26     0.0545  78.2   <0.001   Inf  4.15   4.37
#>  virginica      5.55     0.0545 101.9   <0.001   Inf  5.45   5.66
#> 
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

Created on 2024-02-11 with reprex v2.0.2

I would intuitively expect the second to give the same results...

DominiqueMakowski commented 7 months ago

Oh I see, after reading your comment I think that basically marginaleffects requires to explicitly state what to marginalize over.

If I understand, there are roughly 3 types of "means" you could get from the model:

model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)

# Marginalized over range
marginaleffects::predictions(model,
                             newdata = insight::get_datagrid(model),
                             by = "Species")
#> 
#>     Species Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
#>  setosa         1.47     0.0548  26.7   <0.001 521.2  1.36   1.57
#>  versicolor     4.17     0.0574  72.7   <0.001   Inf  4.06   4.29
#>  virginica      5.52     0.0549 100.6   <0.001   Inf  5.42   5.63
#> 
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

# Marginalize over empirical distribution
marginaleffects::predictions(model,
                             newdata = insight::get_data(model),
                             by = "Species")
#> 
#>     Species Estimate Std. Error     z Pr(>|z|)     S 2.5 % 97.5 %
#>  setosa         1.46     0.0545  26.8   <0.001 524.4  1.36   1.57
#>  versicolor     4.26     0.0545  78.2   <0.001   Inf  4.15   4.37
#>  virginica      5.55     0.0545 101.9   <0.001   Inf  5.45   5.66
#> 
#> Columns: Species, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high 
#> Type:  response

# Predictions at fixed mean
marginaleffects::predictions(model,
                             newdata = insight::get_datagrid(model, at="Species"))
#> 
#>  Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %    Species Sepal.Width
#>      1.43     0.0766 18.7   <0.001 256.7  1.28   1.58 setosa            3.06
#>      4.50     0.0742 60.6   <0.001   Inf  4.36   4.65 versicolor        3.06
#>      5.61     0.0563 99.6   <0.001   Inf  5.50   5.72 virginica         3.06
#> 
#> Columns: rowid, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, Species, Sepal.Width, Petal.Length 
#> Type:  response

Created on 2024-02-11 with reprex v2.0.2

Am I on the right track? Sorry for the dumb question I'm trying to wrap my head around the newest API

vincentarelbundock commented 7 months ago

Yes, that's exactly right!

And if you're interested, here's a notebook where I show how to get equivalent results to emmeans with "re-gridding" and transformations:

https://arelbundock.com/emmeans_and_marginaleffects.html

DominiqueMakowski commented 7 months ago

Thanks for your help @vincentarelbundock, I think I managed to find a fix for now to have estimate_means() working (I pushed directly on the main so I'll close this but thanks!)