easystats / modelbased

:chart_with_upwards_trend: Estimate effects, contrasts and means based on statistical models
https://easystats.github.io/modelbased/
GNU General Public License v3.0
232 stars 19 forks source link

marginalmeans backend for estimate_means() #214

Open DominiqueMakowski opened 1 year ago

DominiqueMakowski commented 1 year ago

Okay I made a few attempts, but the issue is I cannot seem to be able to pass a datagrid to marginaleffects::marginalmeans(). Which blocks me from doing things like:

library(modelbased)
#> Warning: package 'modelbased' was built under R version 4.2.1

model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)

estimate_means(model, fixed = "Sepal.Width")
#> We selected `at = c("Species")`.
#> Estimated Marginal Means
#> 
#> Species    | Sepal.Width | Mean |   SE |       95% CI
#> -----------------------------------------------------
#> setosa     |        3.06 | 1.43 | 0.08 | [1.28, 1.58]
#> versicolor |        3.06 | 4.50 | 0.07 | [4.35, 4.65]
#> virginica  |        3.06 | 5.61 | 0.06 | [5.50, 5.72]
#> 
#> Marginal means estimated at Species
estimate_means(model, at = c("Species", "Sepal.Width"), length = 2)
#> Estimated Marginal Means
#> 
#> Species    | Sepal.Width | Mean |   SE |       95% CI
#> -----------------------------------------------------
#> setosa     |        2.00 | 1.35 | 0.21 | [0.92, 1.77]
#> versicolor |        2.00 | 3.61 | 0.15 | [3.33, 3.90]
#> virginica  |        2.00 | 4.88 | 0.17 | [4.54, 5.23]
#> setosa     |        4.40 | 1.54 | 0.15 | [1.24, 1.84]
#> versicolor |        4.40 | 5.63 | 0.29 | [5.05, 6.20]
#> virginica  |        4.40 | 6.53 | 0.25 | [6.04, 7.02]
#> 
#> Marginal means estimated at Species, Sepal.Width
estimate_means(model, at = "Species=c('versicolor', 'setosa')")
#> NOTE: Results may be misleading due to involvement in interactions
#> Estimated Marginal Means
#> 
#> Species    | Mean |   SE |       95% CI
#> ---------------------------------------
#> versicolor | 4.50 | 0.07 | [4.35, 4.65]
#> setosa     | 1.43 | 0.08 | [1.28, 1.58]
#> 
#> Marginal means estimated at Species
estimate_means(model, at = "Sepal.Width=c(2, 4)")
#> NOTE: Results may be misleading due to involvement in interactions
#> Estimated Marginal Means
#> 
#> Sepal.Width | Mean |   SE |       95% CI
#> ----------------------------------------
#> 2.00        | 3.28 | 0.10 | [3.07, 3.49]
#> 4.00        | 4.35 | 0.10 | [4.15, 4.55]
#> 
#> Marginal means estimated at Sepal.Width
estimate_means(model, at = c("Species", "Sepal.Width=0"))
#> Estimated Marginal Means
#> 
#> Species    | Sepal.Width | Mean |   SE |       95% CI
#> -----------------------------------------------------
#> setosa     |        0.00 | 1.18 | 0.50 | [0.19, 2.17]
#> versicolor |        0.00 | 1.93 | 0.49 | [0.97, 2.90]
#> virginica  |        0.00 | 3.51 | 0.51 | [2.50, 4.52]
#> 
#> Marginal means estimated at Species, Sepal.Width
estimate_means(model, at = "Sepal.Width", length = 5)
#> NOTE: Results may be misleading due to involvement in interactions
#> Estimated Marginal Means
#> 
#> Sepal.Width | Mean |   SE |       95% CI
#> ----------------------------------------
#> 2.00        | 3.28 | 0.10 | [3.07, 3.49]
#> 2.60        | 3.60 | 0.06 | [3.49, 3.71]
#> 3.20        | 3.92 | 0.04 | [3.84, 4.01]
#> 3.80        | 4.25 | 0.08 | [4.08, 4.41]
#> 4.40        | 4.57 | 0.14 | [4.30, 4.84]
#> 
#> Marginal means estimated at Sepal.Width
estimate_means(model, at = "Sepal.Width=c(2, 4)")
#> NOTE: Results may be misleading due to involvement in interactions
#> Estimated Marginal Means
#> 
#> Sepal.Width | Mean |   SE |       95% CI
#> ----------------------------------------
#> 2.00        | 3.28 | 0.10 | [3.07, 3.49]
#> 4.00        | 4.35 | 0.10 | [4.15, 4.55]
#> 
#> Marginal means estimated at Sepal.Width

Created on 2022-09-26 by the reprex package (v2.0.1)

Basically controlling at which levels of Sepal.Width I want the means. Is that the right way of doing it or should one directly use predictions on the datagrid?

@bwiernik is there a reason do use marginalmeans() over simply get_predicted on the datagrid one wants the means of? (@vincentarelbundock but I don't wanna tag you each time I need help with some probably basic thing ^^)

bwiernik commented 1 year ago

marginaleffects has its own datagrid function, so passing a grid is definitely possible.

The purpose of marginal means over get_predicted is if you want to marginalize over other predictors (eg, average mean in control group across ethnic groups). marginaleffects has several options for how to handle continuous variables when averaging

vincentarelbundock commented 1 year ago

To begin, let me introduce a conceptual difference between two functions: predictions() and marginalmeans().

predictions() accepts a grid (newdata argument) in the form of a data frame, and it makes predictions for each row of that data frame. When calling predictions(), we can also use the by argument to average-out or "marginalize" across some variables from the model.

marginalmeans() is a handy shortcut function which does not accept a grid, but does three things under the hood:

  1. Create a grid automatically
  2. Call predictions() to make predictions on that grid
  3. Marginalize across some categorical variables

In almost all of the examples you show, there is no averaging across categorical variables going on in estimate_means(), so we don't actually have to call marginalmeans(). Calling predictions() works just fine. At the bottom of this post I give code to replicate the results of all your examples.

Before that, I'll give a few examples of marginalmeans() to illustrate the equivalences.

model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)

mod <- lm(mpg ~ factor(am) + factor(vs) + factor(gear), data = mtcars)

# 1 call to `marginalmeans()` is equivalent to 3 calls to `estimate_means()`
marginalmeans(mod)
estimate_means(mod, at = "am")
estimate_means(mod, at = "vs")
estimate_means(mod, at = "gear")

# Marginalize across values of `gear`. note the `interaction` argument. 
marginalmeans(mod, variables = c("am", "vs"), interaction = TRUE)
estimate_means(mod, at = c("am", "vs"))

Now, replications of your examples:

library(modelbased)
library(marginaleffects)
model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)

estimate_means(
    model,
    fixed = "Sepal.Width")
predictions(
    model,
    newdata = datagrid(Species = unique))

estimate_means(
    model, 
    at = c("Species", "Sepal.Width"), length = 2)
predictions(
    model,
    newdata = datagrid(Species = unique, Sepal.Width = c(2, 4.4)))

estimate_means(
    model,
    at = "Species=c('versicolor', 'setosa')")
predictions(
    model,
    newdata = datagrid(Species = c("versicolor", "setosa")))

estimate_means(
    model, 
    at = "Sepal.Width",
    length = 5)
predictions(
    model,
    newdata = datagrid(Sepal.Width = fivenum, Species = unique),
    by = "Sepal.Width")

estimate_means(
    model,
    at = "Sepal.Width=c(2, 4)")
predictions(
    model,
    newdata = datagrid(Sepal.Width = c(2, 4), Species = unique),
    by = "Sepal.Width")
vincentarelbundock commented 1 year ago

Of course, you can use the easystats version of the datagrid() builder and feed that to the newdata argument in predictions(). This would bring the two approaches closer in syntax.

strengejacke commented 3 months ago

248 as reminder/reference.

I have some experience now in wrapping marginaleffects functions to get the desired results, so maybe we can try to continue on this issue. The major distinctions/use cases I see (but feel free to correct/add):

Furthermore, for mixed models:

Once we have these different estimate_*() functions/options, we can easily get the contrasts or pairwise comparisons by providing the hypothesis argument to those function calls.

WDYT?