easystats / datawizard

Magic potions to clean and transform your data 🧙
https://easystats.github.io/datawizard/
Other
209 stars 15 forks source link

degroup() for cross-classified data #520

Closed jmgirard closed 2 months ago

jmgirard commented 2 months ago

There's a nice new paper on disaggregating predictors in cross-classified multilevel models (see below). Below, I show the way this could be done manually using datawizard functions. Would be great to support this with degroup().

$$ \begin{align} \text{Between } J &= \bar x_j \ \text{Between } K &= \bar xk \ \text{Within} &= x{ijk} - \bar x_j - \bar x_k \ \end{align} $$

Guo, Y., Dhaliwal, J., & Rights, J. D. (2024). Disaggregating level-specific effects in cross-classified multilevel models. Behavior Research Methods, 56(4), 3023–3057. https://doi.org/10.3758/s13428-023-02238-7

Setup ``` r df <- palmerpenguins::penguins library(easystats) #> # Attaching packages: easystats 0.7.2 #> ✔ bayestestR 0.13.2 ✔ correlation 0.8.5 #> ✔ datawizard 0.11.0 ✔ effectsize 0.8.8 #> ✔ insight 0.20.1 ✔ modelbased 0.8.8 #> ✔ performance 0.12.0 ✔ parameters 0.22.0 #> ✔ report 0.5.8 ✔ see 0.8.4 ```
Two-level nested example (good) ``` r # Two-level nested example ## Using data_group and data_modify x1a <- df |> data_group(species) |> data_modify( body_mass_g_between = mean(body_mass_g, na.rm = TRUE) ) |> data_ungroup() |> data_modify( body_mass_g_within = body_mass_g - body_mass_g_between ) head(x1a) #> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g #> 1 Adelie Torgersen 39.1 18.7 181 3750 #> 2 Adelie Torgersen 39.5 17.4 186 3800 #> 3 Adelie Torgersen 40.3 18.0 195 3250 #> 4 Adelie Torgersen NA NA NA NA #> 5 Adelie Torgersen 36.7 19.3 193 3450 #> 6 Adelie Torgersen 39.3 20.6 190 3650 #> sex year body_mass_g_between body_mass_g_within #> 1 male 2007 3700.662 49.33775 #> 2 female 2007 3700.662 99.33775 #> 3 female 2007 3700.662 -450.66225 #> 4 2007 3700.662 NA #> 5 female 2007 3700.662 -250.66225 #> 6 male 2007 3700.662 -50.66225 ``` ``` r ## Using degroup x1b <- degroup(df, select = "body_mass_g", by = "species") head(x1b) #> body_mass_g_between body_mass_g_within #> 1 3700.662 49.33775 #> 2 3700.662 99.33775 #> 3 3700.662 -450.66225 #> 4 3700.662 NA #> 5 3700.662 -250.66225 #> 6 3700.662 -50.66225 ```
Cross-classified example (request) ``` r # Cross-classified example ## Using data_group and data_modify x2a <- df |> data_group(species) |> data_modify( body_mass_g_species = mean(body_mass_g, na.rm = TRUE) ) |> data_group(island) |> data_modify( body_mass_g_island = mean(body_mass_g, na.rm = TRUE) ) |> data_ungroup() |> data_modify( body_mass_g_within = body_mass_g - body_mass_g_species - body_mass_g_island ) head(x2a) #> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g #> 1 Adelie Torgersen 39.1 18.7 181 3750 #> 2 Adelie Torgersen 39.5 17.4 186 3800 #> 3 Adelie Torgersen 40.3 18.0 195 3250 #> 4 Adelie Torgersen NA NA NA NA #> 5 Adelie Torgersen 36.7 19.3 193 3450 #> 6 Adelie Torgersen 39.3 20.6 190 3650 #> sex year body_mass_g_species body_mass_g_island body_mass_g_within #> 1 male 2007 3700.662 3706.373 -3657.035 #> 2 female 2007 3700.662 3706.373 -3607.035 #> 3 female 2007 3700.662 3706.373 -4157.035 #> 4 2007 3700.662 3706.373 NA #> 5 female 2007 3700.662 3706.373 -3957.035 #> 6 male 2007 3700.662 3706.373 -3757.035 ``` ``` r ## Desired/suggested syntax for degroup degroup( df, select = "body_mass_g", by = c("species", "island"), suffix_demean = "_within", suffix_groupmean = c("_species", "_island") ) #> Error in `dat[[by]]`: #> ! Can't extract column with `by`. #> ✖ Subscript `by` must be size 1, not 2. ```
strengejacke commented 2 months ago
library(datawizard)
df <- palmerpenguins::penguins
x2a <- 
  df |> 
  data_group(species) |> 
  data_modify(
    body_mass_g_species = mean(body_mass_g, na.rm = TRUE)
  ) |> 
  data_ungroup() |> 
  data_group(island) |> 
  data_modify(
    body_mass_g_island = mean(body_mass_g, na.rm = TRUE)
  ) |> 
  data_ungroup() |> 
  data_modify(
    body_mass_g_within = body_mass_g - body_mass_g_species - body_mass_g_island
  )

out <- degroup(
  df,
  select = "body_mass_g",
  by = c("species", "island"),
  suffix_demean = "_within"
)

all.equal(
  out$body_mass_g_species_between,
  x2a$body_mass_g_species,
  check.attributes = FALSE
)
#> [1] TRUE
all.equal(
  out$body_mass_g_within,
  x2a$body_mass_g_within,
  check.attributes = FALSE
)
#> [1] TRUE

x2a <-
  df |>
  data_group(species) |>
  data_modify(
    body_mass_g_species = mean(body_mass_g, na.rm = TRUE),
    bill_length_mm_species = mean(bill_length_mm, na.rm = TRUE)
  ) |>
  data_ungroup() |>
  data_group(island) |>
  data_modify(
    body_mass_g_island = mean(body_mass_g, na.rm = TRUE),
    bill_length_mm_island = mean(bill_length_mm, na.rm = TRUE)
  ) |>
  data_ungroup() |>
  data_modify(
    body_mass_g_within = body_mass_g - body_mass_g_species - body_mass_g_island,
    bill_length_mm_within = bill_length_mm - bill_length_mm_species - bill_length_mm_island
  )

out <- degroup(
  df,
  select = c("body_mass_g", "bill_length_mm"),
  by = c("species", "island"),
  suffix_demean = "_within"
)

all.equal(
  out$body_mass_g_species_between,
  x2a$body_mass_g_species,
  check.attributes = FALSE
)
#> [1] TRUE
all.equal(
  out$bill_length_mm_island_between,
  x2a$bill_length_mm_island,
  check.attributes = FALSE
)
#> [1] TRUE
all.equal(
  out$bill_length_mm_within,
  x2a$bill_length_mm_within,
  check.attributes = FALSE
)
#> [1] TRUE
all.equal(
  out$body_mass_g_within,
  x2a$body_mass_g_within,
  check.attributes = FALSE
)
#> [1] TRUE

Created on 2024-06-27 with reprex v2.1.0

strengejacke commented 2 months ago

Do you know how to do this for three levels? For now, when by > 2, I just subtract all group means from the individual values.

strengejacke commented 2 months ago

@jmgirard This seems to be appropriate for cross-classified designs, but what about nested 3-level models? In this case, by could indicate the groups/clusters on level 2 and level 3 - but I think we then need a different way of group means, maybe like an "interaction" of the by variable? WDYT?

jmgirard commented 2 months ago

If you cross-classify by more than two clusters, then you just keep subtracting the cluster means. But you're right that things may be different for nested designs like a three-level model or a hybrid model with both nesting and crossing. I'll look into this a bit more.

strengejacke commented 2 months ago

Here's a quick comparison:

df <- palmerpenguins::penguins
library(easystats)
#> # Attaching packages: easystats 0.7.2.2
#> ✔ bayestestR  0.13.2.2    ✔ correlation 0.8.5    
#> ✔ datawizard  0.11.0.5    ✔ effectsize  0.8.8.2  
#> ✔ insight     0.20.1.10   ✔ modelbased  0.8.8    
#> ✔ performance 0.12.0.4    ✔ parameters  0.22.0.1 
#> ✔ report      0.5.8.4     ✔ see         0.8.4.6

# group by species
out <- degroup(
  df,
  select = "body_mass_g",
  by = "species",
  suffix_demean = "_within"
)
head(out)
#>   body_mass_g_between body_mass_g_within
#> 1            3700.662           49.33775
#> 2            3700.662           99.33775
#> 3            3700.662         -450.66225
#> 4            3700.662                 NA
#> 5            3700.662         -250.66225
#> 6            3700.662          -50.66225

# group by year
out <- degroup(
  df,
  select = "body_mass_g",
  by = "year",
  suffix_demean = "_within"
)
head(out)
#>   body_mass_g_between body_mass_g_within
#> 1            4124.541          -374.5413
#> 2            4124.541          -324.5413
#> 3            4124.541          -874.5413
#> 4            4124.541                 NA
#> 5            4124.541          -674.5413
#> 6            4124.541          -474.5413

# cross-classified: large within means?
out <- degroup(
  df,
  select = "body_mass_g",
  by = c("species", "year"),
  suffix_demean = "_within"
)
head(out)
#>   body_mass_g_species_between body_mass_g_year_between body_mass_g_within
#> 1                    3700.662                 4124.541          -4075.204
#> 2                    3700.662                 4124.541          -4025.204
#> 3                    3700.662                 4124.541          -4575.204
#> 4                    3700.662                 4124.541                 NA
#> 5                    3700.662                 4124.541          -4375.204
#> 6                    3700.662                 4124.541          -4175.204

# nested: looks ok
out <- degroup(
  df,
  select = "body_mass_g",
  by = "species*year",
  suffix_demean = "_within"
)
#> Warning: Note that de-meaning for nested data structures (i.e. interaction terms
#>   for the `by`-variable) is not yet validated. Please check the results.
head(out)
#>   body_mass_g_between body_mass_g_within
#> 1            3696.429           53.57143
#> 2            3696.429          103.57143
#> 3            3696.429         -446.42857
#> 4            3696.429                 NA
#> 5            3696.429         -246.42857
#> 6            3696.429          -46.42857

Created on 2024-06-27 with reprex v2.1.0

jmgirard commented 2 months ago

This seems to be the reference for three-level and hybrid models:

Raudenbush, S. W. (2009). Adaptive Centering with Random Effects: An Alternative to the Fixed Effects Model for Studying Time-Varying Treatments in School Settings. Education Finance and Policy, 4(4), 468–491. https://doi.org/10.1162/edfp.2009.4.4.468

See equation (20) on page 484: $$X^ = X - A(A^{T} V^{ -1} A)^{-1} A^T V^{* -1} X$$

strengejacke commented 2 months ago

I wish people would write R code instead of formulas...

A is composed of elements 1 or 0 such that each element of the random effect vector b is assigned the correct unit;

V∗ are positive definite covariance matrices, considered known.

I guess V* is just vcov(), but how does A look like, or A(T)?

strengejacke commented 2 months ago

This is usually the point where I call @bwiernik for help (or maybe @bbolker has time to help out?)... How to translate the formula in https://github.com/easystats/datawizard/issues/520#issuecomment-2195085052 into R code?

jmgirard commented 2 months ago

I agree. Honestly I read that adaptive centering paper a long time ago but never really understood it well enough to implement it. This is why I was excited by the new Guo et al. paper.

bbolker commented 2 months ago

$A$ is what we (in the R/lme4 corner of the world) call the $Z$ matrix, at least in the random-intercept case — an indicator matrix (it wouldn't extend to the random-slopes model, though).

From the paper

$$ Y = 1 \theta + \tilde X \beta + A b + e, \quad b \sim N(0, \Omega), \quad e \sim N(0, V^*) $$

so $A$ is indeed our " $Z$ matrix" (this is the general form of the equation, I don't know if the machinery in the paper supports $Z$ matrices that are more complex than indicator matrices? $V^$ is the covariance matrix of the residuals, which at least for lme4 and glmmTMB is diagonal (for lme4 it's a multiple of the identity) — in other words, these packages don't do "R-side effects" sensu* SAS.

The main thing is that if you can possibly avoid it you don't want to compute the expression from equation 20 naively/by brute-force. $A$ has dimensions $n \times p$, so $A^\top A$ is a $p \times p$ matrix, where $p$ is the number of latent variables (usually [but not necessarily] $\lt n$, but still large ...)

lme4 does have a quantity $L_\theta$ that might help with this, but I'm not sure (see the lmer vignette/JSS paper). It's also possible that there's some clever linear algebra that makes this easier at least for a sparse matrix (which $A$ / $Z$ usually is ...) (Maybe QR decomposition, since this equation looks a lot like the standard regression equation??)

jmgirard commented 2 months ago

The stepwise procedure on page 489 may be easier to convert to R code (but unsure if this procedure applies generally or just for cross-classified cases)

bbolker commented 2 months ago

The stepwise procedure on page 489 may be easier to convert to R code

Agreed, although it falls more under the category of "implement a new regression approach" than "manipulate covariates to derive centered versions" ...

strengejacke commented 2 months ago

Do I understand right that the centering of variables, which usually is done before model fitting, requires the fitted model in this case?

jmgirard commented 2 months ago

Do I understand right that the centering of variables, which usually is done before model fitting, requires the fitted model in this case?

It seems that way for adaptive centering, though perhaps a simpler approach like that from Guo et al. will be possible.

bbolker commented 2 months ago

Adaptive centering doesn't require the fitted model AFAICS. It's just that the centering procedure requires approximately the same computational effort as fitting the regression itself (see the last page of the article; there is a series of three regression fits, but none of them involve the response variable). Probably out of scope for datawizard though ... ?

bwiernik commented 2 months ago

Oh wow, what a mess. Will take a look over the next few days off coming up