Closed jmgirard closed 2 months ago
library(datawizard)
df <- palmerpenguins::penguins
x2a <-
df |>
data_group(species) |>
data_modify(
body_mass_g_species = mean(body_mass_g, na.rm = TRUE)
) |>
data_ungroup() |>
data_group(island) |>
data_modify(
body_mass_g_island = mean(body_mass_g, na.rm = TRUE)
) |>
data_ungroup() |>
data_modify(
body_mass_g_within = body_mass_g - body_mass_g_species - body_mass_g_island
)
out <- degroup(
df,
select = "body_mass_g",
by = c("species", "island"),
suffix_demean = "_within"
)
all.equal(
out$body_mass_g_species_between,
x2a$body_mass_g_species,
check.attributes = FALSE
)
#> [1] TRUE
all.equal(
out$body_mass_g_within,
x2a$body_mass_g_within,
check.attributes = FALSE
)
#> [1] TRUE
x2a <-
df |>
data_group(species) |>
data_modify(
body_mass_g_species = mean(body_mass_g, na.rm = TRUE),
bill_length_mm_species = mean(bill_length_mm, na.rm = TRUE)
) |>
data_ungroup() |>
data_group(island) |>
data_modify(
body_mass_g_island = mean(body_mass_g, na.rm = TRUE),
bill_length_mm_island = mean(bill_length_mm, na.rm = TRUE)
) |>
data_ungroup() |>
data_modify(
body_mass_g_within = body_mass_g - body_mass_g_species - body_mass_g_island,
bill_length_mm_within = bill_length_mm - bill_length_mm_species - bill_length_mm_island
)
out <- degroup(
df,
select = c("body_mass_g", "bill_length_mm"),
by = c("species", "island"),
suffix_demean = "_within"
)
all.equal(
out$body_mass_g_species_between,
x2a$body_mass_g_species,
check.attributes = FALSE
)
#> [1] TRUE
all.equal(
out$bill_length_mm_island_between,
x2a$bill_length_mm_island,
check.attributes = FALSE
)
#> [1] TRUE
all.equal(
out$bill_length_mm_within,
x2a$bill_length_mm_within,
check.attributes = FALSE
)
#> [1] TRUE
all.equal(
out$body_mass_g_within,
x2a$body_mass_g_within,
check.attributes = FALSE
)
#> [1] TRUE
Created on 2024-06-27 with reprex v2.1.0
Do you know how to do this for three levels? For now, when by
> 2, I just subtract all group means from the individual values.
@jmgirard This seems to be appropriate for cross-classified designs, but what about nested 3-level models? In this case, by
could indicate the groups/clusters on level 2 and level 3 - but I think we then need a different way of group means, maybe like an "interaction" of the by
variable? WDYT?
If you cross-classify by more than two clusters, then you just keep subtracting the cluster means. But you're right that things may be different for nested designs like a three-level model or a hybrid model with both nesting and crossing. I'll look into this a bit more.
Here's a quick comparison:
df <- palmerpenguins::penguins
library(easystats)
#> # Attaching packages: easystats 0.7.2.2
#> ✔ bayestestR 0.13.2.2 ✔ correlation 0.8.5
#> ✔ datawizard 0.11.0.5 ✔ effectsize 0.8.8.2
#> ✔ insight 0.20.1.10 ✔ modelbased 0.8.8
#> ✔ performance 0.12.0.4 ✔ parameters 0.22.0.1
#> ✔ report 0.5.8.4 ✔ see 0.8.4.6
# group by species
out <- degroup(
df,
select = "body_mass_g",
by = "species",
suffix_demean = "_within"
)
head(out)
#> body_mass_g_between body_mass_g_within
#> 1 3700.662 49.33775
#> 2 3700.662 99.33775
#> 3 3700.662 -450.66225
#> 4 3700.662 NA
#> 5 3700.662 -250.66225
#> 6 3700.662 -50.66225
# group by year
out <- degroup(
df,
select = "body_mass_g",
by = "year",
suffix_demean = "_within"
)
head(out)
#> body_mass_g_between body_mass_g_within
#> 1 4124.541 -374.5413
#> 2 4124.541 -324.5413
#> 3 4124.541 -874.5413
#> 4 4124.541 NA
#> 5 4124.541 -674.5413
#> 6 4124.541 -474.5413
# cross-classified: large within means?
out <- degroup(
df,
select = "body_mass_g",
by = c("species", "year"),
suffix_demean = "_within"
)
head(out)
#> body_mass_g_species_between body_mass_g_year_between body_mass_g_within
#> 1 3700.662 4124.541 -4075.204
#> 2 3700.662 4124.541 -4025.204
#> 3 3700.662 4124.541 -4575.204
#> 4 3700.662 4124.541 NA
#> 5 3700.662 4124.541 -4375.204
#> 6 3700.662 4124.541 -4175.204
# nested: looks ok
out <- degroup(
df,
select = "body_mass_g",
by = "species*year",
suffix_demean = "_within"
)
#> Warning: Note that de-meaning for nested data structures (i.e. interaction terms
#> for the `by`-variable) is not yet validated. Please check the results.
head(out)
#> body_mass_g_between body_mass_g_within
#> 1 3696.429 53.57143
#> 2 3696.429 103.57143
#> 3 3696.429 -446.42857
#> 4 3696.429 NA
#> 5 3696.429 -246.42857
#> 6 3696.429 -46.42857
Created on 2024-06-27 with reprex v2.1.0
This seems to be the reference for three-level and hybrid models:
Raudenbush, S. W. (2009). Adaptive Centering with Random Effects: An Alternative to the Fixed Effects Model for Studying Time-Varying Treatments in School Settings. Education Finance and Policy, 4(4), 468–491. https://doi.org/10.1162/edfp.2009.4.4.468
See equation (20) on page 484: $$X^ = X - A(A^{T} V^{ -1} A)^{-1} A^T V^{* -1} X$$
I wish people would write R code instead of formulas...
A is composed of elements 1 or 0 such that each element of the random effect vector b is assigned the correct unit;
V∗ are positive definite covariance matrices, considered known.
I guess V* is just vcov()
, but how does A look like, or A(T)?
This is usually the point where I call @bwiernik for help (or maybe @bbolker has time to help out?)... How to translate the formula in https://github.com/easystats/datawizard/issues/520#issuecomment-2195085052 into R code?
I agree. Honestly I read that adaptive centering paper a long time ago but never really understood it well enough to implement it. This is why I was excited by the new Guo et al. paper.
$A$ is what we (in the R/lme4
corner of the world) call the $Z$ matrix, at least in the random-intercept case — an indicator matrix (it wouldn't extend to the random-slopes model, though).
From the paper
$$ Y = 1 \theta + \tilde X \beta + A b + e, \quad b \sim N(0, \Omega), \quad e \sim N(0, V^*) $$
so $A$ is indeed our " $Z$ matrix" (this is the general form of the equation, I don't know if the machinery in the paper supports $Z$ matrices that are more complex than indicator matrices? $V^$ is the covariance matrix of the residuals, which at least for lme4
and glmmTMB
is diagonal (for lme4
it's a multiple of the identity) — in other words, these packages don't do "R-side effects" sensu* SAS.
The main thing is that if you can possibly avoid it you don't want to compute the expression from equation 20 naively/by brute-force. $A$ has dimensions $n \times p$, so $A^\top A$ is a $p \times p$ matrix, where $p$ is the number of latent variables (usually [but not necessarily] $\lt n$, but still large ...)
lme4
does have a quantity $L_\theta$ that might help with this, but I'm not sure (see the lmer
vignette/JSS paper). It's also possible that there's some clever linear algebra that makes this easier at least for a sparse matrix (which $A$ / $Z$ usually is ...) (Maybe QR decomposition, since this equation looks a lot like the standard regression equation??)
The stepwise procedure on page 489 may be easier to convert to R code (but unsure if this procedure applies generally or just for cross-classified cases)
The stepwise procedure on page 489 may be easier to convert to R code
Agreed, although it falls more under the category of "implement a new regression approach" than "manipulate covariates to derive centered versions" ...
Do I understand right that the centering of variables, which usually is done before model fitting, requires the fitted model in this case?
Do I understand right that the centering of variables, which usually is done before model fitting, requires the fitted model in this case?
It seems that way for adaptive centering, though perhaps a simpler approach like that from Guo et al. will be possible.
Adaptive centering doesn't require the fitted model AFAICS. It's just that the centering procedure requires approximately the same computational effort as fitting the regression itself (see the last page of the article; there is a series of three regression fits, but none of them involve the response variable). Probably out of scope for datawizard
though ... ?
Oh wow, what a mess. Will take a look over the next few days off coming up
There's a nice new paper on disaggregating predictors in cross-classified multilevel models (see below). Below, I show the way this could be done manually using datawizard functions. Would be great to support this with
degroup()
.$$ \begin{align} \text{Between } J &= \bar x_j \ \text{Between } K &= \bar xk \ \text{Within} &= x{ijk} - \bar x_j - \bar x_k \ \end{align} $$
Guo, Y., Dhaliwal, J., & Rights, J. D. (2024). Disaggregating level-specific effects in cross-classified multilevel models. Behavior Research Methods, 56(4), 3023–3057. https://doi.org/10.3758/s13428-023-02238-7
Setup
``` r df <- palmerpenguins::penguins library(easystats) #> # Attaching packages: easystats 0.7.2 #> ✔ bayestestR 0.13.2 ✔ correlation 0.8.5 #> ✔ datawizard 0.11.0 ✔ effectsize 0.8.8 #> ✔ insight 0.20.1 ✔ modelbased 0.8.8 #> ✔ performance 0.12.0 ✔ parameters 0.22.0 #> ✔ report 0.5.8 ✔ see 0.8.4 ```Two-level nested example (good)
``` r # Two-level nested example ## Using data_group and data_modify x1a <- df |> data_group(species) |> data_modify( body_mass_g_between = mean(body_mass_g, na.rm = TRUE) ) |> data_ungroup() |> data_modify( body_mass_g_within = body_mass_g - body_mass_g_between ) head(x1a) #> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g #> 1 Adelie Torgersen 39.1 18.7 181 3750 #> 2 Adelie Torgersen 39.5 17.4 186 3800 #> 3 Adelie Torgersen 40.3 18.0 195 3250 #> 4 Adelie Torgersen NA NA NA NA #> 5 Adelie Torgersen 36.7 19.3 193 3450 #> 6 Adelie Torgersen 39.3 20.6 190 3650 #> sex year body_mass_g_between body_mass_g_within #> 1 male 2007 3700.662 49.33775 #> 2 female 2007 3700.662 99.33775 #> 3 female 2007 3700.662 -450.66225 #> 4Cross-classified example (request)
``` r # Cross-classified example ## Using data_group and data_modify x2a <- df |> data_group(species) |> data_modify( body_mass_g_species = mean(body_mass_g, na.rm = TRUE) ) |> data_group(island) |> data_modify( body_mass_g_island = mean(body_mass_g, na.rm = TRUE) ) |> data_ungroup() |> data_modify( body_mass_g_within = body_mass_g - body_mass_g_species - body_mass_g_island ) head(x2a) #> species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g #> 1 Adelie Torgersen 39.1 18.7 181 3750 #> 2 Adelie Torgersen 39.5 17.4 186 3800 #> 3 Adelie Torgersen 40.3 18.0 195 3250 #> 4 Adelie Torgersen NA NA NA NA #> 5 Adelie Torgersen 36.7 19.3 193 3450 #> 6 Adelie Torgersen 39.3 20.6 190 3650 #> sex year body_mass_g_species body_mass_g_island body_mass_g_within #> 1 male 2007 3700.662 3706.373 -3657.035 #> 2 female 2007 3700.662 3706.373 -3607.035 #> 3 female 2007 3700.662 3706.373 -4157.035 #> 4