easystats / datawizard

Magic potions to clean and transform your data 🧙
https://easystats.github.io/datawizard/

expand `demean` to `degroup` #24

Open mattansb opened 4 years ago

mattansb commented 4 years ago

Some ideas for expanding the `demean()` function into a more general `degroup()` (or `decenter()`, or ??) function (listed by ease of implementation, as I perceive it):

1. Allow for group-centering around functions other than the mean. Popular choices I've seen: `median()`, `min()`, `max()`; `Mode()` is also popular for categorical predictors (see the sketch after this list):

    ```r
    Mode <- function(x, multimodal = FALSE) {
      uniqv <- unique(x)
      tab <- tabulate(match(x, uniqv))  # count occurrences of each unique value
      if (multimodal) {
        idx <- which(tab == max(tab))   # all values tied for the highest count
      } else {
        idx <- which.max(tab)           # the first most frequent value
      }
      uniqv[idx]
    }
    ```

2. Allow for more than one grouping variable. The order of operations would be: split `y` by `G1`, then split `y_between` by `G2`, etc. (This would need a better naming scheme?)

3. Center around an indexed value. For example, center `y` around `y[time == 0]` or `y[condition == "a"]`. This can be mixed with (1): `max(y[time == 0])`, etc.
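
A rough sketch of how idea (1) could look, using only base R's `ave()`; the function name `degroup_sketch`, its arguments, and the toy data are all hypothetical, not an existing datawizard API:

```r
# Hypothetical sketch: group-center a variable around an arbitrary summary
# function, returning both a "between" and a "within" component.
degroup_sketch <- function(data, var, group, center_fun = mean, ...) {
  x <- data[[var]]
  g <- data[[group]]
  # "between" part: every observation gets its group's summary value
  x_between <- ave(x, g, FUN = function(v) center_fun(v, ...))
  # "within" part: deviation of each observation from that summary
  data[[paste0(var, "_between")]] <- x_between
  data[[paste0(var, "_within")]]  <- x - x_between
  data
}

# e.g., person-median centering (toy data: 10 persons, 5 observations each)
set.seed(1)
d <- data.frame(id = rep(1:10, each = 5), y = rnorm(50))
d <- degroup_sketch(d, "y", "id", center_fun = median)
head(d)
```

Whether the user-facing argument should be a function, a string (`"mean"`, `"median"`, `"mode"`), or something else entirely is of course an open design question.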

strengejacke commented 4 years ago

I think the idea behind centering at the mean is that it removes the correlation between higher- and lower-level predictors in a mixed-models context. So for "demeaning", centering at the mean value would be appropriate. Nonetheless, we could add further options.
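
For what it's worth, here is a minimal simulated example (purely made-up data) of that point: the person-mean ("between") component and the person-mean-centered ("within") component are uncorrelated by construction, which is not guaranteed for other centering values:

```r
set.seed(42)
# 50 persons, 10 observations each; x has a person-level and an
# observation-level component (simulated data)
d <- data.frame(
  id = rep(1:50, each = 10),
  x  = rep(rnorm(50), each = 10) + rnorm(500)
)
d$x_between <- ave(d$x, d$id, FUN = mean)  # person means
d$x_within  <- d$x - d$x_between           # person-mean-centered values

cor(d$x_between, d$x_within)  # essentially 0: the components are orthogonal

# Centering at, say, the person minimum does not have this property:
d$x_min        <- ave(d$x, d$id, FUN = min)
d$x_min_within <- d$x - d$x_min
cor(d$x_min, d$x_min_within)  # generally not 0
```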

mattansb commented 4 years ago

The way I was taught it, person-centering is done to split two effects of X on Y: the stable "trait" part of X and the unstable "situational" part of X. Using the mean as the measure of the "trait" part also has the benefit of decorrelating these parts, but the actual centrality index depends on the researcher's question (e.g., "How do differences in the starting values of x predict y, and how do changes from the starting point predict y?" would use `x[time == 0]` for centering, etc.).
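
To make that baseline-centering example concrete, a hypothetical sketch (toy long-format data with columns `id`, `time`, and `x`; dplyr is used only for the grouped mutate, and nothing here is an existing datawizard feature):

```r
library(dplyr)

# toy data: 20 persons measured at times 0-4 (simulated)
set.seed(7)
d <- data.frame(
  id   = rep(1:20, each = 5),
  time = rep(0:4, times = 20),
  x    = rnorm(100)
)

d <- d %>%
  group_by(id) %>%
  mutate(
    x_baseline = x[time == 0],   # each person's "starting value" (trait-like part)
    x_change   = x - x_baseline  # change from the starting point (state-like part)
  ) %>%
  ungroup()
```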

strengejacke commented 4 years ago

I thought the crucial part is the correlation between level-1 and level-2 predictors, which violates model assumptions? I agree that centering at other sensible values might be better in certain situations, but here the mean was chosen for purely statistical / mathematical reasons? Anyway, we can enhance this method.

mattansb commented 4 years ago

> I thought the crucial part is the correlation between level-1 and level-2 predictors, which violates model assumptions?

Hmmm, which assumption? Plain old multicollinearity? If it's bad, it will hurt the interpretability of the coefficients - so there is a possible trade-off here between interpretability and interpretability 😅 But AFAIK, a little multicollinearity never hurt anyone 😎