easystats / datawizard

Magic potions to clean and transform your data 🧙
https://easystats.github.io/datawizard/

standardize with refit without centering #164

Open alaindanet opened 2 years ago

alaindanet commented 2 years ago

Thank you so much for the package!

I would like to know whether an option could be added to standardize without centering when using the refit method.

The rationale is that negative and positive values of some predictors can carry distinct meanings (e.g. a difference in price over a period). In that case, it is valuable to only scale the variable and not center it, as suggested by Andrew Gelman (Gelman, 2008; actually cited in the documentation of the standardize function):

  1. subtracting the mean of each input variable and dividing by its standard deviation. (Strictly speaking, subtracting the mean is not necessary, but this step allows main effects to be more easily interpreted in the presence of interactions.)

We also center each input variable to have a mean of zero so that interactions are more interpretable. Again, in some applications it can make sense for variables to be centered around some particular baseline value, but we believe our automatic procedure is better than the current default of using whatever value happens to be zero on the scale of the data, which all too commonly results in absurdities such as age = 0 years or party identification = 0 on a 1–7 scale.

In the case where negative and positive values of predictor variables have different meanings, I believe that centering can change the meaning of the regression coefficients.

I realized this in my own data analysis, where a positive coefficient became negative after centering, with the type of explanatory variable that I mentioned above.

mattansb commented 2 years ago

Do you have an example (reprex) --without an interaction-- where centering vs non-centering changes the signs on the coefficients (other than the intercept)? It really shouldn't...

strengejacke commented 2 years ago

Maybe standardizing the data before fitting the model can help. You have options to control the reference for centering and dispersion: https://easystats.github.io/datawizard/reference/standardize.html

alaindanet commented 2 years ago

@mattansb It is a two-way interaction that displays the change of sign. It happens when the predictors are log2 ratios over temporal data, i.e. log2(x1/x0). I do not have much time right now to reproduce this, but it took me a long time to figure out why the results would change.

@strengejacke I agree.

I was thinking about an option in the compare_models() function, which I was using a bit too automatically.

It is just that this discrepancy made me realize that I should really think about what I am doing when standardizing variables. As highlighted in my first post, centering may be misleading in some cases.

To elaborate a bit, when standardizing coefficients, we think about the formula $$r_\delta = \beta \dfrac{\sigma_x}{\sigma_y}$$. Standardization with the "refit" option both centers and scales the variables, which is not so clear right now from the compare_models() function. I guess that some users (including me) are not thinking about centering.

It is not a big deal, but I wanted to raise this issue to see if a sentence could be added to the documentation of compare_models(), or an option to specify whether variables should be centered or not, or at least a reference to Gelman regarding the difference between scaling and centering/scaling.

That said, thank you again for your great package.
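As a minimal sketch of the formula above, the following base R snippet compares the post-hoc standardized slope with a refit on centered-and-scaled data and a refit on scale-only data. The simulated data and the helper names `z_score()` and `scale_only()` are illustrative assumptions, not part of datawizard. In a model without interactions, all three slopes should agree, because centering only shifts the intercept.

```r
# Simulated data; variable names and values are illustrative only.
set.seed(1)
n <- 200
x <- rnorm(n, mean = 3, sd = 2)
y <- 1 + 0.5 * x + rnorm(n)

# Post-hoc standardization: beta * sd(x) / sd(y)
beta <- coef(lm(y ~ x))[["x"]]
posthoc <- beta * sd(x) / sd(y)

# Hypothetical helpers: center-and-scale versus scale-only
z_score <- function(v) (v - mean(v)) / sd(v)
scale_only <- function(v) v / sd(v)

# Refit on transformed data; element 2 of coef() is the slope
refit_centered <- coef(lm(z_score(y) ~ z_score(x)))[[2]]
refit_scaled <- coef(lm(scale_only(y) ~ scale_only(x)))[[2]]

# Both comparisons check equality up to numerical tolerance
all.equal(posthoc, refit_centered)
all.equal(posthoc, refit_scaled)
```

The two approaches only diverge once the model contains an interaction term.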

mattansb commented 2 years ago

Indeed, if there is an interaction, the simple slopes will change after centering - this is usually something people want (to have the simple slopes represent "main effects").
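To spell out why the simple slopes move, here is a short derivation for a two-predictor model with an interaction. The symbols $x$, $z$, $\bar{x}$, $\bar{z}$ and the centered variables $x_c = x - \bar{x}$, $z_c = z - \bar{z}$ are notation introduced here for illustration. Starting from

$$y = \beta_0 + \beta_1 x + \beta_2 z + \beta_3 x z$$

and substituting $x = x_c + \bar{x}$ and $z = z_c + \bar{z}$ gives

$$y = (\beta_0 + \beta_1 \bar{x} + \beta_2 \bar{z} + \beta_3 \bar{x} \bar{z}) + (\beta_1 + \beta_3 \bar{z})\, x_c + (\beta_2 + \beta_3 \bar{x})\, z_c + \beta_3 x_c z_c$$

The interaction coefficient $\beta_3$ is unchanged, but the coefficient on $x_c$ is $\beta_1 + \beta_3 \bar{z}$, the slope of $x$ evaluated at $z = \bar{z}$ rather than at $z = 0$. When $\beta_3 \bar{z}$ is larger in magnitude than $\beta_1$ and of opposite sign, the reported coefficient changes sign even though the fitted model is the same.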

As @strengejacke pointed out, if you want more fine-grained control, you can standardize each variable as you see fit manually, prior to model fitting.

Seeing how the back-end function (datawizard::standardize.data.frame()) is set up, I don't see the suggested functionality being added right now.
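A rough sketch of that manual route, using base R only so that no package arguments are assumed: the data are simulated, and `scale_only()` and `z_score()` are hypothetical helpers. It fits the same interaction model on scale-only data and on centered-and-scaled data so the simple slopes can be compared.

```r
# Simulated data where zero is meaningful for x (e.g. "no change").
set.seed(42)
n <- 500
x <- rnorm(n, mean = 0.5)
z <- rnorm(n, mean = 2)
y <- 0.5 * x + 0.3 * z - 0.5 * x * z + rnorm(n, sd = 0.5)
d <- data.frame(y, x, z)

# Hypothetical helpers (not datawizard functions)
scale_only <- function(v) v / sd(v)               # keeps the original zero
z_score <- function(v) (v - mean(v)) / sd(v)      # moves zero to the mean

d_scaled <- as.data.frame(lapply(d, scale_only))
d_centered <- as.data.frame(lapply(d, z_score))

fit_scaled <- lm(y ~ x * z, data = d_scaled)
fit_centered <- lm(y ~ x * z, data = d_centered)

# Simple slope of x at z = 0 versus at z = mean(z)
coef(fit_scaled)[["x"]]
coef(fit_centered)[["x"]]

# The interaction coefficient is the same in both fits
coef(fit_scaled)[["x:z"]]
coef(fit_centered)[["x:z"]]
```

In this simulation the coefficient on `x` is positive in the scale-only fit and negative in the centered fit, while the interaction term is identical, which matches the kind of discrepancy described earlier in the thread.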

alaindanet commented 2 years ago

Well @mattansb, I agree with you in the case where the 0 value of your predictor gives little insight, such as age in an epidemiological study of an adult population. But in the case where the 0 value is of interest, like a variable describing changes in weight, it sounds more relevant not to center the variable on the mean change in weight.

Here again, I quote Gelman (2008):

We also center each input variable to have a mean of zero so that interactions are more interpretable. Again, in some applications it can make sense for variables to be centered around some particular baseline value, but we believe our automatic procedure is better than the current default of using whatever value happens to be zero on the scale of the data, which all too commonly results in absurdities such as age = 0 years or party identification = 0 on a 1–7 scale. Even with such scaling, the correct interpretation of the model can be untangled from the regression by pulling out the right combination of coefficients (for example, evaluating interactions at different plausible values of age such as 20, 40, and 60); the advantage of our procedure is that the default outputs in the regression table can be compared and understood in a consistent way.

It is fine that this is low priority! At least, people who have questions about centering/scaling may end up here and read Gelman (2008).

Thank you so much!

mattansb commented 2 years ago

@alaindanet I am aware of these points, even though they are less commonly applied - yes, ideally people would understand their scales and units of measure and would center (or not) variables around sensible values that are derived from domain-specific knowledge. But if this is the case, effectsize::standardize() wouldn't be used anyway 😉