easystats / parameters

:bar_chart: Computation and processing of models' parameters
https://easystats.github.io/parameters/
GNU General Public License v3.0

improve SMART parameters standardization #708

Open DominiqueMakowski opened 5 years ago

DominiqueMakowski commented 5 years ago

IMO this is a critical issue: being able to retrieve "refit"-standardized parameters with a post-hoc method.

Improvements are possible, especially for the case of interactions, but a more robust and systematic testing framework might be needed. Also, knowledge of how model matrices are built when factors are involved appears to be key.

strengejacke commented 4 years ago

I have no idea what you're talking about 😆 I could look at the code, but am too lazy... Can you elaborate a bit more, maybe with an example?

DominiqueMakowski commented 4 years ago

Long story short, there are still some cases where the SMART method does not perform well, in comparison to refit and also to "classic". This appears mainly for interaction terms, and in particular for interactions between a continuous and a factor variable.

I have no idea how to improve on that, and I think that to address this we would need someone with a deep understanding of how model matrices are built for interaction terms, and of how we can standardize the interaction term within them so that it reflects the interaction of the two standardized variables...

That's more a long-term issue though, in case we ever meet someone with some understanding of model matrices and formulas...

DominiqueMakowski commented 4 years ago

In a nutshell, the goal here is to reconstruct the standardized model.matrix (the second one below) from the original model matrix and the mean/SD of each variable...

df <- iris
dfZ <- parameters::standardize(iris)

head(model.matrix(~ df$Sepal.Length * df$Species))
#>   (Intercept) df$Sepal.Length df$Speciesversicolor df$Speciesvirginica
#> 1           1             5.1                    0                   0
#> 2           1             4.9                    0                   0
#> 3           1             4.7                    0                   0
#> 4           1             4.6                    0                   0
#> 5           1             5.0                    0                   0
#> 6           1             5.4                    0                   0
#>   df$Sepal.Length:df$Speciesversicolor df$Sepal.Length:df$Speciesvirginica
#> 1                                    0                                   0
#> 2                                    0                                   0
#> 3                                    0                                   0
#> 4                                    0                                   0
#> 5                                    0                                   0
#> 6                                    0                                   0
head(model.matrix(~ dfZ$Sepal.Length * dfZ$Species))
#>   (Intercept) dfZ$Sepal.Length dfZ$Speciesversicolor dfZ$Speciesvirginica
#> 1           1       -0.8976739                     0                    0
#> 2           1       -1.1392005                     0                    0
#> 3           1       -1.3807271                     0                    0
#> 4           1       -1.5014904                     0                    0
#> 5           1       -1.0184372                     0                    0
#> 6           1       -0.5353840                     0                    0
#>   dfZ$Sepal.Length:dfZ$Speciesversicolor
#> 1                                      0
#> 2                                      0
#> 3                                      0
#> 4                                      0
#> 5                                      0
#> 6                                      0
#>   dfZ$Sepal.Length:dfZ$Speciesvirginica
#> 1                                     0
#> 2                                     0
#> 3                                     0
#> 4                                     0
#> 5                                     0
#> 6                                     0

Created on 2019-09-23 by the reprex package (v0.3.0)
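For what it's worth, the reconstruction is doable column-wise for this kind of model — a sketch (plain base R, not the package implementation): the main-effect column is an ordinary z-score of the raw column, and each interaction column can be rebuilt from the raw interaction column together with the corresponding dummy column.

```r
# Sketch (not the package code): rebuilding the standardized model matrix
# column-wise from the raw one plus the mean/SD of the continuous variable.
x <- iris$Sepal.Length
z <- (x - mean(x)) / sd(x)
f <- iris$Species

mm  <- model.matrix(~ x * f)  # raw model matrix
mmZ <- model.matrix(~ z * f)  # standardized model matrix (the target)

# Main-effect column: plain z-scoring of the raw column
all.equal(unname((mm[, "x"] - mean(x)) / sd(x)), unname(mmZ[, "z"]))
#> [1] TRUE

# Interaction column: needs the dummy column too, since
# z * d = (x*d - mean(x)*d) / sd(x)
all.equal(unname((mm[, "x:fversicolor"] - mean(x) * mm[, "fversicolor"]) / sd(x)),
          unname(mmZ[, "z:fversicolor"]))
#> [1] TRUE
```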

DominiqueMakowski commented 4 years ago

I believe one of the reasons for the issues with interactions stems from the fact that, as we know, a regression model "fixes" the other parameters at 0. When a standardized dataset is passed, 0 corresponds to the mean.

Let's say we have the interaction x * y. The coefficient of "x" is the slope of x when y = 0, and it changes as y changes (following the interaction coefficient). Now, if a standardized dataset is passed, it is normal that the "x" parameter is different: it now corresponds to the effect of "x" at the mean of "y" (which is generally not what y = 0 means for unstandardized data).

Hence, my hunch is that post-hoc standardization should somehow take the means of the variables into account in the case of interactions. I have no idea how, though.
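This shift can be checked directly — a sketch using plain `lm()` on `iris` (not package code): centering the predictors moves each "main effect" by the interaction coefficient times the mean of the other variable, while the interaction coefficient itself is unchanged.

```r
# Sketch: centering the predictors shifts the "main effect" coefficients
# by (interaction coefficient) x (mean of the other variable).
m_raw <- lm(Sepal.Width ~ Petal.Width * Sepal.Length, data = iris)

d <- transform(iris,
               Petal.Width  = Petal.Width  - mean(Petal.Width),
               Sepal.Length = Sepal.Length - mean(Sepal.Length))
m_ctr <- lm(Sepal.Width ~ Petal.Width * Sepal.Length, data = d)

b <- coef(m_raw)
# Coefficient of Petal.Width after centering = raw slope of Petal.Width
# evaluated at the mean of Sepal.Length:
all.equal(unname(b["Petal.Width"] +
                 b["Petal.Width:Sepal.Length"] * mean(iris$Sepal.Length)),
          unname(coef(m_ctr)["Petal.Width"]))
#> [1] TRUE

# The interaction coefficient itself is unaffected by centering:
all.equal(unname(b["Petal.Width:Sepal.Length"]),
          unname(coef(m_ctr)["Petal.Width:Sepal.Length"]))
#> [1] TRUE
```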

To facilitate the exploration, I've refactored parameters_standardize and created the standardize_info function, which returns values useful for parameter standardization, such as the deviations of the response and of the variables.

model <- lm(Sepal.Width ~ Petal.Width * Sepal.Length, data = iris)
info <- parameters::standardize_info(model)
info$Refit <- parameters::parameters_standardize(model, method = "refit")[, 2]
info$Raw <- insight::get_parameters(model)[, 2]
info[sapply(info, is.numeric)] <- sapply(info[sapply(info, is.numeric)], round, digits = 1)
info
#>                  Parameter        Type Factor Deviation_Response
#> 1              (Intercept)   intercept   <NA>                0.4
#> 2              Petal.Width     numeric  FALSE                0.4
#> 3             Sepal.Length     numeric  FALSE                0.4
#> 4 Petal.Width:Sepal.Length interaction  FALSE                0.4
#>   Mean_Response Deviation_Classic Mean_Classic Deviation_Smart Mean_Smart
#> 1           3.1               0.0          0.0             0.0        0.0
#> 2           3.1               0.8          1.2             0.8        1.2
#> 3           3.1               0.8          5.8             0.8        5.8
#> 4           3.1               5.3          7.5             0.8        5.8
#>   Refit  Raw
#> 1  -0.2  3.4
#> 2  -0.7 -1.5
#> 3   0.4  0.0
#> 4   0.3  0.2

Created on 2019-10-08 by the reprex package (v0.3.0)

In a nutshell, the problem is to try to FIND the "Refit" column from the "Raw" column using the remaining information...

DominiqueMakowski commented 4 years ago

At the same time, it suggests that the issues with interactions are not real issues: the estimates simply correspond to something different, but they are not wrong per se (I think).

DominiqueMakowski commented 4 years ago

I reckon that's the rationale for partial standardized coefficients (https://www.jstor.org/stable/2684719), but we'd need the VIF for that.

mattansb commented 3 years ago

@DominiqueMakowski Can you explain what exactly the "smart" method is trying to do? It seems to break somewhat when there are formula transformations (log(y) ~ sqrt(x)) and I'd like to try and fix that, but I don't know what I'm actually trying to achieve 😅 (Currently, methods "refit" and "basic" are the most stable, but that is only because I know what they're supposed to return...)

(In cases with transformations it will never be equal to method "refit", because the parameters themselves are estimated differently.)

(As you mention above, this is also the issue with interactions - the centering changes the simple/conditional slope parameters, so it also cannot be the same as with "refit", no?)

So broadly asking, what would method "smart" do to this model:

log(y) ~ sqrt(x) * some_factor

Also, what exactly is the conceptual difference between methods "smart" and "posthoc"?

DominiqueMakowski commented 3 years ago

TLDR;

So broadly asking, what would method "smart" do to this model:

No idea

Longer story:

So, in simpler models, basic and refit are equivalent. But for more complex models (especially with interactions, transformations, etc.), basic starts to depart from refit. IMO, refit gives the "gold standard" results, because it doesn't involve any post-hoc transformation. However, the problem is that "refit" is computationally heavy (especially for Bayesian models). So the goal of "smart" is to be a post-hoc method (one that does not refit the model from scratch but simply transforms the parameters with the info it has) that gives the same results as "refit".

So in simple cases, there should be no difference between the three methods basic, refit and smart. In more complex cases, smart aims to give the same results as "refit".

Basically "smart" is supposed to be "refit" minus the model refitting.
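In the simplest case, that equivalence is just a rescaling — a sketch with plain `lm()` (not the package code): multiplying the raw slope by sd(x)/sd(y) reproduces the refit coefficient exactly.

```r
# Sketch: for a simple model, post-hoc rescaling by sd(x)/sd(y)
# reproduces the "refit" coefficient exactly.
m <- lm(Sepal.Width ~ Petal.Width, data = iris)
b_posthoc <- coef(m)["Petal.Width"] * sd(iris$Petal.Width) / sd(iris$Sepal.Width)

mZ <- lm(scale(Sepal.Width) ~ scale(Petal.Width), data = iris)  # the "refit" way
all.equal(unname(b_posthoc), unname(coef(mZ)[2]))
#> [1] TRUE
```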

So to get back to your initial question of what the result should be in that particular case, the expected result should be the same as the one given by "refit"...


About how it works: basically, standardize() doesn't change factors, which remain 0/1 dummies in the model matrix. So in order to adjust the parameters to mimic the "refit" method, one must know whether a given parameter refers to a continuous variable (in which case the parameter must be scaled by the SDs of both the outcome and the predictor) or to a factor (only the outcome). I reckon there might be other cases where something particular must be done (a reverse transformation when transformations are specified?).
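The factor case can be checked the same way — a sketch with plain `lm()` (not the package code): since the dummies stay 0/1 after standardizing the data, the factor coefficients only need dividing by sd(y).

```r
# Sketch: dummies stay 0/1 under refitting, so factor coefficients are
# only divided by sd(y), not multiplied by any predictor SD.
m  <- lm(Sepal.Length ~ Species, data = iris)
mZ <- lm(scale(Sepal.Length) ~ Species, data = iris)  # the "refit" way

all.equal(unname(coef(m)[-1] / sd(iris$Sepal.Length)), unname(coef(mZ)[-1]))
#> [1] TRUE
```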

DominiqueMakowski commented 3 years ago

But we can move slowly here, it doesn't need to be perfect from scratch: smart can always fall back to "basic" when we don't know how to retrieve the "refit"-like coefs. Basically it's like a "basic+" method, i.e., it's fast and will work in most cases, and in some cases it will just give you the straightforward "basic" standardization.