Open jmgirard opened 2 weeks ago
Have you checked with car::VIF()
?
car returns an error:
x <- runif(n = 100, min = 100, max = 110)
y <- 0.2 + 1.2 * x + rnorm(100, 0, 1)
df <- data.frame(x, y)
fit <- lm(y ~ poly(x, degree = 2), data = df)
performance::check_collinearity(fit)
#> Not enough model terms in the conditional part of the model to check for
#> multicollinearity.
#> NULL
car::vif(fit)
#> Error in vif.default(fit): model contains fewer than 2 terms
Created on 2024-06-12 with reprex v2.1.0
When other variables are included, there is no error but the orthogonal polynomials are considered a single term. Is that correct?
I think that makes sense - there is only 1 underlying variable, and adding orthogonal functions of the variable won't have any collinearity between them.
When other variables are included, there is no error but the orthogonal polynomials are considered a single term. Is that correct?
I think that makes sense - there is only 1 underlying variable, and adding orthogonal functions of the variable won't have any collinearity between them.
They won't have collinearity and they are powered by a single variable, but technically won't there be one term per degree? And in some weird world, couldn't it be possible for the different terms to have different VIFs in a model with additional variables?
Also, note that this same behavior occurs even when using raw polynomials, which definitely could be collinear:
x <- runif(n = 100, min = 100, max = 110)
y <- 0.2 + 1.2 * x + rnorm(100, 0, 1)
df <- data.frame(x, y)
fit <- lm(y ~ poly(x, degree = 2, raw = TRUE), data = df)
performance::check_collinearity(fit)
#> Not enough model terms in the conditional part of the model to check for
#> multicollinearity.
#> NULL
car::vif(fit)
#> Error in vif.default(fit): model contains fewer than 2 terms
Created on 2024-06-12 with reprex v2.1.0
poly is considered one term - like factors or matrices:
data("mtcars")
mod <- lm(mpg ~ cbind(disp, hp) + poly(qsec, 2) + factor(cyl),
data = mtcars)
insight::find_terms(mod)
#> $response
#> [1] "mpg"
#>
#> $conditional
#> [1] "cbind(disp, hp)" "poly(qsec, 2)" "factor(cyl)"
anova(mod)
#> Analysis of Variance Table
#>
#> Response: mpg
#> Df Sum Sq Mean Sq F value Pr(>F)
#> cbind(disp, hp) 2 842.55 421.28 51.7367 1.3e-09 ***
#> poly(qsec, 2) 2 11.95 5.97 0.7337 0.49020
#> factor(cyl) 2 67.98 33.99 4.1742 0.02728 *
#> Residuals 25 203.57 8.14
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
performance::check_collinearity(mod)
#> # Check for Multicollinearity
#>
#> Low Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> poly(qsec, 2) 4.39 [ 2.91, 6.99] 2.09 0.23 [0.14, 0.34]
#>
#> Moderate Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> factor(cyl) 9.45 [ 5.99, 15.31] 3.07 0.11 [0.07, 0.17]
#>
#> High Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> cbind(disp, hp) 23.03 [14.25, 37.63] 4.80 0.04 [0.03, 0.07]
Created on 2024-06-13 with reprex v2.1.0
If you want each feature to be a term, you will need to hand code them.
mod <- lm(mpg ~ disp + hp + qsec + I(qsec^2) + I(cyl==6) + I(cyl==8),
data = mtcars)
insight::find_terms(mod)
#> $response
#> [1] "mpg"
#>
#> $conditional
#> [1] "disp" "hp" "qsec" "I(qsec^2)" "I(cyl == 6)"
#> [6] "I(cyl == 8)"
anova(mod)
#> Analysis of Variance Table
#>
#> Response: mpg
#> Df Sum Sq Mean Sq F value Pr(>F)
#> disp 1 808.89 808.89 99.3390 3.428e-10 ***
#> hp 1 33.67 33.67 4.1344 0.05277 .
#> qsec 1 6.71 6.71 0.8235 0.37282
#> I(qsec^2) 1 5.24 5.24 0.6438 0.42990
#> I(cyl == 6) 1 59.13 59.13 7.2616 0.01241 *
#> I(cyl == 8) 1 8.85 8.85 1.0867 0.30718
#> Residuals 25 203.57 8.14
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
performance::check_collinearity(mod)
#> # Check for Multicollinearity
#>
#> Low Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> I(cyl == 6) 2.09 [ 1.53, 3.25] 1.45 0.48 [0.31, 0.65]
#>
#> Moderate Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> disp 7.88 [ 5.03, 12.72] 2.81 0.13 [0.08, 0.20]
#> hp 9.03 [ 5.74, 14.62] 3.01 0.11 [0.07, 0.17]
#>
#> High Correlation
#>
#> Term VIF VIF 95% CI Increased SE Tolerance Tolerance 95% CI
#> qsec 348.67 [212.31, 573.00] 18.67 2.87e-03 [0.00, 0.00]
#> I(qsec^2) 320.53 [195.20, 526.75] 17.90 3.12e-03 [0.00, 0.01]
#> I(cyl == 8) 11.38 [ 7.17, 18.48] 3.37 0.09 [0.05, 0.14]
Created on 2024-06-13 with reprex v2.1.0
If I modified the function to do the following, would you accept the PR?
I()
and output collinearityI don't think it makes sense to treat polynomials as being more than one variable/term. The collinearity of polynomial terms is non-essential, so VIF diagnostics don't make much sense for them. Collinearity isn't diagnostic of model issues for polynomial terms (raw or not)--the use of orthogonal polynomials is for computational reasons, not model validity reasons.
That makes sense to me. What about just a message when giving it only a poly term that explains this?
For example:
check_collinearity(lm(y ~ poly(x, 2), data = df))
#> Message: Polynomial transformations are considered a single term and thus VIFs are not calculated between them.
I think a note in the details section of the function documentation should suffice to avoid a very verbose messaging
Related to https://github.com/easystats/see/issues/346
I know that orthogonal polynomials won't be collinear (that's their point). But I want to be able to show that while teaching.
Created on 2024-06-12 with reprex v2.1.0
When other variables are included, there is no error but the orthogonal polynomials are considered a single term. Is that correct?
Created on 2024-06-12 with reprex v2.1.0