easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0

Alternative R2? #491

Open mattansb opened 2 years ago

mattansb commented 2 years ago

Problem

R2 is affected by the absence of an intercept:

data <- mtcars |> 
  within({
    M = model.matrix(~1 + hp)
  })

m1 <- lm(mpg ~ 1 + hp, data = data)
m2 <- lm(mpg ~ 0 + M, data = data) # same model?

Same parameters…

parameters::compare_parameters(m1, m2)
#> Warning in mapply(function(.d, .l) {: longer argument not a multiple of length
#> of shorter
#> Warning: Some model terms could not be found in model data.
#>   You probably need to load the data into the environment.
#> Parameter    |                   m1 |                   m2
#> ----------------------------------------------------------
#> (Intercept)  | 30.10 (26.76, 33.44) |                     
#> hp           | -0.07 (-0.09, -0.05) |                     
#> M(Intercept) |                      | 30.10 (26.76, 33.44)
#> Mhp          |                      | -0.07 (-0.09, -0.05)
#> ----------------------------------------------------------
#> Observations |                   32 |                   32

Same dfs…

df.residual(m1)
#> [1] 30
df.residual(m2)
#> [1] 30

Same predictions…

all.equal(predict(m1), predict(m2))
#> [1] TRUE

But…

performance::r2(m1)
#> # R2 for Linear Regression
#>        R2: 0.602
#>   adj. R2: 0.589
performance::r2(m2)
#> # R2 for Linear Regression
#>        R2: 0.968
#>   adj. R2: 0.966

This isn’t an issue in performance itself; base R computes R2 differently depending on whether or not the model has an intercept:

summary(m1)$r.squared
#> [1] 0.6024373
summary(m2)$r.squared
#> [1] 0.9681196

Created on 2022-10-14 by the reprex package (v2.0.1)

Solution

Perhaps we can add an `r2_prediction()` that computes R2 from the model's predictions?

Either as: $R^2 = r_{\hat{y},y}^2$

Or (my preferred method) as $R^2 = 1 - \frac{Var(y - \hat{y})}{Var(y)}$

This would also give correct R2 values for transformed outcomes (using smearing), for non-linear models, etc...
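A rough sketch of the second formula (the function name `r2_prediction()` is hypothetical, not an existing performance API). Because it only uses observed and predicted values, it is invariant to how the intercept is parameterized:

```r
# Hypothetical sketch: R2 computed purely from observed and predicted
# values, so it does not depend on the model's intercept specification.
r2_prediction <- function(model) {
  y <- model.response(model.frame(model))
  y_hat <- predict(model)
  # R2 = 1 - Var(y - y_hat) / Var(y)
  1 - var(y - y_hat) / var(y)
}

m1 <- lm(mpg ~ 1 + hp, data = mtcars)
m2 <- lm(mpg ~ 0 + model.matrix(~ 1 + hp, data = mtcars), data = mtcars)

r2_prediction(m1)
#> [1] 0.6024373
r2_prediction(m2)
#> [1] 0.6024373
```

Both fits now yield the same value, matching `summary(m1)$r.squared` from the reprex above.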

strengejacke commented 2 years ago

Can you check what r2.default() returns?

bwiernik commented 2 years ago

R is (correctly) adjusting its computation of the predicted sum of squares based on whether there's an intercept or not.

We shouldn't expect these two results to be the same because your model has an intercept in effect even though you told R it didn't.
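A quick check makes this concrete: the design matrix `M` from the reprex carries its own constant column, so `m2` is intercept-free in name only.

```r
# The design matrix handed to m2 contains an all-ones intercept column:
M <- model.matrix(~ 1 + hp, data = mtcars)
colnames(M)
#> [1] "(Intercept)" "hp"
all(M[, 1] == 1)
#> [1] TRUE
```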


# Relevant excerpt from stats:::summary.lm, where the R2 computation
# branches on the model's intercept attribute:
r <- z$residuals
f <- z$fitted.values

mss <- if (attr(z$terms, "intercept")) sum((f - mean(f))^2) else sum(f^2)
rss <- sum(r^2)

if (p != attr(z$terms, "intercept")) {
    df.int <- if (attr(z$terms, "intercept")) 1L else 0L
    ans$r.squared <- mss / (mss + rss)
    ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int) / rdf)
    ans$fstatistic <- c(value = (mss / (p - df.int)) / resvar,
                        numdf = p - df.int, dendf = rdf)
} else ans$r.squared <- ans$adj.r.squared <- 0
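In the no-intercept branch, `mss + rss` equals the uncentered sum of squares `sum(y^2)` (OLS fitted values and residuals are orthogonal), so the 0.968 value from the reprex can be reproduced by hand. A sketch:

```r
# Reproduce summary()'s no-intercept R2 via uncentered sums of squares:
# R2 = mss / (mss + rss) = 1 - rss / sum(y^2)
m2 <- lm(mpg ~ 0 + model.matrix(~ 1 + hp, data = mtcars), data = mtcars)
y <- mtcars$mpg
r2_uncentered <- 1 - sum(resid(m2)^2) / sum(y^2)
all.equal(r2_uncentered, summary(m2)$r.squared)
#> [1] TRUE
```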
mattansb commented 2 years ago

What is the justification for this formula, @bwiernik? I haven't come across it before...

bwiernik commented 2 years ago

See this great StackExchange answer https://stats.stackexchange.com/a/26205/364001