ASKurz / Statistical_Rethinking_with_brms_ggplot2_and_the_tidyverse_2_ed

The bookdown version lives here:
https://bookdown.org/content/4857/
Creative Commons Zero v1.0 Universal
125 stars 38 forks source link

R2_is_bad() uses internal function var2() and has the residual reversed #30

Closed rpruim closed 2 years ago

rpruim commented 3 years ago

These both match what is in the text, but I think they are bad ideas.

ASKurz commented 3 years ago

The var2() fix is easy. We can just formally introduce McElreath's definition from the rethinking GitHub repo:

var2 <- function( x , na.rm=TRUE ) {
    # use E(x^2) - E(x)^2 form
    mean(x^2) - mean(x)^2
}

Please say more about your second bullet.

rpruim commented 3 years ago

There isn't really a reason to use var2(). It is just dividing by n instead of by n - 1 (hidden in the means); those denominators cancel in either case. This seems like a total distractor to the conversation. One could argue in favor of using sums of squares instead of variance to avoid the denominator, but I don't see any real reason for preferring var2().

The fact that McElreath documents this function as internal and not intended for users suggest that this was just an oversight on his part.

The standard definition of a residual is $y_i - \hat{y}_i$ = observed response - predicted response. Since Var(-X) = Var(X), it doesn't matter for the computation, but it is confusing to students who know how a residual should be defined. I would define the function as


R2_is_bad <- function(quap_fit, seed = 7, ...) {
  set.seed(seed)
  l <- link(quap_fit, refresh = 0, ...)
  r <- Brains$brain_std - apply(l, 2, mean) 
  1 - var(r) / var(Brains$brain_std)
}
ASKurz commented 2 years ago

I'll meet you half way. Since the var2() issue seems like a distraction from the larger issue, I'm just going to leave it in. I do think you've made a good point about using the standard definition of a residual, though. Thus, I'm going to update the formula to

R2_is_bad <- function(brm_fit, seed = 7, ...) {

  set.seed(seed)
  p <- predict(brm_fit, summary = F, ...)
  # in my experience, it's more typical to define residuals as the criterion minus the predictions
  r <- d$brain_std - apply(p, 2, mean)
  1 - rethinking::var2(r) / rethinking::var2(d$brain_std)

}
ASKurz commented 2 years ago

Done

rpruim commented 2 years ago

@ASKurz, I would NOT do this. There is no guarantee that rethinking will continue to export var2() and it is undocumented, except for documentation indicating it is only intended for internal use. The use of :: doesn't really help with either of these problems.

If you want to keep using var2() (which I would also not since var() will work just fine here and avoid all these problems), I would suggest that you define it in your code -- pretending it does not exist in the rethinking package -- so that it is clear what it is and doesn't rely on something not intended for public use. I thought that is what you meant in your post from April 3.

rpruim commented 2 years ago

In your code comment, did you mean "criterion"? I'm not familiar with using that term in this context.

ASKurz commented 2 years ago

Yep. By "criterion," I mean the focal variable in the regression analysis. The DV, y variable, or response variable, if you will.