easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0
965 stars 87 forks source link

Revise compare_models() for Bayesian models #716

Open DominiqueMakowski opened 2 months ago

DominiqueMakowski commented 2 months ago

The current output would benefit from some streamlining.

> performance::compare_performance(model, model2, model3)
# Comparison of Model Performance Indices

Name   |   Model |     ELPD | ELPD_SE | LOOIC (weights) | LOOIC_SE | WAIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
-----------------------------------------------------------------------------------------------------------------------
model  | brmsfit | -104.387 |   9.532 |   208.8 (0.092) |   19.063 |  208.8 (<.001) | 0.927 |     0.926 | 0.475 | 0.487
model2 | brmsfit |  -89.212 |  10.543 |   178.4 (<.001) |   21.085 |  178.4 (<.001) | 0.941 |     0.940 | 0.425 | 0.467
model3 | brmsfit |  -64.580 |  11.569 |   129.2 (0.908) |   23.137 |  129.1 (>.999) | 0.958 |     0.957 | 0.353 | 0.363

Suggestions:

  1. Remove RMSE and Sigma by default (not exactly typically useful or what you would expect when comparing models)
  2. Move R2 and R2 adj. first, after the model column
  3. My current understanding is that LOO and WAIC are both methods to estimate ELPD, which can be used to compare models (in particular, by looking at the ELPD-diff (see https://github.com/easystats/report/pull/419 for the report support recently added). As such, in the interest of computation, I would display directly the ELPD_DIFF + (SE), by default computed using LOO (although we could add the option for WAIC for faster computation), and that's it, drop the LOOIC, WAIC and the raw ELPD which are redundant indices and simply add noise.

What do you think?

bwiernik commented 2 months ago

I agree other than that I think we should keep LOOIC+SE (rather than ELPD). LOOIC is -2*ELPD, which puts it in the deviance metric that is also used in frequentist models with AIC, etc, and this allows us to show the stacking weights with our existing formatting functions

DominiqueMakowski commented 2 months ago

Fair So we'd go with ELPD_DIFF (SE); and LOOIC (weights) +SE? Or just the latter? is the SE for LOOIC super necessary since we have the weights?

Should we also add some details in the documentation about how to interpret the weights, because I've been asked by students and I wasn't too clear in my answer... any rules of thumb?

DominiqueMakowski commented 2 months ago

Moreover, in https://github.com/easystats/report/pull/419, we compute the z-transformed ELPD diff (i.e., normalized by SE). But I think we might as well directly convert it to a p-value given that its asymptotic distribution, it would be the most easy and "intuitive" to read (rather than the z-value). ~In order to avoid duplicated code, I suggest we add a parameters::parameters() method for brms::loo_compare() outputs that would simply format and compute the columns we need, we could then use that in report as well as here. I'll open a PR~

EDIT: forget what I said about the duplicated code it wouldn't work

DominiqueMakowski commented 2 months ago

Here is a skeleton, maybe it would make more sense to add it as a method of test_performance() rather than compare_performance() (?)

.test_performance_bayesian <- function(objects, criterion="loo") {

  # Compute differences ----
  if (criterion %in% c("loo", "looic")) {
    diff <- brms::loo_compare(lapply(objects, loo::loo))
    col <- "looic"
  } else if (criterion == "waic") {
    diff <- brms::loo_compare(lapply(objects, loo::waic))
    col <- "waic"
  } else {
    stop("Criterion not supported")
  }

  # Format ----
  out <- as.data.frame(diff)
  out$Name <- rownames(out)
  # Select columns
  out <- out[, c("Name", col, "elpd_diff", "se_diff")]

  # Compute p-value ----
  # ELPD difference is asympatotically normal, hence the z-score is computed
  z <- out$elpd_diff[-1] / out$se_diff[-1]
  out$p <- c(NA, 2 * pnorm(-abs(z)))

  # Return
  out
}

m1 <- brms::brm(wt ~ mpg, data=mtcars, refresh=0)
#> Compiling Stan program...
#> Start sampling
m2 <- brms::brm(wt ~ mpg + hp, data=mtcars, refresh=0)
#> Compiling Stan program...
#> Start sampling
m3 <- brms::brm(wt ~ mpg + hp + qsec, data=mtcars, refresh=0)
#> Compiling Stan program...
#> Start sampling
objects <- insight::ellipsis_info(m1, m2, m3, only_models = TRUE)

.test_performance_bayesian(objects)
#> Warning: Found 1 observations with a pareto_k > 0.7 in model 'X[[i]]'. We
#> recommend to set 'moment_match = TRUE' in order to perform moment matching for
#> problematic observations.

#> Warning: Found 1 observations with a pareto_k > 0.7 in model 'X[[i]]'. We
#> recommend to set 'moment_match = TRUE' in order to perform moment matching for
#> problematic observations.
#>    Name    looic elpd_diff  se_diff          p
#> m3   m3 45.99388  0.000000 0.000000         NA
#> m1   m1 50.78136 -2.393738 2.318327 0.30182476
#> m2   m2 53.35469 -3.680406 2.208913 0.09568128

Created on 2024-04-23 with reprex v2.0.2

bwiernik commented 2 months ago

I suggest we do LOOIC (with weights) and SE. If folks want ELPD or WAIC, they can request those instead.

The SE is useful in the same way as for ELPD, whereas the weights are more like a transformed point estimate.