easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0
1.01k stars · 87 forks

ICC for `beta_family` not accurate #742

Closed: roaldarbol closed this issue 2 months ago

roaldarbol commented 3 months ago

The ICC generated by icc() for a glmmTMB model with a beta_family() family now seems suspiciously low, compared to both the variance decomposition of a similar Bayesian model and the same model specified with a gaussian() family (the Bayesian and Gaussian estimates are in close agreement).

So this is kind of a continuation of https://github.com/easystats/insight/issues/664, but I decided to create a new issue rather than re-open it, as there seems to be an extra problem with the bootstrapping in icc(). Maybe the new {insight} developments have not made their way into the bootstrapping, I don't know. Currently, the estimate is very low while the CI is quite high, so the estimate falls well outside the CI.

The reprex is run against the latest commit to main in {performance}.

# Versions from latest commit to {performance}
packageVersion("insight")
#> [1] '0.20.1.11'
packageVersion("performance")
#> [1] '0.12.0.4'

Here's a reprex:

library(tibble)
library(glmmTMB)
#> Warning in checkDepPackageVersion(dep_pkg = "TMB"): Package version inconsistency detected.
#> glmmTMB was built with TMB version 1.9.11
#> Current TMB version is 1.9.14
#> Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.21.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
#> 
#> Attaching package: 'brms'
#> The following object is masked from 'package:glmmTMB':
#> 
#>     lognormal
#> The following object is masked from 'package:stats':
#> 
#>     ar
library(performance)
library(ggplot2)
library(paletteer)
library(forcats)
df_ratio_episode <- tibble::tibble(
  animal_id = factor(
    rep(
      c(
        "208", "209", "210", "214", "223", "228", "211", "213", "217", "222", "234",
        "241", "216", "230", "231", "240", "242", "244", "218", "220", "225", "237",
        "239", "219", "251", "252", "253", "254"
      ),
      each = 2L
    ),
    levels = c(
      "200", "204", "205", "206", "215", "224", "208", "209", "210", "214", "223",
      "228", "211", "213", "217", "222", "234", "241", "216", "230", "231", "240",
      "242", "244", "218", "220", "225", "237", "239", "245", "219", "236", "251",
      "252", "253", "254"
    )
  ),
  trial = rep(c(1, 2), 28),
  activity_ratio = c(
    0.1313027016689785, 0.08387917431645128, 0.1395420340967623,
    0.09844057594710427, 0.19414443290359096, 0.16304581176275632,
    0.17274983272168504, 0.17357956037939837, 0.09729583968716982,
    0.05138063319955499, 0.14298075594540044, 0.10179701101266003,
    0.09168390375802275, 0.11591243874797318, 0.2521345405747349,
    0.16335726666875724, 0.13436311090275369, 0.12012636336085161,
    0.13868852567209072, 0.12008249718946021, 0.27708418835127824,
    0.22042035159734397, 0.2649703945513039, 0.22158610629846917,
    0.2001770607989554, 0.2238562351804714, 0.1105503693420828,
    0.08255349183783911, 0.21927303214082697, 0.22211274055043914,
    0.10446530203550744, 0.11336175801811256, 0.0826812722435201,
    0.09328851878674252, 0.13701773797551595, 0.1297098120849381,
    0.05986226055235673, 0.14423247009476106, 0.19474645802355026,
    0.1713563584485577, 0.25663498351317365, 0.30249307043720924,
    0.09082761877930186, 0.10402396536249521, 0.21941679494558652,
    0.28459112981037343, 0.11218161441362348, 0.12449715062493952,
    0.18427917423975973, 0.14845015830783756, 0.19444224064643065,
    0.13471565660441723, 0.11247341287367296, 0.08660523675310272,
    0.1763980204528711, 0.1049572229068965
  ),
) |>
  dplyr::group_by(animal_id)

# Clearly quite a lot of the variance comes from Animal ID
df_ratio_episode |> 
  ggplot(aes(x = fct_reorder(animal_id, activity_ratio, .desc = TRUE), 
             y = activity_ratio)) +
  geom_line(aes(group = animal_id), alpha = 0.5) +
  geom_point(aes(colour = factor(trial))) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_colour_paletteer_d("nationalparkcolors::Badlands") +
  labs(x = "Animal ID",
       y = "Activity Ratio") +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.text.x = element_blank()
  )


# Fit beta GLMM with glmmTMB
glmm_ratio_beta <- glmmTMB::glmmTMB(activity_ratio ~ trial + (1 | animal_id),
                                        data = df_ratio_episode,
                                        family = glmmTMB::beta_family)

performance::icc(glmm_ratio_beta)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.077
#>   Unadjusted ICC: 0.077
performance::icc(glmm_ratio_beta, ci = 0.95, method = "boot")
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.077 [1.000, 1.000]
#>   Unadjusted ICC: 0.077 [0.889, 1.000]

# Same but Gaussian
glmm_ratio_gauss <- glmmTMB::glmmTMB(activity_ratio ~ trial + (1 | animal_id),
                                       data = df_ratio_episode,
                                       family = gaussian)

performance::icc(glmm_ratio_gauss)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.782
#>   Unadjusted ICC: 0.772
performance::icc(glmm_ratio_gauss, ci = 0.95, method = "boot")
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.782 [0.616, 0.891]
#>   Unadjusted ICC: 0.772 [0.592, 0.886]

# Meanwhile the Bayesian equivalent gives a high variance ratio (and the ICC is off, but I think that is somewhat expected right?)
bayes_ratio_episode <- brms::brm(activity_ratio ~ trial + (1 | animal_id),
                                       data = df_ratio_episode,
                                       family = brms::Beta())
#> Compiling Stan program...
#> Trying to compile a simple C file
#> [long clang compile command and include trace omitted]
#> /Volumes/ResearchSSD/002-Tracking-1Day/.renv/library/macos/R-4.4/x86_64-apple-darwin20/RcppEigen/include/Eigen/src/Core/util/Macros.h:679:10: fatal error: 'cmath' file not found
#> #include <cmath>
#>          ^~~~~~~
#> 1 error generated.
#> make: *** [foo.o] Error 1
#> Start sampling
#> 
#> SAMPLING FOR MODEL 'anon_model' NOW (CHAIN 1).
#> Chain 1: 
#> Chain 1: Gradient evaluation took 8.2e-05 seconds
#> Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.82 seconds.
#> Chain 1: Adjust your expectations accordingly!
#> Chain 1: 
#> Chain 1: 
#> Chain 1: Iteration:    1 / 2000 [  0%]  (Warmup)
#> Chain 1: Iteration:  200 / 2000 [ 10%]  (Warmup)
#> Chain 1: Iteration:  400 / 2000 [ 20%]  (Warmup)
#> Chain 1: Iteration:  600 / 2000 [ 30%]  (Warmup)
#> Chain 1: Iteration:  800 / 2000 [ 40%]  (Warmup)
#> Chain 1: Iteration: 1000 / 2000 [ 50%]  (Warmup)
#> Chain 1: Iteration: 1001 / 2000 [ 50%]  (Sampling)
#> Chain 1: Iteration: 1200 / 2000 [ 60%]  (Sampling)
#> Chain 1: Iteration: 1400 / 2000 [ 70%]  (Sampling)
#> Chain 1: Iteration: 1600 / 2000 [ 80%]  (Sampling)
#> Chain 1: Iteration: 1800 / 2000 [ 90%]  (Sampling)
#> Chain 1: Iteration: 2000 / 2000 [100%]  (Sampling)
#> Chain 1: 
#> Chain 1:  Elapsed Time: 0.618 seconds (Warm-up)
#> Chain 1:                0.645 seconds (Sampling)
#> Chain 1:                1.263 seconds (Total)
#> Chain 1: 
#> 
#> [near-identical sampling output for chains 2-4 omitted]
performance::variance_decomposition(bayes_ratio_episode)
#> # Random Effect Variances and ICC
#> 
#> Conditioned on: all random effects
#> 
#> ## Variance Ratio (comparable to ICC)
#> Ratio: 0.74  CI 95%: [0.44 0.87]
#> 
#> ## Variances of Posterior Predicted Distribution
#> Conditioned on fixed effects: 0.00  CI 95%: [0.00 0.00]
#> Conditioned on rand. effects: 0.00  CI 95%: [0.00 0.01]
#> 
#> ## Difference in Variances
#> Difference: 0.00  CI 95%: [0.00 0.00]
# performance::icc(bayes_ratio_episode)

Created on 2024-07-05 with reprex v2.1.0

strengejacke commented 3 months ago

The variance components for the beta-model don't look that bad, actually.

library(glmmTMB)
library(performance)

df_ratio_episode <- data.frame(
  animal_id = factor(
    rep(
      c(
        "208", "209", "210", "214", "223", "228", "211", "213", "217", "222", "234",
        "241", "216", "230", "231", "240", "242", "244", "218", "220", "225", "237",
        "239", "219", "251", "252", "253", "254"
      ),
      each = 2L
    ),
    levels = c(
      "200", "204", "205", "206", "215", "224", "208", "209", "210", "214", "223",
      "228", "211", "213", "217", "222", "234", "241", "216", "230", "231", "240",
      "242", "244", "218", "220", "225", "237", "239", "245", "219", "236", "251",
      "252", "253", "254"
    )
  ),
  trial = rep(c(1, 2), 28),
  activity_ratio = c(
    0.1313027016689785, 0.08387917431645128, 0.1395420340967623,
    0.09844057594710427, 0.19414443290359096, 0.16304581176275632,
    0.17274983272168504, 0.17357956037939837, 0.09729583968716982,
    0.05138063319955499, 0.14298075594540044, 0.10179701101266003,
    0.09168390375802275, 0.11591243874797318, 0.2521345405747349,
    0.16335726666875724, 0.13436311090275369, 0.12012636336085161,
    0.13868852567209072, 0.12008249718946021, 0.27708418835127824,
    0.22042035159734397, 0.2649703945513039, 0.22158610629846917,
    0.2001770607989554, 0.2238562351804714, 0.1105503693420828,
    0.08255349183783911, 0.21927303214082697, 0.22211274055043914,
    0.10446530203550744, 0.11336175801811256, 0.0826812722435201,
    0.09328851878674252, 0.13701773797551595, 0.1297098120849381,
    0.05986226055235673, 0.14423247009476106, 0.19474645802355026,
    0.1713563584485577, 0.25663498351317365, 0.30249307043720924,
    0.09082761877930186, 0.10402396536249521, 0.21941679494558652,
    0.28459112981037343, 0.11218161441362348, 0.12449715062493952,
    0.18427917423975973, 0.14845015830783756, 0.19444224064643065,
    0.13471565660441723, 0.11247341287367296, 0.08660523675310272,
    0.1763980204528711, 0.1049572229068965
  )
)

glmm_ratio_beta <- glmmTMB::glmmTMB(activity_ratio ~ trial + (1 | animal_id),
  data = df_ratio_episode,
  family = glmmTMB::beta_family
)
insight::get_variance(glmm_ratio_beta)
#> $var.fixed
#> [1] 0.003054337
#> 
#> $var.random
#> [1] 0.160573
#> 
#> $var.residual
#> [1] 1.92015
#> 
#> $var.distribution
#> [1] 1.92015
#> 
#> $var.dispersion
#> [1] 0
#> 
#> $var.intercept
#> animal_id 
#>  0.160573

glmm_ratio_gauss <- glmmTMB::glmmTMB(activity_ratio ~ trial + (1 | animal_id),
  data = df_ratio_episode,
  family = gaussian
)
insight::get_variance(glmm_ratio_gauss)
#> $var.fixed
#> [1] 4.886653e-05
#> 
#> $var.random
#> [1] 0.00282474
#> 
#> $var.residual
#> [1] 0.0007858734
#> 
#> $var.distribution
#> [1] 0.0007858734
#> 
#> $var.dispersion
#> [1] 0
#> 
#> $var.intercept
#>  animal_id 
#> 0.00282474

Created on 2024-07-05 with reprex v2.1.0

roaldarbol commented 3 months ago

The residual variance seems suspiciously high to me though. 🤔

strengejacke commented 3 months ago

We had this discussion before, but I can't find it right now. There seems to be a mismatch between the variance function returned by glmmTMB::beta_family():

> family(mod1)$variance
function (mu)
{
    mu * (1 - mu)
}

and what the docs suggest:

V=μ(1−μ)/(ϕ+1)

(which is mu * (1 - mu) / (1 + phi))
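A quick simulation can sanity-check the documented formula. This is a sketch using the standard mean/precision parameterisation Beta(mu * phi, (1 - mu) * phi); the values are illustrative, not taken from any model in this thread:

```r
# Draw from a Beta distribution parameterised by mean (mu) and precision (phi)
# and compare the sample variance with the documented V = mu * (1 - mu) / (1 + phi).
set.seed(42)
mu <- 0.15
phi <- 20
y <- rbeta(1e6, shape1 = mu * phi, shape2 = (1 - mu) * phi)
var(y)                     # close to the theoretical value
mu * (1 - mu) / (1 + phi)  # 0.0060714...
```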

The code base in insight is inconclusive, as well:

# Get distributional variance for beta-family
# ----------------------------------------------
.variance_family_beta <- function(model, mu, phi) {
  stats::family(model)$variance(mu)
  # was:
  # mu * (1 - mu) / (1 + phi)
  # but that code is not what "glmmTMB" uses for the beta family
  # mu * (1 - mu)
}
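To see what is at stake numerically: plugging the variance components that insight::get_variance() reports for the beta model above into the ICC formula reproduces the questioned estimate, and dividing the distributional variance by (1 + phi) would raise it substantially. The phi value below is purely illustrative, not taken from the fitted model:

```r
# Variance components as reported by insight::get_variance(glmm_ratio_beta)
# elsewhere in this thread.
var_random   <- 0.160573
var_residual <- 1.92015  # currently based on mu * (1 - mu)

var_random / (var_random + var_residual)              # ~0.077, the questioned ICC

# Hypothetical: scale the distributional variance by 1 / (1 + phi)
phi <- 25                                             # illustrative value only
var_random / (var_random + var_residual / (1 + phi))  # much closer to the Gaussian ICC
```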

Tagging @bbolker

strengejacke commented 3 months ago

It's hard to find something to validate against. The current implementation in insight and performance yields results similar to betareg for your example in https://github.com/easystats/insight/issues/664:

library(glmmTMB)
set.seed(123)
a <- seq(from = 5, to = 95)
b1 <- jitter(a, factor = 20)
b2 <- jitter(a, factor = 20)
b2 <- b2 + 30
b3 <- jitter(a, factor = 20)
b3 <- b3 + 30
t_a <- rep('a', length(a))
t_b <- rep('b', length(a))

c <- as.factor(a)
d <- data.frame(id = c(c,c,c,c),
            value = c(a,b1,b2,b3),
            treatment = c(t_a, t_a, t_b, t_b))
d$value <- d$value / (max(d$value) + 0.1)
m <- glmmTMB::glmmTMB(value ~ treatment,
                              data = d,
                              family=beta_family)
performance::r2_nakagawa(m)
#> Random effect variances not available. Returned R2 does not account for random effects.
#> # R2 for Mixed Models
#> 
#>   Conditional R2: NA
#>      Marginal R2: 0.289

m2 <- betareg::betareg(value ~ treatment, data = d)
summary(m2)$pseudo.r.squared
#> [1] 0.2434683

The next results (including id) differ - the betareg value seems too high?

m <- glmmTMB::glmmTMB(value ~ treatment + (1 | id),
                              data = d,
                              family=beta_family)
performance::r2_nakagawa(m) # 0.685
m2 <- betareg::betareg(value ~ treatment + id, data = d)
summary(m2)$pseudo.r.squared # 0.940

But... for the example in this issue, using mu * (1 - mu) / (1 + phi) seems to fit better than the current implementation in insight (which is mu * (1 - mu)).

roaldarbol commented 3 months ago

I actually think the betareg version is on point - it helps to plot it out, and here you can really see that the ICC should be super high for the first example!

library(glmmTMB)
#> Warning in checkDepPackageVersion(dep_pkg = "TMB"): Package version inconsistency detected.
#> glmmTMB was built with TMB version 1.9.11
#> Current TMB version is 1.9.14
#> Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
library(performance)
library(ggplot2)
library(paletteer)
set.seed(123)
a <- seq(from = 5, to = 95)
b1 <- jitter(a, factor = 20)
b2 <- jitter(a, factor = 20)
b2 <- b2 + 30
b3 <- jitter(a, factor = 20)
b3 <- b3 + 30
t_a <- rep('a', length(a))
t_b <- rep('b', length(a))

c <- as.factor(a)
d <- data.frame(id = c(c,c,c,c),
                value = c(a,b1,b2,b3),
                treatment = c(t_a, t_a, t_b, t_b))
d$value <- d$value / (max(d$value) + 0.1)

d |> 
  ggplot(aes(x = forcats::fct_reorder(id, value, .desc = TRUE), 
             y = value)) +
  geom_line(aes(group = id), alpha = 0.5) +
  geom_point() +
  theme(panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.text.x = element_blank()
  ) +
  facet_grid(~ treatment)

Created on 2024-07-05 with reprex v2.1.0

roaldarbol commented 3 months ago

Did some more digging. :-) It seems that for the Gaussian family, the conditional R2 of a model without random effects pretty much equals the Adjusted ICC of a random-effects model. For the beta family, that's not the case.

library(glmmTMB)
#> Warning in checkDepPackageVersion(dep_pkg = "TMB"): Package version inconsistency detected.
#> glmmTMB was built with TMB version 1.9.11
#> Current TMB version is 1.9.14
#> Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
library(performance)
library(brms)
#> Loading required package: Rcpp
#> Loading 'brms' package (version 2.21.0). Useful instructions
#> can be found by typing help('brms'). A more detailed introduction
#> to the package is available through vignette('brms_overview').
#> 
#> Attaching package: 'brms'
#> The following object is masked from 'package:glmmTMB':
#> 
#>     lognormal
#> The following object is masked from 'package:stats':
#> 
#>     ar
library(betareg)
set.seed(123)
a <- seq(from = 5, to = 95)
b1 <- jitter(a, factor = 20)
b2 <- jitter(a, factor = 20)
b2 <- b2 + 30
b3 <- jitter(a, factor = 20)
b3 <- b3 + 30
t_a <- rep('a', length(a))
t_b <- rep('b', length(a))

c <- as.factor(a)
d <- data.frame(id = c(c,c,c,c),
                value = c(a,b1,b2,b3),
                treatment = c(t_a, t_a, t_b, t_b))
d$value <- d$value / (max(d$value) + 0.1)

# GLMMs and GLMs with beta and Gaussian families
glmm_beta <- glmmTMB::glmmTMB(value ~ treatment + (1|id),
                        data = d,
                        family=beta_family)
glm_beta <- betareg(value ~ treatment + id,
                           data = d)
glmm_gauss <- glmmTMB::glmmTMB(value ~ treatment + (1|id),
                         data = d,
                         family = gaussian)
glm_gauss <- glm(value ~ treatment + id,
                            data = d,
                            family = gaussian)
icc(glmm_beta)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.619
#>   Unadjusted ICC: 0.511
r2(glmm_beta)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.685
#>      Marginal R2: 0.174
r2(glm_beta)
#> # R2 for Beta Regression
#>   Pseudo R2: 0.941
icc(glmm_gauss)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.995
#>   Unadjusted ICC: 0.749
r2(glmm_gauss)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.996
#>      Marginal R2: 0.247
r2(glm_gauss)
#>   R2: 0.997

Created on 2024-07-05 with reprex v2.1.0

A side note, possibly for a different issue: it seems that variance_decomposition() resembles the Unadjusted ICC rather than the Adjusted ICC (compare the Gaussian examples above and below). It would be great to have both Adjusted and Unadjusted.

bayes_beta <- brms::brm(value ~ treatment + (1|id),
                       data = d,
                       family=Beta)
#> Compiling Stan program...
#> I just removed all the compiling for convenience....
bayes_beta_fe <- brms::brm(value ~ treatment + id,
                           data = d,
                           family = Beta)
#> Compiling Stan program...
#> I just removed all the compiling for convenience....
variance_decomposition(bayes_beta)
#> # Random Effect Variances and ICC
#> 
#> Conditioned on: all random effects
#> 
#> ## Variance Ratio (comparable to ICC)
#> Ratio: 0.61  CI 95%: [0.58 0.65]
#> 
#> ## Variances of Posterior Predicted Distribution
#> Conditioned on fixed effects: 0.02  CI 95%: [0.02 0.03]
#> Conditioned on rand. effects: 0.06  CI 95%: [0.06 0.06]
#> 
#> ## Difference in Variances
#> Difference: 0.04  CI 95%: [0.03 0.04]
r2(bayes_beta)
#> # Bayesian R2 with Compatibility Interval
#> 
#>   Conditional R2: 0.979 (95% CI [0.976, 0.981])
#>      Marginal R2: 0.326 (95% CI [0.311, 0.341])
r2(bayes_beta_fe)
#> # Bayesian R2 with Compatibility Interval
#> 
#>   Conditional R2: 0.979 (95% CI [0.976, 0.981])
variance_decomposition(bayes_gauss)
#> # Random Effect Variances and ICC
#> 
#> Conditioned on: all random effects
#> 
#> ## Variance Ratio (comparable to ICC)
#> Ratio: 0.75  CI 95%: [0.74 0.76]
#> 
#> ## Variances of Posterior Predicted Distribution
#> Conditioned on fixed effects: 0.01  CI 95%: [0.01 0.01]
#> Conditioned on rand. effects: 0.06  CI 95%: [0.06 0.06]
#> 
#> ## Difference in Variances
#> Difference: 0.04  CI 95%: [0.04 0.04]
r2(bayes_gauss)
#> # Bayesian R2 with Compatibility Interval
#> 
#>   Conditional R2: 0.996 (95% CI [0.996, 0.996])
#>      Marginal R2: 0.247 (95% CI [0.242, 0.252])
r2(bayes_gauss_fe)
#> # Bayesian R2 with Compatibility Interval
#> 
#>   Conditional R2: 0.996 (95% CI [0.996, 0.996])

strengejacke commented 3 months ago

betareg computes the R2 as the squared correlation between the linear predictor and the link-transformed response (cor(eta, linkfun(y))^2). I'm not sure this always translates perfectly to the mixed-models scenario.

bbolker commented 3 months ago

This is definitely a definition of R^2, closest to Efron's pseudo-R^2 (although Efron computes the correlation on the response scale before squaring it ...). It would be interesting to see where betareg got that variant from ... I haven't found it in any of my pseudo-R^2 references (Wikipedia, UCLA, my class notes; I'm not sure what source I used for the notes, and it doesn't seem to be in the textbook I used ...)
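The two flavours can be contrasted on toy data; the data-generating process below is made up purely for illustration:

```r
# Compare the betareg-style pseudo-R2 (correlation on the link scale, squared)
# with the Efron-style version (correlation on the response scale, squared).
set.seed(1)
x   <- rnorm(200)
eta <- 0.5 + 0.8 * x                       # toy linear predictor
y   <- plogis(eta + rnorm(200, sd = 0.4))  # response on (0, 1), noise on the logit scale
cor(eta, qlogis(y))^2    # betareg-style: link (logit) scale
cor(plogis(eta), y)^2    # Efron-style: response scale
```

With noise added on the link scale, the link-scale version recovers the signal-to-noise ratio almost exactly, while the response-scale version differs slightly because the inverse link is nonlinear.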

strengejacke commented 3 months ago

A quick search in the related publication (https://www.jstatsoft.org/article/view/v034i02) turns up only one match for "R2". Maybe it's described in more detail elsewhere.

May I also point to this comment: https://github.com/easystats/performance/issues/742#issuecomment-2210936164

What's your opinion on this?

strengejacke commented 3 months ago

Here: https://stats.stackexchange.com/questions/233169/pseudo-r-squared-and-multicollinearity-checks-with-beta-regression

But what would be "the linear predictor for the mean"? Maybe that would help improve insight::get_variance().

Edit: found this image

roaldarbol commented 2 months ago

@strengejacke For the uninitiated, could you write a line or two about why we talk about R2 almost exclusively when the issue is about ICC? I imagine the mathematical connection is implicit, but it's not entirely clear. :-) Also, feel free to change the issue title to reflect that it affects both ICC and R2 for the beta family, if they're two parts of the same puzzle. 😊

bbolker commented 2 months ago

A bit out of order in the thread, but responding to the comment above about the mismatch between the variance function (mu * (1 - mu)) and the conditional variance of the distribution (mu * (1 - mu) / (1 + phi)).

? glmmTMB::beta_family says:

variance: a function of either 1 (mean) or 2 (mean and dispersion parameter) arguments giving a value proportional to the predicted variance (scaled by ‘sigma(.)’)

This is for consistency with the way that $variance is implemented/interpreted in stats::glm(). I think I've raised this point before ...

strengejacke commented 2 months ago

Thanks for clarification!

I think I've raised this point before ...

Yes, definitely, it's just I couldn't find the thread/post. :-/

strengejacke commented 2 months ago

@strengejacke For the uninitiated, could you write a line or two about why we talk about R2 almost exclusively when the issue is about ICC? I imagine the mathematical connection is implicit, but it's not entirely clear. :-) Also, feel free to change the issue title to reflect that it affects both ICC and R2 for the beta family, if they're two parts of the same puzzle. 😊

Sure. The ICC is relevant for mixed models only. It is calculated by dividing the random effect variance, σ²ᵢ, by the total variance, i.e. the sum of the random effect variance and the residual variance, σ²ε.

The R2 for mixed models (Nakagawa's R2) is calculated in a similar way:

Marginal R2 = σ²fixed / (σ²fixed + σ²random + σ²residual)
Conditional R2 = (σ²fixed + σ²random) / (σ²fixed + σ²random + σ²residual)

The main point is that for both the ICC and the R2 of mixed models, we need the same variance components. That's why both are related.
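As a toy illustration of how the same components feed both statistics (the numbers are invented, not from any model in this thread):

```r
# Three variance components shared by the adjusted ICC and Nakagawa's R2.
var_fixed    <- 0.30
var_random   <- 0.50
var_residual <- 0.20

var_random / (var_random + var_residual)                            # adjusted ICC:   0.714...
var_fixed / (var_fixed + var_random + var_residual)                 # marginal R2:    0.30
(var_fixed + var_random) / (var_fixed + var_random + var_residual)  # conditional R2: 0.80
```

Any error in one of the components (such as the beta-family residual variance discussed here) therefore propagates to both the ICC and the R2.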

roaldarbol commented 2 months ago

Thanks for the clarification, that helps.

Just tested out the latest changes - Ferrari's R2 looks more in line with what I'd expect. Will this eventually change the ICC too? A few notes on consistency as of the latest commit:

library(glmmTMB)
#> Warning in checkDepPackageVersion(dep_pkg = "TMB"): Package version inconsistency detected.
#> glmmTMB was built with TMB version 1.9.11
#> Current TMB version is 1.9.14
#> Please re-install glmmTMB from source or restore original 'TMB' package (see '?reinstalling' for more information)
library(performance)
library(betareg)
set.seed(123)
a <- seq(from = 5, to = 95)
b1 <- jitter(a, factor = 20)
b2 <- jitter(a, factor = 20)
b2 <- b2 + 30
b3 <- jitter(a, factor = 20)
b3 <- b3 + 30
t_a <- rep('a', length(a))
t_b <- rep('b', length(a))

c <- as.factor(a)
d <- data.frame(id = c(c,c,c,c),
                value = c(a,b1,b2,b3),
                treatment = c(t_a, t_a, t_b, t_b))
d$value <- d$value / (max(d$value) + 0.1)

# Specify models
glmm_beta <- glmmTMB::glmmTMB(value ~ treatment + (1|id),
                              data = d,
                              family=beta_family)
glm_betareg <- betareg(value ~ treatment + id,
                    data = d)
glm_betafamily <- glm(value ~ treatment + id,
                 data = d,
                 family = beta_family)
glmm_gauss <- glmmTMB::glmmTMB(value ~ treatment + (1|id),
                               data = d,
                               family = gaussian)
glm_gauss <- glm(value ~ treatment + id,
                 data = d,
                 family = gaussian)
# ICC
icc(glmm_beta)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.619
#>   Unadjusted ICC: 0.511
icc(glmm_gauss)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.995
#>   Unadjusted ICC: 0.749

# ICC Bootstrapped
icc(glmm_beta, method = "boot", ci = 0.95)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.619 [1.000, 1.000]
#>   Unadjusted ICC: 0.511 [0.660, 0.785]
icc(glmm_gauss, method = "boot", ci = 0.95)
#> # Intraclass Correlation Coefficient
#> 
#>     Adjusted ICC: 0.995 [0.993, 0.996]
#>   Unadjusted ICC: 0.749 [0.693, 0.795]

# R2 - default
r2(glmm_beta)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.685
#>      Marginal R2: 0.174
r2(glm_betareg)
#> # R2 for Beta Regression
#>   Pseudo R2: 0.941
r2(glm_betafamily)
#> # R2 for Generalized Linear Regression
#>   Nagelkerke's R2: NaN
r2(glmm_gauss)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.996
#>      Marginal R2: 0.247
r2(glm_gauss)
#>   R2: 0.997

# R2 - Ferrari
r2_ferrari(glmm_beta)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.941
r2_ferrari(glm_betareg)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.941
r2_ferrari(glm_betafamily)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.929

# R2 Bootstrapped
r2(glmm_beta, method = "boot", ci = 0.95)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.685 [1.000, 1.000]
#>      Marginal R2: 0.174 [0.214, 0.326]
r2(glm_betareg, method = "boot", ci = 0.95)
#> # R2 for Beta Regression
#>   Pseudo R2: 0.941
r2(glm_betafamily, method = "boot", ci = 0.95)
#> Error in stats::integrate(.dRsq, i, j, R2_pop = dots$R2_pop, R2_obs = dots$R2_obs, : non-finite function value
r2(glmm_gauss, method = "boot", ci = 0.95)
#> # R2 for Mixed Models
#> 
#>   Conditional R2: 0.996 [0.995, 0.997]
#>      Marginal R2: 0.247 [0.198, 0.302]
r2(glm_gauss, method = "boot", ci = 0.95)
#>   R2: 0.997 [0.995, 0.997]

Created on 2024-07-08 with reprex v2.1.0

strengejacke commented 2 months ago

> and the conditional variance of the distribution, `(mu * (1 - mu)) / (1 + phi)`

@bbolker: That means, in `insight::get_variance()`, the conditional variance of the distribution, `(mu * (1 - mu)) / (1 + phi)`, should be used?
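For readers following along, here is a base-R sanity check of that formula, independent of any fitted model: a beta variable parameterized by mean `mu` and precision `phi` has variance `mu * (1 - mu) / (1 + phi)`. The numbers below are illustrative, not taken from the models above.

```r
# Sanity check: a Beta(mu * phi, (1 - mu) * phi) variable has
# variance mu * (1 - mu) / (1 + phi). Illustrative values.
set.seed(42)
mu  <- 0.6
phi <- 5
theoretical <- mu * (1 - mu) / (1 + phi)  # 0.04
empirical   <- var(rbeta(1e6, shape1 = mu * phi, shape2 = (1 - mu) * phi))
c(theoretical = theoretical, empirical = empirical)  # both close to 0.04
```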

bbolker commented 2 months ago

Yes, the conditional variance should definitely be divided by (1+phi)

library(glmmTMB)
set.seed(101)
dd <- data.frame(x=rnorm(100))
## (I was originally planning to use an x covariate, now ignored)
dd$y <- simulate_new(~ 1, family = beta_family, newdata = dd,
           newparams = list(beta = c(1), betadisp = 2))[[1]]
m <- glmmTMB(y~1, family = beta_family, data = dd)
mu <- predict(m, type = "response")
var1 <- family(m)$var(mu)/(1+sigma(m)) ## 0.0187
var(dd$y) ## 0.0193
strengejacke commented 2 months ago

What about the `ordbeta` family? Just `mu * (1 - mu)`, or also divided by `(1 + phi)`?

strengejacke commented 2 months ago

> Currently the default R2 for a glmmTMB and glm with beta_family is not Ferrari (IDK what the default is for glmm, and it's Nagelkerke's R2 for glm, which gives NaN)

It's now implemented for glm (see https://github.com/easystats/insight/commit/887bb59acfd3521131db418ea508838b600b340d). For glmmTMB, r2_ferrari() is only used for non-mixed models. Mixed models always use r2_nakagawa() by default when r2() is called.
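As background, Ferrari & Cribari-Neto's pseudo-R2 is the squared correlation between the linear predictor and the link-transformed response, which can be computed by hand. The sketch below uses made-up data, not the reprex models, and assumes the betareg package is installed.

```r
# Ferrari & Cribari-Neto pseudo-R2 by hand: squared correlation between
# the linear predictor and the link-transformed (logit) response.
# Illustrative data only; should match summary(fit)$pseudo.r.squared.
library(betareg)
set.seed(7)
n  <- 200
x  <- rnorm(n)
mu <- plogis(0.3 + 0.9 * x)
y  <- rbeta(n, mu * 10, (1 - mu) * 10)
fit <- betareg(y ~ x)
cor(predict(fit, type = "link"), qlogis(y))^2
```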

> The Ferrari R2 is slightly different for a glmmTMB and beta_family compared to betareg

Only if you fit a mixed model for glmmTMB. For non-mixed models, results are identical.

> Bootstrapped ICC and R2 for glmmTMB with beta_family is wrong

Yeah, not sure about this one. Will look into it.

> Bootstrapping for betareg R2 does nothing. Bootstrapping for glm with beta_family fails

Bootstrapping is not implemented for R2 for non-mixed models.

You should be able to update insight, which now finally is supposed to return the correct ICC/R2 for mixed models from the beta-family.
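For context on what these numbers mean: the adjusted ICC is the random-intercept variance divided by the sum of the random-intercept and distribution-specific (residual) variances. The sketch below illustrates that ratio with hypothetical numbers, not values from the models above; note that insight computes the actual components on the link scale with further approximations.

```r
# Adjusted ICC as a variance ratio, with hypothetical components
# (not taken from the models in this thread).
var_ranef <- 0.30                       # hypothetical random-intercept variance
mu  <- 0.55                             # hypothetical beta-family mean
phi <- 4                                # hypothetical precision parameter
var_dist <- mu * (1 - mu) / (1 + phi)   # distribution-specific variance: 0.0495
var_ranef / (var_ranef + var_dist)      # adjusted ICC: ~0.858
```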

roaldarbol commented 2 months ago

> The Ferrari R2 is slightly different for a glmmTMB and beta_family compared to betareg

> Only if you fit a mixed model for glmmTMB. For non-mixed models, results are identical.

Actually it's the other way around: the mixed model matches the betareg model, while the GLM is different. See the output above (quoted below).

# R2 - Ferrari
r2_ferrari(glmm_beta)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.941
r2_ferrari(glm_betareg)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.941
r2_ferrari(glm_betafamily)
#> # R2 for Generalized Linear Regression
#>   Ferrari's R2: 0.929

> You should be able to update insight, which now finally is supposed to return the correct ICC/R2 for mixed models from the beta-family.

Works really well now!!! 🥳🥳🥳

strengejacke commented 2 months ago
library(glmmTMB)
set.seed(101)
dd <- data.frame(x=rnorm(100))
## (I was originally planning to use an x covariate, now ignored)
dd$y <- simulate_new(~ 1, family = beta_family, newdata = dd,
           newparams = list(beta = c(1), betadisp = 2))[[1]]
m <- glmmTMB(y~1, family = beta_family, data = dd)
mu <- predict(m, type = "response")
var1 <- family(m)$var(mu)/(1+sigma(m)) ## 0.0187
var(dd$y) ## 0.0193

Two questions: does this work for you? I get:

library(glmmTMB)
set.seed(101)
dd <- data.frame(x=rnorm(100))
## (I was originally planning to use an x covariate, now ignored)
dd$y <- simulate_new(~ 1, family = beta_family, newdata = dd,
           newparams = list(beta = c(1), betadisp = 2))[[1]]
#> Error in eval(family$initialize): y values must be 0 < y < 1

Second is https://github.com/easystats/performance/issues/742#issuecomment-2214719960.

bbolker commented 2 months ago
1. I think this should work with the devel version; the simulation bug was fixed in https://github.com/glmmTMB/glmmTMB/pull/1008.
2. Hmm. ordbeta uses the same `$variance` and definition of `sigma()` as beta (I just added a note to `?sigma.glmmTMB`): see https://github.com/glmmTMB/glmmTMB/blob/master/glmmTMB/src/glmmTMB.cpp#L733-L737. But I think the actual conditional variances will not be the same, because the parameterization is relative to the underlying/baseline beta distribution, not counting the modifications created by moving to the ordered-beta specification. You might be able to work this out by hand (by analogy with the variances of ZI models, hurdle models, etc.), or maybe look in Kubinec's paper, or ask him if he knows the answer?
roaldarbol commented 2 months ago

For the confidence intervals, it seems that none of the `ci_method` options work (neither the default, `"boot"`, nor `"analytical"`).

icc(glmm_beta, ci = 0.95)
icc(glmm_beta, ci = 0.95, method = "boot")
icc(glmm_beta, ci = 0.95, method = "analytical")

(Installed both {performance} and {insight} from the latest commits, https://github.com/easystats/performance/commit/06ed6f11f54c6fd35f952c30885b453f6cc89843 and https://github.com/easystats/insight/commit/d346f9527c3b3eb45283118f44b902806b9b7cd7)

strengejacke commented 2 months ago
> 1. I think this should work with the devel version; the simulation bug was fixed in glmmTMB/glmmTMB#1008 ("Simulate new init fix").
> 2. Hmm. ordbeta uses the same `$variance` and definition of `sigma()` as beta (I just added a note to `?sigma.glmmTMB`): see https://github.com/glmmTMB/glmmTMB/blob/master/glmmTMB/src/glmmTMB.cpp#L733-L737. But I think the actual conditional variances will not be the same, because the parameterization is relative to the underlying/baseline beta distribution, not counting the modifications created by moving to the ordered-beta specification. You might be able to work this out by hand (by analogy with the variances of ZI models, hurdle models, etc.), or maybe look in Kubinec's paper, or ask him if he knows the answer?

Thanks! Will open a new issue and close this as resolved for now.

roaldarbol commented 2 months ago

In case anyone tries `icc()` for the beta family and thinks the results are weird: I think so too. They are quite high even when you draw two completely unrelated samples from a beta distribution (values like 0.4-0.5). I would love to test this thoroughly, but don't have the time currently. I posted an example that could be used to dig in here.
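A minimal sketch along those lines, assuming current glmmTMB and performance: two groups drawn independently from the same beta distribution, fit with a random intercept per group. In principle the ICC should be near zero here, so a high value would reproduce the oddity described above. The simulation settings are hypothetical.

```r
# Hypothetical check: two completely unrelated beta samples, one per group.
# The between-group (random intercept) variance, and hence the ICC,
# should be close to zero in principle.
library(glmmTMB)
library(performance)
set.seed(123)
d <- data.frame(
  id    = rep(c("a", "b"), each = 100),
  value = c(rbeta(100, 2, 5), rbeta(100, 2, 5))  # independent draws
)
m <- glmmTMB(value ~ 1 + (1 | id), data = d, family = beta_family())
icc(m)
```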