AndriSignorell / DescTools

Tools for Descriptive Statistics and Exploratory Data Analysis
http://andrisignorell.github.io/DescTools/
86 stars 18 forks source link

PseudoR2 McFadden: wrong values for (successes/failures) binomial GLMs #127

Open MarcRieraDominguez opened 1 year ago

MarcRieraDominguez commented 1 year ago

Hi! Congratluations on the great package! I have come across an unexpected behaviour of the PseudoR2 function: it returns a wrong value of McFadden's Pseudo-R2 for binomial GLMs that model successes out of failures. It appears to work fine with binomial GLMs that model binary responses. I provide a reprex with overdispersed binomial GLM, it shows the same warning as my own data which are not overdispersed: In res["Tjur"] <- unname(diff(tapply(y.hat.resp, y, mean, na.rm = TRUE))) : number of items to replace is not a multiple of replacement length Previous reported issues with PseudoR2 did not deal with binomial GLMs explicitly, as far as I can tell: https://github.com/AndriSignorell/DescTools/issues/19 Thank you for your time!!

library(DescTools)
library(DHARMa)
#> Warning: package 'DHARMa' was built under R version 4.2.3
#> This is DHARMa 0.4.6. For overview type '?DHARMa'. For recent changes, type news(package = 'DHARMa')
library(faraway)
#> Warning: package 'faraway' was built under R version 4.2.2
library(performance)
#> Warning: package 'performance' was built under R version 4.2.3

# Binomial GLM: k success out of K trials
data(troutegg, package="faraway") # Survival of trout eggs
m1 <- glm(cbind(survive,total-survive) ~ location+period, family=binomial, troutegg)
summary(m1)
#> 
#> Call:
#> glm(formula = cbind(survive, total - survive) ~ location + period, 
#>     family = binomial, data = troutegg)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -4.8305  -0.3650  -0.0303   0.6191   3.2434  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   4.6358     0.2813  16.479  < 2e-16 ***
#> location2    -0.4168     0.2461  -1.694   0.0903 .  
#> location3    -1.2421     0.2194  -5.660 1.51e-08 ***
#> location4    -0.9509     0.2288  -4.157 3.23e-05 ***
#> location5    -4.6138     0.2502 -18.439  < 2e-16 ***
#> period7      -2.1702     0.2384  -9.103  < 2e-16 ***
#> period8      -2.3256     0.2429  -9.573  < 2e-16 ***
#> period11     -2.4500     0.2341 -10.466  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 1021.469  on 19  degrees of freedom
#> Residual deviance:   64.495  on 12  degrees of freedom
#> AIC: 157.03
#> 
#> Number of Fisher Scoring iterations: 5

# Overdispersed binomial (depending on who you ask)
DHARMa::testDispersion(m1, alternative = "greater")

#> 
#>  DHARMa nonparametric dispersion test via sd of residuals fitted vs.
#>  simulated
#> 
#> data:  simulationOutput
#> dispersion = 1.0411, p-value = 0.144
#> alternative hypothesis: greater
performance::check_overdispersion(m1)
#> # Overdispersion test
#> 
#>        dispersion ratio =   5.330
#>   Pearson's Chi-Squared =  63.964
#>                 p-value = < 0.001
#> Overdispersion detected.

PseudoR2(m1, which = "McFadden")
#> Warning in res["Tjur"] <- unname(diff(tapply(y.hat.resp, y, mean, na.rm =
#> TRUE))): número de items para para sustituir no es un múltiplo de la longitud
#> del reemplazo
#>  McFadden 
#> 0.9989086
1 - logLik(m1)/logLik(update(m1, . ~ 1))
#> 'log Lik.' 0.8715584 (df=8)

# Binomial GLM: 0 vs 1 (dying vs surviving the Titanic's maiden voyage)
m2 <- glm(Survived ~ ., data=Untable(Titanic), family=binomial)
summary(m2)
#> 
#> Call:
#> glm(formula = Survived ~ ., family = binomial, data = Untable(Titanic))
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.0812  -0.7149  -0.6656   0.6858   2.1278  
#> 
#> Coefficients:
#>             Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   0.6853     0.2730   2.510   0.0121 *  
#> Class2nd     -1.0181     0.1960  -5.194 2.05e-07 ***
#> Class3rd     -1.7778     0.1716 -10.362  < 2e-16 ***
#> ClassCrew    -0.8577     0.1573  -5.451 5.00e-08 ***
#> SexFemale     2.4201     0.1404  17.236  < 2e-16 ***
#> AgeAdult     -1.0615     0.2440  -4.350 1.36e-05 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 2769.5  on 2200  degrees of freedom
#> Residual deviance: 2210.1  on 2195  degrees of freedom
#> AIC: 2222.1
#> 
#> Number of Fisher Scoring iterations: 4

PseudoR2(m2, which = "McFadden")
#>  McFadden 
#> 0.2019875
1 - logLik(m2)/logLik(update(m2, . ~ 1))
#> 'log Lik.' 0.2019875 (df=6)

sessionInfo()
#> R version 4.2.0 (2022-04-22 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 22000)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=Spanish_Spain.utf8  LC_CTYPE=Spanish_Spain.utf8   
#> [3] LC_MONETARY=Spanish_Spain.utf8 LC_NUMERIC=C                  
#> [5] LC_TIME=Spanish_Spain.utf8    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] performance_0.10.5 faraway_1.0.8      DHARMa_0.4.6       DescTools_0.99.45 
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.10       nloptr_2.0.1      cellranger_1.1.0  compiler_4.2.0   
#>  [5] highr_0.9         class_7.3-20      tools_4.2.0       lme4_1.1-29      
#>  [9] boot_1.3-28       digest_0.6.29     nlme_3.1-157      evaluate_0.15    
#> [13] lifecycle_1.0.3   rootSolve_1.8.2.3 lattice_0.20-45   rlang_1.1.1      
#> [17] reprex_2.0.2      Matrix_1.4-1      cli_3.6.1         rstudioapi_0.13  
#> [21] yaml_2.3.5        mvtnorm_1.1-3     expm_0.999-6      xfun_0.40        
#> [25] fastmap_1.1.0     e1071_1.7-9       httr_1.4.3        withr_2.5.0      
#> [29] stringr_1.5.0     knitr_1.39        fs_1.5.2          vctrs_0.6.3      
#> [33] gld_2.6.4         grid_4.2.0        glue_1.6.2        data.table_1.14.2
#> [37] R6_2.5.1          readxl_1.4.0      lmom_2.8          rmarkdown_2.14   
#> [41] minqa_1.2.4       magrittr_2.0.3    splines_4.2.0     htmltools_0.5.6  
#> [45] MASS_7.3-57       insight_0.19.5    Exact_3.1         stringi_1.7.6    
#> [49] proxy_0.4-26

Created on 2023-10-04 with reprex v2.0.2

AndriSignorell commented 10 months ago

@MarcRieraDominguez this is a tricky problem. Note that the loglikelihood for exactly the same model is not the same when you use the cbind frequency form as when you base it on the original data. As the PseudoR2 relies on the loglikelihood it yields to different results in dependency of the formulation of the model.

library(DescTools)
library(faraway)

# Binomial GLM: k success out of K trials
data(troutegg, package="faraway") # Survival of trout eggs
m1 <- glm(cbind(survive,total-survive) ~ location+period, family=binomial, troutegg)
summary(m1)

PseudoR2(m1)
1 - logLik(m1)/logLik(update(m1, . ~ 1))

d.set <- DescTools::Untable(xtabs( cbind(survive, total-survive) ~ location + period, troutegg))
colnames(d.set)[3] <- "survive"
levels(d.set$survive) <- c("yes", "no")
d.set$survive <- relevel(d.set$survive, ref = "no")

m2 <- glm(survive ~ location+period, family=binomial, d.set)
summary(m2)

PseudoR2(m2)
1 - logLik(m2)/logLik(update(m2, . ~ 1))

Nevertheless we will look into this issue!

MarcRieraDominguez commented 10 months ago

Thank you for your reply! I had no idea the log-likelihood could depend on model formulation, good to know!