easystats / insight

🔮 Easy access to model information for various model objects
https://easystats.github.io/insight/
GNU General Public License v3.0

Computation of SE of predictions for GLMs #311

Closed: DominiqueMakowski closed this issue 3 years ago

DominiqueMakowski commented 3 years ago

https://github.com/easystats/insight/blob/5b50279100ea10e276c775de5b2ca6bceb309f84/tests/testthat/test-get_predicted.R#L88-L93

Currently it seems like the CIs for GLMs are not correct (we can test against ciTools):

    x <- glm(vs ~ wt, data = mtcars, family = "binomial")
    head(ciTools::add_ci(mtcars, x)[12:14])
#>                        pred  LCB0.025  UCB0.975
#> Mazda RX4         0.6701904 0.4029645 0.8595094
#> Mazda RX4 Wag     0.5552378 0.3323882 0.7578847
#> Datsun 710        0.7828259 0.4680441 0.9365779
#> Hornet 4 Drive    0.3946687 0.2138805 0.6097431
#> Hornet Sportabout 0.2978405 0.1368286 0.5316274
#> Valiant           0.2899116 0.1307532 0.5256495
    ciTools::add_pi(mtcars, x)
#> Error in add_pi.glm(mtcars, x): Prediction intervals for Bernoulli response variables aren't useful

Created on 2021-03-03 by the reprex package (v0.3.0)
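
For reference, a minimal sketch of the standard approach these intervals follow (added for illustration, not part of the original reprex): build a Wald interval on the link scale, then back-transform with the inverse link. This should closely reproduce the ciTools output above.

x <- glm(vs ~ wt, data = mtcars, family = "binomial")
p <- predict(x, se.fit = TRUE)  # fit and SE on the link (logit) scale
z <- qnorm(0.975)               # normal critical value for a 95% CI
head(data.frame(
  pred  = plogis(p$fit),                 # back-transform the prediction
  lower = plogis(p$fit - z * p$se.fit),  # back-transform the bounds
  upper = plogis(p$fit + z * p$se.fit)
))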

I think the reason could be related to the SE computation:

x <- lm(mpg ~ cyl + hp, data = mtcars)

# Same for lm
as.numeric(predict.lm(x, se.fit = TRUE, interval = "confidence")$se.fit)
#>  [1] 0.7281647 0.7281647 0.9279509 0.7281647 0.9200310 0.7777664 1.0080670
#>  [8] 0.9225132 0.9362657 0.6234320 0.6234320 0.8862558 0.8862558 0.8862558
#> [15] 0.8057154 0.8206255 0.8911693 0.9099596 0.9695585 0.9127445 0.9454598
#> [22] 1.1490264 1.1490264 1.0080670 0.9200310 0.9099596 0.9205392 1.0474287
#> [29] 1.2011595 0.7635547 2.1153615 1.0175965
as.numeric(insight:::.get_predicted_ci_se(x, ci_type = "confidence"))
#>  [1] 0.7281647 0.7281647 0.9279509 0.7281647 0.9200310 0.7777664 1.0080670
#>  [8] 0.9225132 0.9362657 0.6234320 0.6234320 0.8862558 0.8862558 0.8862558
#> [15] 0.8057154 0.8206255 0.8911693 0.9099596 0.9695585 0.9127445 0.9454598
#> [22] 1.1490264 1.1490264 1.0080670 0.9200310 0.9099596 0.9205392 1.0474287
#> [29] 1.2011595 0.7635547 2.1153615 1.0175965

# Different for glm
x <- glm(vs ~ wt * mpg, data = mtcars, family = "binomial")

as.numeric(predict(x, se.fit = TRUE, interval = "confidence")$se.fit)
#>  [1] 0.6783241 0.5675334 0.7288259 0.8216023 0.5475987 0.5684152 1.4616734
#>  [8] 1.5935019 1.1005075 0.5838378 0.5987930 1.0951394 0.7396509 1.2077725
#> [15] 4.3883624 4.6493932 2.8604836 2.0769431 1.6275303 2.2131046 0.7481209
#> [22] 1.1116877 1.2202313 1.8188407 0.9722454 0.9850579 0.8143193 1.7414268
#> [29] 1.1961934 0.7183382 1.2496966 0.5858323
as.numeric(insight:::.get_predicted_ci_se(x, ci_type = "confidence"))
#>  [1] 0.7610616 0.6367575 0.8177234 0.9218160 0.6143912 0.6377467 1.6399588
#>  [8] 1.7878668 1.2347402 0.6550505 0.6718298 1.2287173 0.8298686 1.3550887
#> [15] 4.9236261 5.2164957 3.2093867 2.3302750 1.8260458 2.4830446 0.8393718
#> [22] 1.2472841 1.3690671 2.0406909 1.0908335 1.1052088 0.9136446 1.9538347
#> [29] 1.3420973 0.8059564 1.4021264 0.6572883

Created on 2021-03-03 by the reprex package (v0.3.0)

Looking at the source code, predict.glm() directly uses predict.lm()... but we get correct results for predict.lm(), so I'm not sure why it then differs for GLMs 🤔
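
For reference (a quick check, not from the original comment), the delegation is visible in the function body itself:

# Print the line(s) of stats::predict.glm that call predict.lm() internally
# for the link-scale fit and standard errors.
grep("predict.lm", deparse(stats::predict.glm), value = TRUE)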

strengejacke commented 3 years ago

I can look into this later.

strengejacke commented 3 years ago
x <- glm(vs ~ wt * mpg, data = mtcars, family = "binomial")
as.numeric(predict(x, se.fit = TRUE, interval = "confidence")$se.fit)
#>  [1] 0.7610616 0.6367575 0.8177234 0.9218160 0.6143912 0.6377467 1.6399588
#>  [8] 1.7878668 1.2347402 0.6550505 0.6718298 1.2287173 0.8298686 1.3550887
#> [15] 4.9236261 5.2164957 3.2093867 2.3302750 1.8260458 2.4830446 0.8393718
#> [22] 1.2472841 1.3690671 2.0406909 1.0908335 1.1052088 0.9136446 1.9538347
#> [29] 1.3420973 0.8059564 1.4021264 0.6572883
as.numeric(insight:::.get_predicted_ci_se(x))
#>  [1] 0.7610616 0.6367575 0.8177234 0.9218160 0.6143912 0.6377467 1.6399588
#>  [8] 1.7878668 1.2347402 0.6550505 0.6718298 1.2287173 0.8298686 1.3550887
#> [15] 4.9236261 5.2164957 3.2093867 2.3302750 1.8260458 2.4830446 0.8393718
#> [22] 1.2472841 1.3690671 2.0406909 1.0908335 1.1052088 0.9136446 1.9538347
#> [29] 1.3420973 0.8059564 1.4021264 0.6572883

Created on 2021-03-03 by the reprex package (v1.0.0)

strengejacke commented 3 years ago

I'm a bit scared that you're always working on random branches / versions of the packages when you work on new features or improve existing ones, because one never knows if you're actually improving something or "worsening" something that already has been improved. 😆

strengejacke commented 3 years ago

But yeah, as fans of Bayesian statistics, we of course embrace uncertainty, and this includes the uncertainty of improving / fixing packages.

DominiqueMakowski commented 3 years ago

> I'm a bit scared that you're always working on random branches / versions of the packages when you work on new features or improve existing ones, because one never knows if you're actually improving something or "worsening" something that already has been improved. 😆

Well, we used to have dev branches... but then someone suggested to drop 'em 😬

But for this one, yeah, I agree I should have opened a PR; I underestimated the complexity of it 😢 I thought it was going to be quick n' easy. That said, for now I try to make sure not to break anything (hence the _new()).

strengejacke commented 3 years ago

Well, it's not about breaking code per se; I'm totally fine if you work on master. Rather, it seems like you believe you're working on the latest master and change code based on that assumption, but actually you're not, so all your code changes are based on an out-of-date version ;-)

I just pulled from master, re-built insight and came to the above results, i.e. the SEs from predict() and from get_predicted_ci_se() are identical. I wonder why you get those discrepancies?

strengejacke commented 3 years ago

I just noticed that predict() in your example gives wrong values, so I bet you were using a "different" predict here? get_predicted_ci_se() is in line with stats::predict.glm().

strengejacke commented 3 years ago

Ok, I withdraw my suspicion that the non-matching examples stem from you working on an out-of-date master... actually, everything here works as intended:

x <- glm(vs ~ wt * mpg, data = mtcars, family = "binomial")
as.numeric(stats:::predict.lm(x, se.fit = TRUE, scale = 1)$se.fit)
#>  [1] 0.7610616 0.6367575 0.8177234 0.9218160 0.6143912 0.6377467 1.6399588
#>  [8] 1.7878668 1.2347402 0.6550505 0.6718298 1.2287173 0.8298686 1.3550887
#> [15] 4.9236261 5.2164957 3.2093867 2.3302750 1.8260458 2.4830446 0.8393718
#> [22] 1.2472841 1.3690671 2.0406909 1.0908335 1.1052088 0.9136446 1.9538347
#> [29] 1.3420973 0.8059564 1.4021264 0.6572883
as.numeric(insight:::.get_predicted_ci_se(x))
#>  [1] 0.7610616 0.6367575 0.8177234 0.9218160 0.6143912 0.6377467 1.6399588
#>  [8] 1.7878668 1.2347402 0.6550505 0.6718298 1.2287173 0.8298686 1.3550887
#> [15] 4.9236261 5.2164957 3.2093867 2.3302750 1.8260458 2.4830446 0.8393718
#> [22] 1.2472841 1.3690671 2.0406909 1.0908335 1.1052088 0.9136446 1.9538347
#> [29] 1.3420973 0.8059564 1.4021264 0.6572883

as.numeric(stats:::predict.lm(x, se.fit = TRUE)$se.fit)
#>  [1] 0.6783241 0.5675334 0.7288259 0.8216023 0.5475987 0.5684152 1.4616734
#>  [8] 1.5935019 1.1005075 0.5838378 0.5987930 1.0951394 0.7396509 1.2077725
#> [15] 4.3883624 4.6493932 2.8604836 2.0769431 1.6275303 2.2131046 0.7481209
#> [22] 1.1116877 1.2202313 1.8188407 0.9722454 0.9850579 0.8143193 1.7414268
#> [29] 1.1961934 0.7183382 1.2496966 0.5858323
as.numeric(insight:::.get_predicted_ci_se(x))
#>  [1] 0.7610616 0.6367575 0.8177234 0.9218160 0.6143912 0.6377467 1.6399588
#>  [8] 1.7878668 1.2347402 0.6550505 0.6718298 1.2287173 0.8298686 1.3550887
#> [15] 4.9236261 5.2164957 3.2093867 2.3302750 1.8260458 2.4830446 0.8393718
#> [22] 1.2472841 1.3690671 2.0406909 1.0908335 1.1052088 0.9136446 1.9538347
#> [29] 1.3420973 0.8059564 1.4021264 0.6572883

Created on 2021-03-03 by the reprex package (v1.0.0)

Note the scale argument: it is NULL by default for linear models, but 1 when predict.lm() is called internally for a GLM. So we have no discrepancies here at all.
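
A quick numeric check of that (added for illustration, not from the original reprex): with scale = NULL, predict.lm() rescales the SEs by a residual scale estimated from the fit, so the two SE vectors above differ only by a constant factor.

x <- glm(vs ~ wt * mpg, data = mtcars, family = "binomial")
# scale = NULL: SEs are multiplied by an estimated residual standard
# deviation; scale = 1 matches what predict.glm() uses for a binomial fit,
# where the dispersion is fixed at 1.
se_default <- as.numeric(stats:::predict.lm(x, se.fit = TRUE)$se.fit)
se_scale1  <- as.numeric(stats:::predict.lm(x, se.fit = TRUE, scale = 1)$se.fit)
unique(round(se_scale1 / se_default, 6))  # a single constant ratio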

DominiqueMakowski commented 3 years ago

Right, I see! My bad.

DominiqueMakowski commented 3 years ago

The differences I was seeing were in fact present for the transformed SEs, but that should be fixed now.
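
For reference, a minimal sketch of such a transformation (an illustration of the idea, not the exact insight code): link-scale SEs can be carried to the response scale with the delta method, via the family's mu.eta() derivative.

x <- glm(vs ~ wt, data = mtcars, family = "binomial")
p <- predict(x, se.fit = TRUE)  # fit and SE on the link scale
# Delta method: SE(response) is approximately SE(link) * |d mu / d eta|
se_response <- p$se.fit * abs(x$family$mu.eta(p$fit))
head(se_response)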