amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

pool.scalar() returns wrong df (NaN) whenever n = Inf #441

Closed huftis closed 2 years ago

huftis commented 2 years ago

According to the documentation, pool.scalar() assumes an infinite sample (n = Inf) by default. The actual behaviour does not match: with the default, the returned degrees of freedom are NaN. Example:

library(mice)
pool.scalar(13:17, 3:7)$df
#> [1] NaN

The expected result would be approx. the df one gets when one uses a very large n, e.g.:

pool.scalar(13:17, 3:7, n = 10^6)$df
#> [1] 28.44315

The bug is caused by the barnard.rubin() function (which pool.scalar() uses internally):

barnard.rubin <- function(m, b, t, dfcom = 999999) {
  lambda <- (1 + 1 / m) * b / t
  lambda[lambda < 1e-04] <- 1e-04
  dfold <- (m - 1) / lambda^2
  dfobs <- (dfcom + 1) / (dfcom + 3) * dfcom * (1 - lambda)
  dfold * dfobs / (dfold + dfobs)
}

When dfcom = Inf, (dfcom + 1) / (dfcom + 3) in the dfobs <- line equals Inf/Inf, which is NaN (not 1), and it is still NaN when multiplied by dfcom * (1 - lambda). It should instead be Inf.
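The arithmetic can be checked in plain R without loading mice. This is only a small sketch of the step described above; the value lambda = 0.375 is an assumption, obtained by applying the formula in barnard.rubin() to the example data under the usual pooling rules (m = 5, b = 2.5, t = 8).

```r
dfcom <- Inf

# Inf / Inf is undefined in floating-point arithmetic, so R returns NaN, not 1
ratio <- (dfcom + 1) / (dfcom + 3)
is.nan(ratio)

# NaN propagates through the rest of the expression,
# so dfobs ends up NaN instead of Inf
lambda <- 0.375  # assumed value for the example above
dfobs <- ratio * dfcom * (1 - lambda)
is.nan(dfobs)
```

Both is.nan() calls return TRUE, and the NaN then carries through the final line of barnard.rubin().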

Since the factor dfobs / (dfold + dfobs) in the last line is 1 whenever dfobs is Inf, the correct behaviour would be to just output dfold whenever dfcom is Inf (and perhaps the default value dfcom = 999999 should be changed to dfcom = Inf). For the above example, the resulting value is (exactly) 28.44444…, which is in line with what you get with the large value n = 10^6 (28.44315).
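One possible way to express this suggestion is sketched below. It is not the change that was made in mice; the function name barnard_rubin_sketch is hypothetical, and the sketch assumes dfcom is a single value. The inputs b and t are computed by hand under the assumption that pooling uses b = var(Q) and t = mean(U) + (1 + 1/m) * b.

```r
# Hypothetical variant of barnard.rubin(); not the function shipped in mice.
# Assumes dfcom is a scalar.
barnard_rubin_sketch <- function(m, b, t, dfcom = Inf) {
  lambda <- (1 + 1 / m) * b / t
  lambda[lambda < 1e-04] <- 1e-04
  dfold <- (m - 1) / lambda^2
  # With infinite complete-data df, dfobs is Inf and
  # dfobs / (dfold + dfobs) tends to 1, so the result is just dfold
  if (is.infinite(dfcom)) {
    return(dfold)
  }
  dfobs <- (dfcom + 1) / (dfcom + 3) * dfcom * (1 - lambda)
  dfold * dfobs / (dfold + dfobs)
}

# Inputs for the example pool.scalar(13:17, 3:7), computed by hand
m <- 5
b <- var(13:17)                   # between-imputation variance
t <- mean(3:7) + (1 + 1 / m) * b  # total variance

df_inf <- barnard_rubin_sketch(m, b, t)                  # dfcom = Inf
df_big <- barnard_rubin_sketch(m, b, t, dfcom = 999999)  # large finite dfcom
```

Under these assumptions df_inf should be the 28.44444… quoted above and df_big the 28.44315 obtained with n = 10^6.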

Summary:

stefvanbuuren commented 2 years ago

Thanks for alerting.

As statisticians we may be inclined to think that infinity starts at 1000. You showed that's not quite true. :-)

Now repaired.