amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0
433 stars 107 forks source link

Preserve the stochastic nature of mice results (#426) #459

Closed GBA19 closed 1 year ago

GBA19 commented 2 years ago

The change requested in #426 and implemented in #432 has the side-effect of having mice return the exactly same set of imputed values on each instantiation of the function.

The following code example illustrates this (demonstrated with the mean for simplicity):

replicate(5, {imp <- mice(nhanes, m= 1, printFlag = FALSE); complete(imp)$bmi |> mean()} ) # [1] 27.144 27.144 27.144 27.144 27.144

Given the stochastic nature of multiple imputation this seems unreasonable. This may break some existing scripts---it certainly broke one of mine.

There is a simple workaround, adding a runif(1), but it is preferable to have mice work reasonably 'out of the box':

replicate(5, {runif(1); imp <- mice(nhanes, m= 1, printFlag = FALSE, seed= 12345); complete(imp)$bmi |> mean()} ) # [1] 27.032 26.984 26.900 26.900 26.900

If indeed someone needs exact replications on different instantiations, the present behavior in 3.14.0, the solution is as simple as the following:

replicate(5, {runif(1); imp <- mice(nhanes, m= 1, printFlag = FALSE, seed= 12345); complete(imp)$bmi |> mean()} ) # [1] 26.408 26.408 26.408 26.408 26.408

Hopefully, this is something you are able to solve.

"R version 4.1.2 (2021-11-01)" Package Version "mice" "3.14.0" "withr" "2.4.3"

stefvanbuuren commented 2 years ago

Thanks for noting. I wasn't aware of the repetitive nature of the imputes in this case.

If I understand correctly, the problem occurs when no seed argument is given to mice(...) . In that case, withr::local_preserve_seed() initialises from the parent environment. If nothing has changed there, then we will get the same seed over and over again.

Perhaps a simple fix would be to add kick <- runif(1L) in the mice() function, just to kick the random generator, e.g.,

  # set local seed, reset random state generator after function aborts
  if (is.na(seed)) {
    kick <- runif(1L)
    withr::local_preserve_seed()
  } else {
    withr::local_seed(seed)
  }

Would that solve your case? Could be any unfortunate side effects?

stefvanbuuren commented 2 years ago

An added note. My simulations usually set the seed explicitly, e.g.

for (i in 1:5) {
  imp <- mice(nhanes, m = 1, printFlag = FALSE, seed = i)
  cat(complete(imp)$bmi |> mean(), "\n")
}

This is perhaps a bit safer.

GBA19 commented 2 years ago

I was looking into this just now and thougth that the following might be the solution. Forgive me if I am totally wrong since I do not fully understand the withr::local_preserve_seed(). Nevertheless, I am just returning from the man-page.

if (is.na(seed)) {
    withr::local_preserve_seed() # restores .Random.seed on exiting scope
    set.seed(NULL) # reinitialize .Random.seed
  } else {
    withr::local_seed(seed)
  }

If this is totally wrong, so be it; if not, this appears to be more elegant.

stefvanbuuren commented 2 years ago

Yes thanks, that is better as that will restore .Random.seed after returning from mice().

I just pushed an update.

Thanks a lot for your contribution.

thomvolker commented 2 years ago

Pretty late to the party, but this new implementation also results in unexpected, and potentially unwanted, behavior. Namely, setting a seed in the global environment is reversed within the mice call, such that the results are always unique if no seed is specified within the mice() call. Setting a seed in the global environment thus no longer ensures reproducibility of the results, which is something that I think would be preferable over the current solution.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
library(magrittr)

set.seed(123)

imp1 <- mice(boys, print = F)

imp1 %$%
  lm(hgt ~ bmi + tv) %>%
  pool() %>%
  summary()
#>          term  estimate std.error statistic        df      p.value
#> 1 (Intercept) 45.750588 7.8251969  5.846573 173.68503 2.431348e-08
#> 2         bmi  2.990458 0.4926343  6.070341  92.33349 2.795360e-08
#> 3          tv  3.738712 0.2099027 17.811641  26.77784 2.220446e-16

set.seed(123)

imp2 <- mice(boys, print = F)

imp2 %$%
  lm(hgt ~ bmi + tv) %>%
  pool() %>%
  summary()
#>          term  estimate  std.error statistic       df      p.value
#> 1 (Intercept) 42.567502 10.7210055  3.970477 12.82719 1.638943e-03
#> 2         bmi  3.165119  0.6533964  4.844103 12.59753 3.503675e-04
#> 3          tv  3.746043  0.2230530 16.794411 19.05428 7.056578e-13

Created on 2022-07-08 by the reprex package (v2.0.1)

I am actually unsure whether the local_preserve_seed() argument is something you would want. What is wrong with the fact that a call to mice adjust the state of the random number generator, as it actually uses this randomness?

stefvanbuuren commented 2 years ago

Ah... what a can of worms this is...

We have in mice.R:

  # set local seed, reset random state generator after function aborts
  if (is.na(seed)) {
    # restores .Random.seed on exiting scope
    withr::local_preserve_seed()
    # reinitialize .Random.seed
    set.seed(NULL)
  } else {
    # take specified seed to set local seed
    withr::local_seed(seed)
  }

We probably need some alternative to set.seed(NULL)

GBA19 commented 2 years ago

The function umap::umap has preserve.seed as a parameter, also discussed in a vignette.

I see three usage scenarios, always assuming the default seed = NA on the mice function call:

  1. Repeatedly calling mice without changing the seed in a parent or ancestor environment: We would expect a randomly different result on each call.
  2. Setting the seed in a parent or ancestor environment: We would expect the sequence of seeds in both mice and a parent or ancestor environment to replicate exactly.
  3. Preserve the global seed despite changes within mice as requested in #426: I do not fully grasp the importance of this scenario.

As I understand the present state mice fulfills items 1 and 3 but fails on 2.

Maybe a preserve.seed= FALSE as a default parameter is a way to fulfill 1 and 2 and setting preserve.seed= TRUE could fulfill item 3. This would entail restoring the previous behavior and make item 3 a special case.

Please note: I do not feel strongly about this at all, just naively trying to be of some use.

thomvolker commented 2 years ago

I am no expert on random seeds, but my intuition says that these three aspects are incompatible, because if after every mice() you return to the stage of the random number generator before this call, any attempt to reproducibility (specified in the global environment) will likely lead to identical results over different calls.

Similarly to @GBA19, I do not really understand why you would like to preserve the global seed when you called mice. To me, this seems a bit like asking rnorm() to not affect the state of the random number generator. Still, there is apparently some interest in it, and a preserve.seed parameter might provide a solution.

stefvanbuuren commented 1 year ago

Since the local seed gave various unforeseen problems, mice 3.14.12 reverted seed setting behaviour to mice 3.13.10. Local seeds are gone. Trust this solves the issue.