amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0
447 stars 108 forks source link

On futuremice() and reproducibility #557

Open edbonneville opened 1 year ago

edbonneville commented 1 year ago

Dear {mice} team,

Thank you for your continued work on this package!

I had a couple of suggestions/questions regarding the futuremice() function. The first is on how the plan() is currently set within the function, which not recommended as per one of the {future} vignettes. If someone is using futuremice() because they want to make use of parallel processing anyway, would it not be more flexible to let them set their own plan()? Multiple arguments (e.g. future.plan, use.logical, n.core, parallelseed) would then no longer be necessary, and would allow advanced users to specify nested futures (e.g. when you want to parallelize imputations within parallelized simulation replications).

The second (related) point is that I think one of the major advantages of using this future framework is that your results can be reproducible regardless of how another user decides to parallelize the same analysis (e.g. local computer vs. computing cluster, differing numbers of cores etc). In the current implementation, futuremice() will not be able to reproduce the same set of imputations when only the number of cores differs (same dataset, number of imputations, and seed). For example:

# Globals
library(mice)
library(future)
dat <- mice::nhanes2
m <- 5

# Same seed, same m, but different n.core -> different results
set.seed(123)
with(futuremice(dat, m = m, n.core = 2), lm(chl ~ bmi)) |> pool()
#> Class: mipo    m = 5 
#>          term m   estimate        ubar           b           t dfcom        df
#> 1 (Intercept) 5 104.141215 3315.440261 1541.724538 5165.509707    23  9.482795
#> 2         bmi 5   3.407003    4.754373    1.771861    6.880607    23 10.864951
#>         riv    lambda       fmi
#> 1 0.5580162 0.3581582 0.4609944
#> 2 0.4472164 0.3090183 0.4086915

set.seed(123)
with(futuremice(dat, m = m, n.core = 3), lm(chl ~ bmi)) |> pool()
#> Class: mipo    m = 5 
#>          term m   estimate        ubar           b           t dfcom       df
#> 1 (Intercept) 5 109.612876 3135.720812 1783.198371 5275.558857    23 8.307415
#> 2         bmi 5   3.088449    4.326505    2.182976    6.946076    23 8.994692
#>         riv    lambda       fmi
#> 1 0.6824071 0.4056135 0.5107457
#> 2 0.6054705 0.3771296 0.4809873

This is a side effect of n.core dictating how the m imputations are spread out across the cores. A way to make the imputations reproducible regardless of n.core might look something like this:

# Tweaked futuremice()
library(future.apply) # since fewer dependencies than {furrr}
futuremice2 <- function(data, m, ...) {
  mids_list <- future_lapply(
    X = seq_len(m),
    FUN = function(x) mice::mice(data = data, m = 1, ...),
    future.seed = TRUE
  )
  imp <- Reduce(f = mice::ibind, x = mids_list)
  return(imp)
}

# Same seed, same m, but different n.core -> same results
set.seed(123)
plan(multisession, workers = 2)
with(futuremice2(dat, m = m, print = FALSE), lm(chl ~ bmi)) |> pool()
#> Class: mipo    m = 5 
#>          term m   estimate       ubar           b           t dfcom       df
#> 1 (Intercept) 5 107.820027 2789.44692 673.0012948 3597.048470    23 13.63506
#> 2         bmi 5   3.283595    3.87059   0.7406842    4.759411    23 15.00680
#>         riv    lambda       fmi
#> 1 0.2895203 0.2245178 0.3177525
#> 2 0.2296345 0.1867502 0.2770772

set.seed(123)
plan(multisession, workers = 3)
with(futuremice2(dat, m = m, print = FALSE), lm(chl ~ bmi)) |> pool()
#> Class: mipo    m = 5 
#>          term m   estimate       ubar           b           t dfcom       df
#> 1 (Intercept) 5 107.820027 2789.44692 673.0012948 3597.048470    23 13.63506
#> 2         bmi 5   3.283595    3.87059   0.7406842    4.759411    23 15.00680
#>         riv    lambda       fmi
#> 1 0.2895203 0.2245178 0.3177525
#> 2 0.2296345 0.1867502 0.2770772

# Reset back to sequential
plan(sequential)

Of course if you do some benchmarks, futuremice2() will be slower than futuremice(), since the former calls mice() a whole m times while the latter only n.core times. Nevertheless, a) futuremice2() will still outperform sequential mice() (depending on size of dataset and number of imputations); b) you could speed it up further by for example calling mice() a first time to perform the necessary checks, and then only call a function which would boil down to running mice() from here, which uses the defined predictorMatrix/maxit/method/etc from the first call.

So my main point: the current implementation does optimize computational resources nicely, but not reproducibility - so something like futuremice2() could be a compromise. What do you think? I guess two arguments for keeping the current implementation is that a) futuremice() should focus on optimizing speed; b) we should anyway be using a large enough number of imputed datasets such that our results do not differ much between independent imputation rounds.. :sweat_smile:

Thanks!

Session info

devtools::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Dutch_Netherlands.utf8
#>  ctype    Dutch_Netherlands.utf8
#>  tz       Europe/Berlin
#>  date     2023-05-19
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.2.1)
#>  backports      1.4.1   2021-12-13 [1] CRAN (R 4.2.0)
#>  broom          1.0.1   2022-08-29 [1] CRAN (R 4.2.1)
#>  cachem         1.0.6   2021-08-19 [1] CRAN (R 4.2.1)
#>  callr          3.7.3   2022-11-02 [1] CRAN (R 4.2.2)
#>  cli            3.4.1   2022-09-23 [1] CRAN (R 4.2.2)
#>  codetools      0.2-18  2020-11-04 [2] CRAN (R 4.2.1)
#>  crayon         1.5.2   2022-09-29 [1] CRAN (R 4.2.2)
#>  DBI            1.1.3   2022-06-18 [1] CRAN (R 4.2.1)
#>  devtools       2.4.5   2022-10-11 [1] CRAN (R 4.2.2)
#>  digest         0.6.30  2022-10-18 [1] CRAN (R 4.2.2)
#>  dplyr          1.0.10  2022-09-01 [1] CRAN (R 4.2.1)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.2.1)
#>  evaluate       0.18    2022-11-07 [1] CRAN (R 4.2.2)
#>  fansi          1.0.3   2022-03-24 [1] CRAN (R 4.2.1)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.2.1)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.2.1)
#>  furrr          0.3.1   2022-08-15 [1] CRAN (R 4.2.1)
#>  future       * 1.29.0  2022-11-06 [1] CRAN (R 4.2.2)
#>  future.apply * 1.10.0  2022-11-05 [1] CRAN (R 4.2.2)
#>  generics       0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
#>  globals        0.16.1  2022-08-28 [1] CRAN (R 4.2.1)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.2.1)
#>  htmltools      0.5.3   2022-07-18 [1] CRAN (R 4.2.1)
#>  htmlwidgets    1.5.4   2021-09-08 [1] CRAN (R 4.2.1)
#>  httpuv         1.6.6   2022-09-08 [1] CRAN (R 4.2.2)
#>  knitr          1.40    2022-08-24 [1] CRAN (R 4.2.1)
#>  later          1.3.0   2021-08-18 [1] CRAN (R 4.2.1)
#>  lattice        0.20-45 2021-09-22 [2] CRAN (R 4.2.1)
#>  lifecycle      1.0.3   2022-10-07 [1] CRAN (R 4.2.2)
#>  listenv        0.8.0   2019-12-05 [1] CRAN (R 4.2.1)
#>  magrittr       2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
#>  memoise        2.0.1   2021-11-26 [1] CRAN (R 4.2.1)
#>  mice         * 3.15.0  2022-11-19 [1] CRAN (R 4.2.3)
#>  mime           0.12    2021-09-28 [1] CRAN (R 4.2.0)
#>  miniUI         0.1.1.1 2018-05-18 [1] CRAN (R 4.2.1)
#>  parallelly     1.32.1  2022-07-21 [1] CRAN (R 4.2.1)
#>  pillar         1.8.1   2022-08-19 [1] CRAN (R 4.2.1)
#>  pkgbuild       1.3.1   2021-12-20 [1] CRAN (R 4.2.1)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
#>  pkgload        1.3.2   2022-11-16 [1] CRAN (R 4.2.1)
#>  prettyunits    1.1.1   2020-01-24 [1] CRAN (R 4.2.1)
#>  processx       3.8.0   2022-10-26 [1] CRAN (R 4.2.2)
#>  profvis        0.3.7   2020-11-02 [1] CRAN (R 4.2.1)
#>  promises       1.2.0.1 2021-02-11 [1] CRAN (R 4.2.1)
#>  ps             1.7.2   2022-10-26 [1] CRAN (R 4.2.2)
#>  purrr          0.3.5   2022-10-06 [1] CRAN (R 4.2.2)
#>  R.cache        0.16.0  2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3    1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo           1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils        2.12.2  2022-11-11 [1] CRAN (R 4.2.1)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
#>  Rcpp           1.0.9   2022-07-08 [1] CRAN (R 4.2.1)
#>  remotes        2.4.2   2021-11-30 [1] CRAN (R 4.2.1)
#>  reprex         2.0.2   2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang          1.0.6   2022-09-24 [1] CRAN (R 4.2.2)
#>  rmarkdown      2.18    2022-11-09 [1] CRAN (R 4.2.2)
#>  rstudioapi     0.14    2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
#>  shiny          1.7.3   2022-10-25 [1] CRAN (R 4.2.2)
#>  stringi        1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr        1.5.0   2022-12-02 [1] CRAN (R 4.2.3)
#>  styler         1.8.1   2022-11-07 [1] CRAN (R 4.2.2)
#>  tibble         3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyr          1.2.1   2022-09-08 [1] CRAN (R 4.2.2)
#>  tidyselect     1.2.0   2022-10-10 [1] CRAN (R 4.2.2)
#>  urlchecker     1.0.1   2021-11-30 [1] CRAN (R 4.2.1)
#>  usethis        2.1.6   2022-05-25 [1] CRAN (R 4.2.1)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.2.1)
#>  vctrs          0.5.1   2022-11-16 [1] CRAN (R 4.2.1)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.2.1)
#>  xfun           0.35    2022-11-16 [1] CRAN (R 4.2.1)
#>  xtable         1.8-4   2019-04-21 [1] CRAN (R 4.2.1)
#>  yaml           2.3.6   2022-10-18 [1] CRAN (R 4.2.1)
#> 
#>  [1] C:/Users/efbonneville/software/packagesR
#>  [2] C:/Program Files/R/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
thomvolker commented 1 year ago

Thanks for raising these points, @edbonneville!

The reason for using plan() inside the function is to make the use of futuremice() as easy as possible for those who are not too familiar with R, but I agree that it has some serious downsides in terms of flexibility. In my opinion, the best compromise would be one in which the inexperienced R user can fall back on a simple default parallel implementation that does not require any additional specifications, whereas a more advanced R user has the freedom to customize the entire process. One option might be to use the user-specified future::plan() if such a plan is specified by the user, and leaving it as is if nothing is specified (i.e., if the plan is just sequential()), using something like the following.

oplan <- future::plan()
if (is(future::plan(), "sequential")) {
  plan("multisession")
}
on.exit(future::plan(oplan), add = TRUE)

However, this will not solve the second point. I again agree that reproducibility is important, but I think that this will remain an issue regardless of how we adapt futuremice(). Consider for example the following example, which yields different results depending on the future strategy (which implies similar results for differences between mice() and futuremice()):

set.seed(123)
purrr::map_dbl(1:10, ~rnorm(1))
#>  [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
#>  [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197
set.seed(123)
future::plan("multisession")
furrr::future_map_dbl(1:10, ~rnorm(1),
                      .options = furrr::furrr_options(seed = TRUE))
#>  [1] -2.03403746 -0.37386282  0.21503119 -1.17845115 -1.45315445 -2.06945243
#>  [7]  1.75945588 -0.06722657  0.92327056 -0.56712447

Created on 2023-05-19 with reprex v2.0.2

I haven't done much research into whether it is possible to obtain exactly reproducible results regardless of the future strategy, so if you know a way to mitigate this issue I would be very interested to implement it! Given this, and because futuremice() was explicitly developed to improve the speed of mice() as much as possible, I do not see much merit in the m = 1-approach, but if there is an alternative approach that is (almost) as fast as the current implementation and yields reproducible results regardless of the cluster specification, I would love to implement it.

edbonneville commented 1 year ago

Thanks for the detailed answer @thomvolker ! I would agree your suggestion regarding the use of plan() would be a nice compromise.

Your example on future strategies I do not think is fair, since furrr::future_map_dbl() with plan(sequential) is not the same as purrr::map_dbl(). Indeed, you should be comparing:

set.seed(123)
future::plan("sequential")
furrr::future_map_dbl(
  1:10,
  ~ rnorm(1),
  .options = furrr::furrr_options(seed = TRUE)
)
#> Warning: package 'purrr' was built under R version 4.2.2
#>  [1] -2.03403746 -0.37386282  0.21503119 -1.17845115 -1.45315445 -2.06945243
#>  [7]  1.75945588 -0.06722657  0.92327056 -0.56712447

set.seed(123)
future::plan("multisession")
furrr::future_map_dbl(
  1:10,
  ~ rnorm(1),
  .options = furrr::furrr_options(seed = TRUE)
)
#>  [1] -2.03403746 -0.37386282  0.21503119 -1.17845115 -1.45315445 -2.06945243
#>  [7]  1.75945588 -0.06722657  0.92327056 -0.56712447

.. which does yield the same results. If we take the {mice} analogy, I actually do not think it is an issue that futuremice(...) with plan(sequential) and mice(...) do not yield the exact same imputations. At the end of the day, my guess is a user will have either futuremice(...) or mice(...) in their script, but not both - so we just need to try and reproduce the one that is used.

I haven't done much research into whether it is possible to obtain exactly reproducible results regardless of the future strategy, so if you know a way to mitigate this issue I would be very interested to implement it! Given this, and because futuremice() was explicitly developed to improve the speed of mice() as much as possible, I do not see much merit in the m = 1-approach, but if there is an alternative approach that is (almost) as fast as the current implementation and yields reproducible results regardless of the cluster specification, I would love to implement it.

We would need to contact the maintainer of the {future} package to be sure, but I do not think there is a way with the current implementation to obtain reproducible results regardless of the plan used. However, I did some extra tests with my suggestion to speed up the m = 1 approach:

b) you could speed it up further by for example calling mice() a first time to perform the necessary checks, and then only call a function which would boil down to running mice() from here, which uses the defined predictorMatrix/maxit/method/etc from the first call.

.. and it turns out this is competitive with futuremice(). Here is a reproducible example:

# Globals
library(mice)
library(future)
library(microbenchmark)
library(ggplot2)

# Proposed function
futuremice3 <- function(data,
                        m,
                        # data.init + printFlag added here since not part of mids list
                        data.init = NULL,
                        printFlag = FALSE,
                        ...) {

  # Run initial mice() to run all checks
  mids.init <- mice::mice(
    data,
    m = 1,
    printFlag = printFlag,
    data.init = data.init,
    ...
  )
  total_m <- m # Keep desired m separate, otherwise will be overwritten in list2env call (turned to m = 1)

  # Populate the function environment with the element of the mids list a single time
  list2env(x = mids.init, envir = environment())
  maxit <- iteration

  # Closure which makes use of the above variables
  # It is the mice() function but from this line:
  # https://github.com/amices/mice/blob/ea6d83453b39f489c4991662565dc3c8d32cfd4b/R/mice.R#LL428C5-L428C5
  add.extra.imputations <- function(m, ...) {

    # data frame for storing the event log
    state <- list(it = 0, im = 0, dep = "", meth = "", log = FALSE)
    loggedEvents <- data.frame(it = 0, im = 0, dep = "", meth = "",
                               out = "")

    # initialize imputations
    nmis <- apply(is.na(data), 2, sum)
    imp <- mice:::initialize.imp(
      data, m, ignore, where, blocks, visitSequence,
      method, nmis, data.init
    )

    # and iterate...
    from <- 1
    to <- from + maxit - 1
    q <- mice:::sampler(
      data, m, ignore, where, imp, blocks, method,
      visitSequence, predictorMatrix, formulas, blots,
      post, c(from, to), printFlag, ...
    )

    if (!state$log) loggedEvents <- NULL
    if (state$log) row.names(loggedEvents) <- seq_len(nrow(loggedEvents))

    ## save, and return
    midsobj <- list(
      data = data,
      imp = q$imp,
      m = m,
      where = where,
      blocks = blocks,
      call = call,
      nmis = nmis,
      method = method,
      predictorMatrix = predictorMatrix,
      visitSequence = visitSequence,
      formulas = formulas,
      post = post,
      blots = blots,
      ignore = ignore,
      seed = seed,
      iteration = q$iteration,
      lastSeedValue = get(".Random.seed",
                          envir = globalenv(), mode = "integer",
                          inherits = FALSE
      ),
      chainMean = q$chainMean,
      chainVar = q$chainVar,
      loggedEvents = loggedEvents,
      version = packageVersion("mice"),
      date = Sys.Date()
    )
    oldClass(midsobj) <- "mids"

    if (!is.null(midsobj$loggedEvents)) {
      warning("Number of logged events: ", nrow(midsobj$loggedEvents),
              call. = FALSE
      )
    }
    midsobj
  }

  # Same procedure as futuremice2(), but now will be much faster
  mids_list <- future.apply::future_lapply(
    X = seq_len(total_m - 1),
    FUN = function(x) add.extra.imputations(m = 1, ...),
    future.seed = TRUE
  )
  imp <- Reduce(f = mice::ibind, x = c(list(mids.init), mids_list))
  return(imp)
}

# Generate data like futuremice vignette here:
# https://github.com/gerkovink/miceVignettes/blob/ec5235007366b7b9b56723154d7e9475da5f8c43/futuremice/Vignette_futuremice.Rmd#L123
set.seed(20230520)
large_covmat <- diag(8)
large_covmat[large_covmat == 0] <- 0.5
large_data <- MASS::mvrnorm(
  n = 10000,
  mu = c(0, 0, 0, 0, 0, 0, 0, 0),
  Sigma = large_covmat
)

large_data_with_missings <- ampute(large_data, prop = 0.8, mech = "MCAR")$amp
head(large_data_with_missings)
#>           V1          V2         V3         V4          V5         V6
#> 1 -1.5119431 -1.71881533 -1.7187446 -0.4127288 -1.89156861 -0.7399172
#> 2  0.2485728  0.04659958 -0.9330806  1.5763965  1.39634609         NA
#> 3  0.6672462 -1.14396682         NA -0.8888281 -0.06917341  0.2223045
#> 4 -0.2795157 -1.44177321  0.7633885         NA -0.32258739 -0.3794018
#> 5  0.6114431 -0.57473468 -0.1585314 -0.1170578  0.74479866  0.3170481
#> 6 -0.1955333  0.70432678         NA  0.8053712 -0.17290571  0.1633985
#>           V7          V8
#> 1 -0.6165913 -2.59207333
#> 2  0.2685161 -0.32164220
#> 3 -0.7292466  0.40230824
#> 4 -0.2542349 -1.02424692
#> 5 -0.5111598          NA
#> 6  0.7281126  0.06050003

cores <- 3L
m <- 50

# Will take a good 10 minutes
mb <- microbenchmark(
  "futuremice" = futuremice(large_data_with_missings, m = m, n.core = cores),
  "futuremice3" = {
    plan(multisession, workers = cores)
    futuremice3(large_data_with_missings, m = m)
  },
  "sequential" = mice(large_data_with_missings, m = m, printFlag = FALSE),
  times = 10
)

mb
#> Unit: seconds
#>         expr      min       lq     mean   median       uq      max neval cld
#>   futuremice 12.78954 12.86768 12.97507 12.90699 13.04418 13.38369    10  a 
#>  futuremice3 12.72041 12.78315 13.31132 12.99573 13.69118 14.86115    10  a 
#>   sequential 19.80765 19.94718 20.17635 20.15332 20.43851 20.55581    10   b
#autoplot(mb)

Definitely worth doing some extra checking, but this would solve the reproducibility issue without sacrificing speed (at least in the above example with large-ish data). And of course it is not very clean that I had to copy code directly from the mice(), but it is a first try..

thomvolker commented 1 year ago

Thanks for another good contribution!

You're right, I should have compared with a sequential plan. I also like your suggestion, but I would need to check whether we don't lose mice functionality in futuremice in this set-up. My ideal solution would be to simply parallelize the sampler within mice, such that there is no duplicate code whatsoever, and that relies on a "parallel = TRUE" argument in mice, and with the possibility of setting your own plan and everything, but that requires some more thinking first!