edbonneville opened this issue 1 year ago
Thanks for raising these points, @edbonneville!

The reason for using `plan()` inside the function is to make the use of `futuremice()` as easy as possible for those who are not too familiar with R, but I agree that it has some serious downsides in terms of flexibility. In my opinion, the best compromise would be one in which the inexperienced R user can fall back on a simple default parallel implementation that does not require any additional specifications, whereas a more advanced R user has the freedom to customize the entire process. One option might be to use the user-specified `future::plan()` if such a plan is specified by the user, and to leave it as is if nothing is specified (i.e., if the plan is just `sequential()`), using something like the following:
```r
oplan <- future::plan()
if (inherits(oplan, "sequential")) {
  future::plan("multisession")
}
# Restore the user's original plan when the function exits
on.exit(future::plan(oplan), add = TRUE)
```
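For readers less familiar with the `on.exit()` save-and-restore idiom used in the snippet above, here is a minimal, self-contained illustration using a base-R option instead of a future plan (`with_digits` is a made-up helper, not part of mice or future):

```r
# Save a global setting, change it temporarily, and guarantee restoration
# on exit -- the same pattern as saving/restoring future::plan() above.
with_digits <- function(x) {
  oopt <- options(digits = 3)        # save the old value while setting the new one
  on.exit(options(oopt), add = TRUE) # restore even if an error occurs
  format(x)
}

with_digits(pi)      # "3.14" (formatted with the temporary setting)
getOption("digits")  # 7: the global option is untouched afterwards
```

Because the restore is registered with `on.exit(..., add = TRUE)`, it also runs when the body errors, which is exactly the guarantee we want for the user's plan.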
However, this will not solve the second point. I again agree that reproducibility is important, but I think that this will remain an issue regardless of how we adapt `futuremice()`. Consider the following example, which yields different results depending on the future strategy (which implies similar differences between `mice()` and `futuremice()`):
```r
set.seed(123)
purrr::map_dbl(1:10, ~ rnorm(1))
#> [1] -0.56047565 -0.23017749  1.55870831  0.07050839  0.12928774  1.71506499
#> [7]  0.46091621 -1.26506123 -0.68685285 -0.44566197

set.seed(123)
future::plan("multisession")
furrr::future_map_dbl(1:10, ~ rnorm(1),
  .options = furrr::furrr_options(seed = TRUE)
)
#> [1] -2.03403746 -0.37386282  0.21503119 -1.17845115 -1.45315445 -2.06945243
#> [7]  1.75945588 -0.06722657  0.92327056 -0.56712447
```
Created on 2023-05-19 with reprex v2.0.2
I haven't done much research into whether it is possible to obtain exactly reproducible results regardless of the future strategy, so if you know a way to mitigate this issue, I would be very interested in implementing it! Given this, and because `futuremice()` was explicitly developed to improve the speed of `mice()` as much as possible, I do not see much merit in the `m = 1` approach. But if there is an alternative approach that is (almost) as fast as the current implementation and yields reproducible results regardless of the cluster specification, I would love to implement it.
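For what it's worth, the mechanism that `future.seed = TRUE` builds on can be demonstrated with base R alone: if each task carries its own pre-generated L'Ecuyer-CMRG stream, its draws no longer depend on how the tasks are divided over workers or in what order they run. A minimal sketch (not mice code; `draw` is a made-up helper):

```r
library(parallel)

# Pre-generate one L'Ecuyer-CMRG stream per task (here, per random draw);
# this is the mechanism that future.seed = TRUE builds on.
RNGkind("L'Ecuyer-CMRG")
set.seed(123)
streams <- vector("list", 10)
s <- .Random.seed
for (i in seq_along(streams)) {
  streams[[i]] <- s
  s <- nextRNGStream(s)
}

# Made-up helper: each task installs its own stream before drawing,
# so the result depends only on the stream, not on execution order
# or on which worker runs it.
draw <- function(stream) {
  assign(".Random.seed", stream, envir = globalenv())
  rnorm(1)
}

res_forward  <- vapply(streams, draw, numeric(1))
res_backward <- rev(vapply(rev(streams), draw, numeric(1)))
identical(res_forward, res_backward)  # TRUE: execution order is irrelevant
```

The catch for `futuremice()` is that reproducibility then requires one stream per imputation (not per worker), which is what the `m = 1` discussion below is about.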
Thanks for the detailed answer, @thomvolker! I agree your suggestion regarding the use of `plan()` would be a nice compromise.

I do not think your example on future strategies is a fair comparison, though, since `furrr::future_map_dbl()` with `plan(sequential)` is not the same as `purrr::map_dbl()`. Instead, you should be comparing:
```r
set.seed(123)
future::plan("sequential")
furrr::future_map_dbl(
  1:10,
  ~ rnorm(1),
  .options = furrr::furrr_options(seed = TRUE)
)
#> Warning: package 'purrr' was built under R version 4.2.2
#> [1] -2.03403746 -0.37386282  0.21503119 -1.17845115 -1.45315445 -2.06945243
#> [7]  1.75945588 -0.06722657  0.92327056 -0.56712447

set.seed(123)
future::plan("multisession")
furrr::future_map_dbl(
  1:10,
  ~ rnorm(1),
  .options = furrr::furrr_options(seed = TRUE)
)
#> [1] -2.03403746 -0.37386282  0.21503119 -1.17845115 -1.45315445 -2.06945243
#> [7]  1.75945588 -0.06722657  0.92327056 -0.56712447
```
.. which does yield the same results. If we take the {mice} analogy, I actually do not think it is an issue that `futuremice(...)` with `plan(sequential)` and `mice(...)` do not yield the exact same imputations. At the end of the day, my guess is that a user will have either `futuremice(...)` or `mice(...)` in their script, but not both, so we just need to try to reproduce whichever one is used.
> I haven't done much research into whether it is possible to obtain exactly reproducible results regardless of the future strategy, so if you know a way to mitigate this issue I would be very interested to implement it! Given this, and because `futuremice()` was explicitly developed to improve the speed of `mice()` as much as possible, I do not see much merit in the `m = 1` approach, but if there is an alternative approach that is (almost) as fast as the current implementation and yields reproducible results regardless of the cluster specification, I would love to implement it.
We would need to contact the maintainer of the {future} package to be sure, but I do not think there is a way with the current implementation to obtain reproducible results regardless of the plan used. However, I ran some extra tests of my suggestion to speed up the `m = 1` approach:

> b) you could speed it up further by for example calling `mice()` a first time to perform the necessary checks, and then only call a function which would boil down to running `mice()` from here, which uses the defined predictorMatrix/maxit/method/etc from the first call.

.. and it turns out this is competitive with `futuremice()`. Here is a reproducible example:
```r
# Globals
library(mice)
library(future)
library(microbenchmark)
library(ggplot2)

# Proposed function
futuremice3 <- function(data,
                        m,
                        # data.init + printFlag added here since not part of mids list
                        data.init = NULL,
                        printFlag = FALSE,
                        ...) {
  # Run initial mice() to run all checks
  mids.init <- mice::mice(
    data,
    m = 1,
    printFlag = printFlag,
    data.init = data.init,
    ...
  )

  # Keep desired m separate, otherwise it will be overwritten (turned to
  # m = 1) by the list2env() call below
  total_m <- m

  # Populate the function environment with the elements of the mids list
  # a single time
  list2env(x = mids.init, envir = environment())
  maxit <- iteration

  # Closure which makes use of the above variables.
  # It is the mice() function, but starting from this line:
  # https://github.com/amices/mice/blob/ea6d83453b39f489c4991662565dc3c8d32cfd4b/R/mice.R#LL428C5-L428C5
  add.extra.imputations <- function(m, ...) {
    # Data frame for storing the event log
    state <- list(it = 0, im = 0, dep = "", meth = "", log = FALSE)
    loggedEvents <- data.frame(
      it = 0, im = 0, dep = "", meth = "", out = ""
    )

    # Initialize imputations
    nmis <- apply(is.na(data), 2, sum)
    imp <- mice:::initialize.imp(
      data, m, ignore, where, blocks, visitSequence,
      method, nmis, data.init
    )

    # ... and iterate
    from <- 1
    to <- from + maxit - 1
    q <- mice:::sampler(
      data, m, ignore, where, imp, blocks, method,
      visitSequence, predictorMatrix, formulas, blots,
      post, c(from, to), printFlag, ...
    )

    if (!state$log) loggedEvents <- NULL
    if (state$log) row.names(loggedEvents) <- seq_len(nrow(loggedEvents))

    # Save and return
    midsobj <- list(
      data = data,
      imp = q$imp,
      m = m,
      where = where,
      blocks = blocks,
      call = call,
      nmis = nmis,
      method = method,
      predictorMatrix = predictorMatrix,
      visitSequence = visitSequence,
      formulas = formulas,
      post = post,
      blots = blots,
      ignore = ignore,
      seed = seed,
      iteration = q$iteration,
      lastSeedValue = get(".Random.seed",
        envir = globalenv(), mode = "integer",
        inherits = FALSE
      ),
      chainMean = q$chainMean,
      chainVar = q$chainVar,
      loggedEvents = loggedEvents,
      version = packageVersion("mice"),
      date = Sys.Date()
    )
    oldClass(midsobj) <- "mids"

    if (!is.null(midsobj$loggedEvents)) {
      warning("Number of logged events: ", nrow(midsobj$loggedEvents),
        call. = FALSE
      )
    }
    midsobj
  }

  # Same procedure as futuremice2(), but now much faster
  mids_list <- future.apply::future_lapply(
    X = seq_len(total_m - 1),
    FUN = function(x) add.extra.imputations(m = 1, ...),
    future.seed = TRUE
  )
  imp <- Reduce(f = mice::ibind, x = c(list(mids.init), mids_list))
  return(imp)
}
```
```r
# Generate data as in the futuremice vignette here:
# https://github.com/gerkovink/miceVignettes/blob/ec5235007366b7b9b56723154d7e9475da5f8c43/futuremice/Vignette_futuremice.Rmd#L123
set.seed(20230520)
large_covmat <- diag(8)
large_covmat[large_covmat == 0] <- 0.5
large_data <- MASS::mvrnorm(
  n = 10000,
  mu = c(0, 0, 0, 0, 0, 0, 0, 0),
  Sigma = large_covmat
)
large_data_with_missings <- ampute(large_data, prop = 0.8, mech = "MCAR")$amp
head(large_data_with_missings)
#>           V1          V2         V3         V4          V5         V6
#> 1 -1.5119431 -1.71881533 -1.7187446 -0.4127288 -1.89156861 -0.7399172
#> 2  0.2485728  0.04659958 -0.9330806  1.5763965  1.39634609         NA
#> 3  0.6672462 -1.14396682         NA -0.8888281 -0.06917341  0.2223045
#> 4 -0.2795157 -1.44177321  0.7633885         NA -0.32258739 -0.3794018
#> 5  0.6114431 -0.57473468 -0.1585314 -0.1170578  0.74479866  0.3170481
#> 6 -0.1955333  0.70432678         NA  0.8053712 -0.17290571  0.1633985
#>           V7          V8
#> 1 -0.6165913 -2.59207333
#> 2  0.2685161 -0.32164220
#> 3 -0.7292466  0.40230824
#> 4 -0.2542349 -1.02424692
#> 5 -0.5111598          NA
#> 6  0.7281126  0.06050003

cores <- 3L
m <- 50

# Will take a good 10 minutes
mb <- microbenchmark(
  "futuremice" = futuremice(large_data_with_missings, m = m, n.core = cores),
  "futuremice3" = {
    plan(multisession, workers = cores)
    futuremice3(large_data_with_missings, m = m)
  },
  "sequential" = mice(large_data_with_missings, m = m, printFlag = FALSE),
  times = 10
)
mb
#> Unit: seconds
#>         expr      min       lq     mean   median       uq      max neval cld
#>   futuremice 12.78954 12.86768 12.97507 12.90699 13.04418 13.38369    10   a
#>  futuremice3 12.72041 12.78315 13.31132 12.99573 13.69118 14.86115    10   a
#>   sequential 19.80765 19.94718 20.17635 20.15332 20.43851 20.55581    10   b

# autoplot(mb)
```
Definitely worth doing some extra checking, but this would solve the reproducibility issue without sacrificing speed (at least in the above example with large-ish data). And of course it is not very clean that I had to copy code directly from `mice()`, but it is a first try..
Thanks for another good contribution!

You're right, I should have compared with a sequential plan. I also like your suggestion, but I would need to check that we don't lose any mice functionality in `futuremice()` with this set-up. My ideal solution would be to simply parallelize the sampler within `mice()` itself, so that there is no duplicate code whatsoever: something that relies on a `parallel = TRUE` argument in `mice()`, with the possibility of setting your own plan and everything. But that requires some more thinking first!
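For what it's worth, the plan-independent reproducibility part of that ideal can be sketched generically with `future.apply`, without touching mice internals (everything here is hypothetical: `par_impute` and `run_one` are stand-ins for the sampler, not mice functions):

```r
library(future.apply)

# Hypothetical helper: run one task per imputation. The user-set plan
# governs execution, and future.seed makes the results identical under
# any plan or worker count.
par_impute <- function(m, run_one, seed = 123) {
  future_lapply(seq_len(m), run_one, future.seed = seed)
}

run_one <- function(i) rnorm(3)  # stand-in for one pass of the sampler

future::plan("sequential")
a <- par_impute(5, run_one)

future::plan("multisession", workers = 2)
b <- par_impute(5, run_one)

identical(a, b)  # TRUE: same draws regardless of the plan
```

The key design choice is seeding per imputation rather than per worker, which is what decouples the results from `n.core`.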
Dear {mice} team,

Thank you for your continued work on this package!

I had a couple of suggestions/questions regarding the `futuremice()` function. The first is on how the `plan()` is currently set within the function, which is not recommended as per one of the {future} vignettes. If someone is using `futuremice()` because they want to make use of parallel processing anyway, would it not be more flexible to let them set their own `plan()`? Multiple arguments (e.g. `future.plan`, `use.logical`, `n.core`, `parallelseed`) would then no longer be necessary, and it would allow advanced users to specify nested futures (e.g. when you want to parallelize imputations within parallelized simulation replications).

The second (related) point is that I think one of the major advantages of using the future framework is that your results can be reproducible regardless of how another user decides to parallelize the same analysis (e.g. local computer vs. computing cluster, differing numbers of cores, etc.). In the current implementation, `futuremice()` will not be able to reproduce the same set of imputations when only the number of cores differs (same dataset, number of imputations, and seed). For example:

This is a side effect of `n.core` dictating how the `m` imputations are spread out across the cores. A way to make the imputations reproducible regardless of `n.core` might look something like this:

Of course, if you do some benchmarks, `futuremice2()` will be slower than `futuremice()`, since the former calls `mice()` a whole `m` times while the latter does so only `n.core` times. Nevertheless: a) `futuremice2()` will still outperform sequential `mice()` (depending on the size of the dataset and the number of imputations); b) you could speed it up further by, for example, calling `mice()` a first time to perform the necessary checks, and then only calling a function which would boil down to running `mice()` from here, using the defined `predictorMatrix`/`maxit`/`method`/etc. from the first call.

So my main point: the current implementation does optimize computational resources nicely, but not reproducibility, so something like `futuremice2()` could be a compromise. What do you think? I guess two arguments for keeping the current implementation are that a) `futuremice()` should focus on optimizing speed; b) we should anyway be using a large enough number of imputed datasets such that our results do not differ much between independent imputation rounds.. :sweat_smile:

Thanks!
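The `futuremice2()` code itself is not shown above, but based on the description (one `mice()` call per imputation, combined afterwards), a hedged sketch might look like the following (`futuremice2_sketch` is my made-up name, not code from this issue):

```r
# Hedged sketch of the futuremice2() idea: call mice() once per imputation
# (m = 1), let future.seed assign one RNG stream per imputation so the
# results do not depend on n.core, then combine the single-imputation
# objects with ibind().
futuremice2_sketch <- function(data, m, ...) {
  mids_list <- future.apply::future_lapply(
    X = seq_len(m),
    FUN = function(i) mice::mice(data, m = 1, printFlag = FALSE, ...),
    future.seed = TRUE
  )
  Reduce(mice::ibind, mids_list)
}
```

Under a sequential plan this behaves like a plain loop; under any parallel plan the per-imputation seeding should keep the combined `mids` object the same, at the cost of repeating the input checks `m` times.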
Session info