amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Add progress bar to `futuremice()` #516

Closed isaactpetersen closed 1 year ago

isaactpetersen commented 1 year ago

Imputing large data sets can take a long time, and it helps to have a sense of how long the imputation will take to complete. It would be nice to have a progress bar for imputations that use futuremice() for parallel processing. For instance, the futuremice package implements a progress bar with the following code:

# Use {progress} package for progress bar - shows diagnostics in real time
progressr::handlers("progress")

# Use `progressr::with_progress()` to show the progress bar
mids <- progressr::with_progress(future_mice(mice::nhanes))

gerkovink commented 1 year ago

Thanks for the suggestion. This would unfortunately not work in futuremice() as the optimal matching of sets over cores is parallelised. Including progressor() in futuremice() would start the progress bar at 0 and conclude it at the end of future_map(), without intermediate steps. See the reprex below.

#devtools::install_github("gerkovink/mice@progressbar")
library(mice, warn.conflicts = FALSE)
mice:::match.cluster(n.core = 7, m = 357)
#>     cores imps
#> 357     7   51
imp <- progressr::with_progress(futuremice(nhanes, m = 357, n.core = 7))
imp$m
#> [1] 357

Created on 2022-11-10 with reprex v2.0.2

This reprex depends on this temporary build of mice - see commented line in the code.
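
The missing intermediate steps can be illustrated with a few lines of base R. The helper `split_imputations()` below is hypothetical and is not the code behind mice:::match.cluster(); it only assumes that the m imputations are divided as evenly as possible over the cores, which matches the output above. Each core receives one chunk, so a progressor could signal at most once per core, and those signals would arrive at roughly the same time.

```r
# Hypothetical helper (not part of mice): divide m imputations over n.core
# cores as evenly as possible.
split_imputations <- function(m, n.core) {
  base  <- m %/% n.core   # imputations every core receives
  extra <- m %% n.core    # leftover imputations, one each for the first cores
  rep(base, n.core) + (seq_len(n.core) <= extra)
}

# Same setting as the reprex: 357 imputations over 7 cores
chunks <- split_imputations(m = 357, n.core = 7)
chunks
#> [1] 51 51 51 51 51 51 51

# One chunk per core means only this many possible progress updates
length(chunks)
#> [1] 7
```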

Including p() somewhere in mice could potentially be a solution, but it must be invoked only when futuremice() is called, and without a separate environment it may break frame dependency.

This has no urgency, as mice() itself already has progress printing.

gerkovink commented 1 year ago

Closing, as adding a progress bar to futuremice() would currently not be informative.

isaactpetersen commented 1 year ago

I understand if a progress bar isn't a high priority for development. However, such a feature would be very helpful to users with large data sets or complicated models. A key reason to use parallel processing is that mice() would take too long to run. A progress bar would give users a sense of (a) approximately how much longer the imputation model will take to run, and of whether slowness is due to (b) the model being slow to run or (c) the model becoming stuck during an iteration (which happens not infrequently). The fact that other implementations that parallelize mice have progress bars (for example, see the futuremice package) provides clear evidence that people find such a feature useful.

gerkovink commented 1 year ago

It's not that I don't want to; it's that a progress bar is not informative in the current implementation of futuremice(). If you want a progress bar, feel free to install the mice version I linked above. It allows for a progress bar with futuremice(). I was unaware of the futuremice package and of futuremice::future_mice(), and I don't know how it is implemented. That said, I believe that the mice implementation is the fastest, as the imputations are optimised over the allowed or available cores. The by-product of that rationale is that the m imputations are divided equally over cores, so a progressr::with_progress() call has only two resulting steps: start and finish.

To demonstrate why I would gladly forgo the informative progress bar implementation in favour of the speedy implementation:

library(future)
library(futuremice)
library(mice, warn.conflicts = FALSE)

# how many cores
future::availableCores()
#> system 
#>     10

st <- Sys.time()
set.seed(123)
future::plan("multisession", workers = pmin(2L, future::availableCores()))
A <- future_mice(nhanes, m = 150, seed = 123, maxit = 50)
#> Converged in 33 iterations
#> R-hat: 1.025/1.016/1.027/1.02
B <- future_mice(nhanes, m = 150, seed = 123, maxit = 50)
#> Converged in 33 iterations
#> R-hat: 1.025/1.016/1.027/1.02
identical(A$imp, B$imp)
#> [1] TRUE
identical(complete(A, 5), complete(B, 5))
#> [1] TRUE
!identical(complete(A, 1), complete(A, 2))
#> [1] FALSE
future::plan("sequential")
Sys.time() - st
#> Time difference of 10.98159 mins

st <- Sys.time()
set.seed(123)
A <- futuremice(nhanes, m = 150, n.core = 10, maxit = 50, parallelseed = 123)
B <- futuremice(nhanes, m = 150, n.core = 10, maxit = 50, parallelseed = 123)
identical(A$imp, B$imp)
#> [1] TRUE
identical(complete(A, 5), complete(B, 5))
#> [1] TRUE
!identical(complete(A, 1), complete(A, 2))
#> [1] TRUE
Sys.time() - st
#> Time difference of 11.47086 secs

Created on 2022-11-12 with reprex v2.0.2

mice::futuremice() is in this case more than 57 times faster. Also, the imputed datasets are correctly not identical for mice::futuremice(): complete(A, 1) and complete(A, 2) differ, whereas they are identical for futuremice::future_mice().
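
The speed-up can be checked directly from the two elapsed times printed in the reprex. The variable names below are made up for illustration. Both timings cover the two calls (A and B) plus the checks, so the ratio compares like with like.

```r
# Elapsed times copied from the reprex output above
secs_future_mice <- 10.98159 * 60  # futuremice::future_mice(), minutes to seconds
secs_futuremice  <- 11.47086       # mice::futuremice(), already in seconds

# Ratio of the two elapsed times
speedup <- secs_future_mice / secs_futuremice
round(speedup, 1)
#> [1] 57.4
```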

gerkovink commented 1 year ago

@thomvolker

thomvolker commented 1 year ago

The problem here is that futuremice() just runs mice() multiple times, distributed over cores. Each call to mice() creates overhead, because it performs a number of checks that only need to be done once. Doing this for every imputation would make the function slower.

In my opinion, the ideal solution would be to have an option to parallelize the sampler in mice, which is what uses most of the time. However, this could potentially introduce new issues, not least with respect to backward compatibility.

I agree that it would be very useful to have a progress bar, but the current implementation, which was chosen for its efficiency, does not allow for a straightforward implementation of a progress bar, as @gerkovink detailed above. If efficiency is less important than knowing at what stage of the imputations you are, you can use the following code to implement a progress bar yourself. In the meantime, I will try to think of a way to implement a progress bar in futuremice that does not affect its efficiency.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
library(furrr)
#> Warning: package 'furrr' was built under R version 4.2.2
#> Loading required package: future
library(purrr)

set.seed(123)

future::availableCores()
#> system 
#>      8

m <- 80

plan(multisession)

progressr::with_progress({
  p <- progressr::progressor(along = 1:m)
  imps <- future_map(1:m, function(x) {
    p(sprintf("x=%g", x))
    imp <- mice(boys,
                m=1,
                maxit=50,
                printFlag = FALSE)
    imp
  }, .options = furrr_options(seed = TRUE))
})

obj <- imps[[1]]

for(i in 2:length(imps)) {
  obj <- ibind(obj, imps[[i]])
}

obj$imp <- map(obj$imp, function(x) {
  colnames(x) <- 1:ncol(x)
  x
})

complete(obj, 1:2, mild = TRUE) |>
  map(head)
#> $`1`
#>      age  hgt   wgt   bmi   hc gen phb tv   reg
#> 3  0.035 50.1 3.650 14.54 33.7  G2  P2  2 south
#> 4  0.038 53.5 3.370 11.77 35.0  G3  P4  1 south
#> 18 0.057 50.0 3.140 12.56 35.2  G3  P4  1 south
#> 23 0.060 54.5 4.270 14.37 36.7  G1  P1  3 south
#> 28 0.062 57.5 5.030 15.21 37.3  G1  P1  3 south
#> 36 0.068 55.5 4.655 15.11 37.0  G1  P1  1 south
#> 
#> $`2`
#>      age  hgt   wgt   bmi   hc gen phb tv   reg
#> 3  0.035 50.1 3.650 14.54 33.7  G1  P1  3 south
#> 4  0.038 53.5 3.370 11.77 35.0  G3  P3  2 south
#> 18 0.057 50.0 3.140 12.56 35.2  G4  P3  2 south
#> 23 0.060 54.5 4.270 14.37 36.7  G1  P2  2 south
#> 28 0.062 57.5 5.030 15.21 37.3  G1  P1  1 south
#> 36 0.068 55.5 4.655 15.11 37.0  G1  P1  1 south

Created on 2022-11-15 with reprex v2.0.2

gerkovink commented 1 year ago

Thanks @thomvolker.

isaactpetersen commented 1 year ago

Thanks very much for considering this! Would it be possible to re-open the issue?

gerkovink commented 1 year ago

It is not an issue with mice. I've noted it as a feature request. If we decide to change the sampler to a parallel approach, we'll definitely implement it.

thomvolker commented 1 year ago

@gerkovink We could implement a progress bar by just adding the code in my reprex to the futuremice function, and only call this adjusted method when the user specifies progressbar = TRUE in the futuremice call?

Otherwise, I don't see any merit in reopening this issue, because the only useful way of doing this is by revising the sampler, which is not really a priority.

gerkovink commented 1 year ago

@thomvolker wouldn't that impact speed?

thomvolker commented 1 year ago

It would. Especially for large data sets, I suppose. That's why I would generally not use a progress bar. But if users want to sacrifice speed for a progress bar, who am I to stop them? By default, I would set such an argument to FALSE.

gerkovink commented 1 year ago

Call me old-fashioned, but I don't like advocating a suboptimal implementation. @stefvanbuuren what do you think?

stefvanbuuren commented 1 year ago

I can see the value of a progress bar. In single-core mice() we get to see the iteration print. In a parallel version we currently don't see anything. But we don't want the process to be much slower.

Implementing interprocess communication inevitably slows things down. We could bypass mice initialisation by something like burst <- 3; imp <- futuremice(data, maxit = burst); imp <- futuremicemids(imp, maxit = burst), which reinitialises the parallel process every burst iterations, but I think the payoff will not be large because initialisation is generally rapid. How well it works depends on the tradeoff between parallelisation time and imputation time, which in turn depends on the size and complexity of the data.
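
As a rough sketch of how a burst scheme could drive a progress bar, the base R code below splits maxit iterations into bursts and advances a text progress bar after each one. The helper `burst_schedule()` is hypothetical, and the imputation step is only a placeholder comment; nothing here calls mice or reflects how futuremice() is implemented.

```r
# Hypothetical helper: split maxit iterations into bursts of at most `burst`
burst_schedule <- function(maxit, burst) {
  full <- maxit %/% burst              # number of complete bursts
  rest <- maxit %% burst               # iterations left for a final short burst
  c(rep(burst, full), if (rest > 0) rest)
}

schedule <- burst_schedule(maxit = 50, burst = 3)  # 16 bursts of 3, then 1 of 2

# Base R text progress bar, advanced once per completed burst
pb <- utils::txtProgressBar(min = 0, max = sum(schedule), style = 3)
done <- 0
for (n_iter in schedule) {
  # Placeholder: run `n_iter` further iterations in parallel here
  done <- done + n_iter
  utils::setTxtProgressBar(pb, done)
}
close(pb)
```

With these settings the bar would update 17 times over 50 iterations. A smaller burst gives finer progress information at the price of more restarts of the parallel process.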

Adding a progress bar should increase execution time by no more than 10 percent. Are we able to achieve that?
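
One way to make the 10 percent criterion checkable is to compare elapsed times with and without the progress bar. The helpers and the timings below are hypothetical and for illustration only; in practice the two inputs could come from system.time(...)["elapsed"] on otherwise identical runs.

```r
# Hypothetical helper: extra run time as a fraction of the plain run time
relative_overhead <- function(secs_plain, secs_with_bar) {
  (secs_with_bar - secs_plain) / secs_plain
}

# Hypothetical helper: TRUE if the overhead stays within the budget (default 10%)
within_budget <- function(secs_plain, secs_with_bar, budget = 0.10) {
  relative_overhead(secs_plain, secs_with_bar) <= budget
}

# Made-up timings in seconds, not measurements
within_budget(secs_plain = 100, secs_with_bar = 108)
#> [1] TRUE
within_budget(secs_plain = 100, secs_with_bar = 115)
#> [1] FALSE
```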