amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

`futuremice` scales poorly on 64/128-core machines #566

Closed vkhodygo closed 1 year ago

vkhodygo commented 1 year ago

Describe the bug
futuremice runs just fine when I do 10 or 20 imputations at the same time. When I increase the number to, say, 50 or 100 while keeping all other parameters the same, it just sits there indefinitely.

To Reproduce
Not sure that any code would be suitable here.

Expected behavior
Running 10, 20, 50, or 100 imputations at once should take roughly the same amount of time.

stefvanbuuren commented 1 year ago

This sounds more like a resource problem than a bug.

vkhodygo commented 1 year ago

@stefvanbuuren How is that so?

stefvanbuuren commented 1 year ago

It would be useful if we could have a reprex somehow; otherwise it is very hard for us to chase down. Did you try setting m = 200 with a small problem?
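
Something along these lines would do (a minimal sketch; nhanes is the small example data set shipped with mice, and the n.core value here is arbitrary):

library(mice)

# sketch: many imputations on a small problem, to separate futuremice overhead
# from the cost of the imputation models themselves
system.time(
  imp <- futuremice(nhanes, m = 200, n.core = 4, parallelseed = 123)
)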

vkhodygo commented 1 year ago

I can give it a go, but that might take some time.

vkhodygo commented 1 year ago

@stefvanbuuren I tried to come up with an example that closely matches my data, but that's cumbersome. Instead, I used the code from the futuremice vignette:

library(mice)
version()

set.seed(123)

n_features <- 20
small_covmat <- diag(n_features)
small_covmat[small_covmat == 0] <- 0.5
small_data <- MASS::mvrnorm(10000,
                            mu = rep(0, n_features),
                            Sigma = small_covmat)

small_data_with_missings <- ampute(small_data, prop = 0.8, mech = "MCAR")$amp

n_streams <- 5
start_time <- Sys.time()

imp <- futuremice(small_data_with_missings,
                  parallelseed = 123,
                  n.core = n_streams,
                  m = n_streams,
                  maxit = 1,
                  method = "rf",
                  ntrees = 10)

end_time <- Sys.time()
end_time - start_time

This is the output with n_streams == 5:

$ Rscript main.R

Attaching package: ‘mice’

The following object is masked from ‘package:stats’:

    filter

The following objects are masked from ‘package:base’:

    cbind, rbind

[1] "mice 3.16.0 2023-05-24 /home/software/.local/easybuild/software/R/4.2.0-foss-2021b/lib/R/library"
Time difference of 24.37422 secs

This is what I get when n_streams == 50:

$ Rscript main.R

Attaching package: ‘mice’

The following object is masked from ‘package:stats’:

    filter

The following objects are masked from ‘package:base’:

    cbind, rbind

[1] "mice 3.16.0 2023-05-24 /home/software/.local/easybuild/software/R/4.2.0-foss-2021b/lib/R/library"
Time difference of 2.550763 mins

and when n_streams == 100:

$ Rscript main.R

Attaching package: ‘mice’

The following object is masked from ‘package:stats’:

    filter

The following objects are masked from ‘package:base’:

    cbind, rbind

[1] "mice 3.16.0 2023-05-24 /home/software/.local/easybuild/software/R/4.2.0-foss-2021b/lib/R/library"
Time difference of 2.003762 mins

Real-life numbers are much, much worse since I work mostly with categories, and their number is significantly higher. As this number goes up, literally every process starts spawning threads like there is no tomorrow. I understand that there is some overhead, but that's a bit too much. This automatically results in 100% load even when n_streams is low.

Just to show what I have to deal with: the same code with the actual data and n_streams == 5 needs about 25 minutes to finish the very first iteration on my laptop. On the cluster with two CPUs, 32 cores each, the code with n_streams == 25 has been running for an hour, and I have no idea when it'll be done.

vkhodygo commented 1 year ago

@stefvanbuuren Got it done, at least in part. The same code with n_streams == 5 on the cluster:

Time difference of 2.386232 hours

and with n_streams == 25:

Time difference of 12.78157 hours

I'd blame Intel MKL or something (https://github.com/HenrikBengtsson/future/issues/405), but those are AMD machines.

stefvanbuuren commented 1 year ago

> Real-life numbers are much, much worse since I work mostly with categories, and their number is significantly higher.

Perhaps the problem is not with futuremice() but is caused by the large number of categories. If you have 1001 categories, then mice tries to create 1000 dummy variables...
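
For a sense of scale, this is roughly what the dummy expansion looks like (a sketch with a synthetic factor, not mice's internal code):

# a factor with 26 categories expands to 25 dummy columns plus an intercept
f <- factor(sample(letters, 1000, replace = TRUE), levels = letters)
ncol(model.matrix(~ f))
#> [1] 26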

What happens if you specify method = "pmm"?
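
For instance, on the reprex above (a sketch; everything except method is copied from the earlier call):

# sketch: same reprex, but with predictive mean matching instead of random forests
imp_pmm <- futuremice(small_data_with_missings,
                      parallelseed = 123,
                      n.core = n_streams,
                      m = n_streams,
                      maxit = 1,
                      method = "pmm")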

vkhodygo commented 1 year ago

> caused by the large number of categories. If you have 1001 categories, then mice tries to create 1000 dummy variables...

The number of categories is high, that's true. However, this should not affect parallel and independent imputations: I can run a handful of them just fine, but anything beyond that takes too much time.

Anyway, real data with pmm:

Error in (function (.x, .f, ..., .progress = FALSE)  : 
  ℹ In index: 1.
Caused by error in `chol.default()`:
! the leading minor of order 1 is not positive
Calls: futuremice ... resolve.list -> signalConditionsASAP -> signalConditions
Execution halted

The reprex works just fine.

stefvanbuuren commented 1 year ago

Thanks. On my desktop with 9 free cores, I found your reprex executes in 15 seconds (n = 5), 14 seconds (n = 9), 21 seconds (n = 18), 47 seconds (n = 50), and 1.37 minutes (n = 100). I think this is as it should be.

I am not sure what causes the `leading minor of order 1 is not positive` error, but I have seen this error appear when there are a lot of collinear variables. mice() tries very hard to remove these during the iterations by means of the internal remove.lindep() function. This checking process is, however, inefficient and can sometimes take >99% of the processor time.

Random forests (method = "rf") are quite robust against collinear variables, so remove.lindep() may be overkill for your setup. It is possible to bypass remove.lindep() by adapting your call to mice(..., eps = 0). Could you give this a try on the real data?
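
On the reprex above that would look roughly like this (a sketch; eps = 0 is simply forwarded through futuremice()'s ... to mice()):

# sketch: same reprex call, with the linear-dependency check bypassed via eps = 0
imp <- futuremice(small_data_with_missings,
                  parallelseed = 123,
                  n.core = n_streams,
                  m = n_streams,
                  maxit = 1,
                  method = "rf",
                  ntrees = 10,
                  eps = 0)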


stefvanbuuren commented 1 year ago

It could also help to simplify your model, e.g., by using quickpred() to select the most important predictors for each variable that you want to impute.
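
For example (a sketch; the mincor threshold is arbitrary and would need tuning on the real data):

# sketch: keep, for each incomplete variable, only predictors whose correlation
# with it (or with its missingness) exceeds 0.3, then pass the reduced
# predictor matrix on to futuremice()
pred <- quickpred(small_data_with_missings, mincor = 0.3)
imp <- futuremice(small_data_with_missings,
                  predictorMatrix = pred,
                  parallelseed = 123,
                  n.core = n_streams,
                  m = n_streams,
                  maxit = 1,
                  method = "rf",
                  ntrees = 10)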