amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Inefficient initialization of `method` vector #672

Closed stefvanbuuren closed 1 month ago

stefvanbuuren commented 1 month ago

In make.method() there is a piece of inefficient code that makes it unnecessarily slow for large datasets (high n):

  for (j in names(blocks)) {
    yvar <- blocks[[j]]
    y <- data[, yvar]
    def <- sapply(y, assign.method)
    k <- ifelse(all(diff(def) == 0), k <- def[1], 1)
    method[j] <- defaultMethod[k]
  }

The idea of sapply(y, assign.method) is to test the variable type of all variables in the block, and assign the same method if these have the same types. However, if blocks[[j]] contains only one variable (which is almost always the case), then data[, yvar] returns a plain vector rather than a data frame, so assign.method() is called for every data point in the y vector, which is highly inefficient.

The solution is to skip over the sapply() statement for single-variable blocks.
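
To illustrate the effect and the proposed skip, here is a minimal sketch on toy data. It is not the code in mice: `assign_method_sketch()` is a hypothetical stand-in for the internal `assign.method()` that also counts its calls, and the `data` and `blocks` objects are invented for the example.

```r
# Hypothetical stand-in for mice's internal assign.method(); it counts
# its calls and returns an index into a default-method vector.
calls <- 0
assign_method_sketch <- function(y) {
  calls <<- calls + 1
  if (is.numeric(y)) return(1)
  if (nlevels(y) == 2) return(2)
  if (nlevels(y) > 2) return(3)
  1
}

# Toy data and blocks (assumptions for this example only).
data <- data.frame(age = seq_len(1000) / 10,
                   sex = factor(rep(c("f", "m"), 500)))
blocks <- list(age = "age", sex = "sex", both = c("age", "sex"))

# Original pattern: a single column is dropped to a vector, so sapply()
# iterates over its elements rather than over variables.
calls <- 0
def_slow <- sapply(data[, "age"], assign_method_sketch)
calls_slow <- calls  # one call per row

# Proposed pattern: skip sapply() for single-variable blocks.
calls <- 0
k <- integer(0)
for (j in names(blocks)) {
  yvar <- blocks[[j]]
  if (length(yvar) == 1L) {
    # One call on the whole column.
    k[j] <- assign_method_sketch(data[[yvar]])
  } else {
    # Multi-variable block: one call per column, as before.
    def <- sapply(data[, yvar], assign_method_sketch)
    k[j] <- if (all(diff(def) == 0)) def[1] else 1
  }
}
calls_fast <- calls  # one call per variable across all blocks
```

In this sketch the original pattern makes as many calls as there are rows for a single column, while the loop with the skip makes one call per variable.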

thomvolker commented 1 month ago

Another option is to use something like the following, which might be a bit cleaner (removing the if-else construction):

yvar <- blocks[[j]]
y <- data[, yvar, drop=FALSE]
def <- apply(y, 2, assign.method)
...

I haven't tested this thoroughly, but y should now be a matrix or data.frame and apply loops over its columns.
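
Since the suggestion above is marked as untested, here is a small sketch of how `drop = FALSE` behaves on toy data. Again, `assign_method_sketch()` is a hypothetical stand-in for mice's internal `assign.method()` and the data are invented. One point to watch: `apply()` converts a data frame with `as.matrix()` before looping, so a factor column reaches the function as character. Looping over the data frame's columns with `vapply()` (or `sapply()`) keeps the column classes intact.

```r
# Hypothetical stand-in for mice's internal assign.method().
assign_method_sketch <- function(y) {
  if (is.numeric(y)) return(1)
  if (nlevels(y) == 2) return(2)
  if (nlevels(y) > 2) return(3)
  1
}

# Toy data (an assumption for this example only).
data <- data.frame(age = c(23, 35, NA, 51),
                   sex = factor(c("f", "m", "m", NA)))

# drop = FALSE keeps a one-column selection as a data.frame.
y <- data[, "sex", drop = FALSE]

# vapply() walks the columns of a data.frame and preserves their classes,
# so the stand-in sees a two-level factor.
def_cols <- vapply(y, assign_method_sketch, numeric(1))

# apply() first converts the data.frame with as.matrix(); the factor
# column becomes character, so the stand-in no longer sees any levels.
def_apply <- apply(y, 2, assign_method_sketch)
```

With this stand-in, the two calls classify the same column differently, which suggests that a column-wise `sapply()` or `vapply()` on the `drop = FALSE` selection is the safer variant of the idea.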