amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Question about Partial Imputation with 'where' Argument #409

Closed ShaunFChen closed 3 years ago

ShaunFChen commented 3 years ago

I was trying to impute a large dataset that exceeds the memory limit of my machine, so I decided to use the 'where' argument for partial imputation. However, mice (version 3.13.0) failed to impute the target column whenever 'where' was specified. Below is the code to reproduce the issue.

library(mice)
# sample data
a <- as.factor(c("0", "0", "0", "0", "0", "1", "0", "0", "0", "1", "0", "1", "0", "0", "0", "0", "0", "0", "0", "0"))
b <- as.factor(c("0", "0", "1", "1", "0", "0", "0", "0", "0", "0", "1", "0", "0", "0", "1", "0", "1", "0", "1", "0"))
c <- as.factor(c("0", "0", "1", "0", "0", "0", "1", "0", "1", "0", "0", "0", "0", "0", "1", "0", "0", "0", "1", "0"))
x <- as.factor(c("0", "0", "0", "0", "1", "0", "0", NA, "0", "1", NA, "0", "1", NA, NA, "0", "1", "0", "0", "1"))
y <- as.factor(c("1", "1", "1", "1", "0", "1", "0", NA, "1", "0", NA, "1", "0", NA, NA, "1", "0", "1", "1", "0"))
df <- data.frame(a, b, c, x, y)
df
   a b c    x    y
1  0 0 0    0    1
2  0 0 0    0    1
3  0 1 1    0    1
4  0 1 0    0    1
5  0 0 0    1    0
6  1 0 0    0    1
7  0 0 1    0    0
8  0 0 0 <NA> <NA>
9  0 0 1    0    1
10 1 0 0    1    0
11 0 1 0 <NA> <NA>
12 1 0 0    0    1
13 0 0 0    1    0
14 0 0 0 <NA> <NA>
15 0 1 1 <NA> <NA>
16 0 0 0    0    1
17 0 1 0    1    0
18 0 0 0    0    1
19 0 1 1    0    1
20 0 0 0    1    0

With na.df specifying column "y" as the only target to be imputed, the missing values in the target column were not imputed in the result df.imp.mice.y.

# partial imputation
na.df <- is.na(df)

# ignore columns except 'y'
na.df[, colnames(na.df) != "y"] <- FALSE

df.mice.y <- mice(data = df, where = na.df, m = 1, method = "rf", maxit = 5, seed = 123)
df.imp.mice.y <- complete(df.mice.y)
df.imp.mice.y
   a b c    x    y
1  0 0 0    0    1
2  0 0 0    0    1
3  0 1 1    0    1
4  0 1 0    0    1
5  0 0 0    1    0
6  1 0 0    0    1
7  0 0 1    0    0
8  0 0 0 <NA> <NA>
9  0 0 1    0    1
10 1 0 0    1    0
11 0 1 0 <NA> <NA>
12 1 0 0    0    1
13 0 0 0    1    0
14 0 0 0 <NA> <NA>
15 0 1 1 <NA> <NA>
16 0 0 0    0    1
17 0 1 0    1    0
18 0 0 0    0    1
19 0 1 1    0    1
20 0 0 0    1    0

In contrast, using the default value of 'where' (i.e. is.na(df)) gives the expected outcome: all missing values are imputed.

# full imputation
df.mice.full <- mice(data = df, where = is.na(df), m = 1, method = "rf", maxit = 5, seed = 123)
df.imp.mice.full <- complete(df.mice.full)
df.imp.mice.full
   a b c x y
1  0 0 0 0 1
2  0 0 0 0 1
3  0 1 1 0 1
4  0 1 0 0 1
5  0 0 0 1 0
6  1 0 0 0 1
7  0 0 1 0 0
8  0 0 0 1 0
9  0 0 1 0 1
10 1 0 0 1 0
11 0 1 0 0 1
12 1 0 0 0 1
13 0 0 0 1 0
14 0 0 0 1 0
15 0 1 1 0 1
16 0 0 0 0 1
17 0 1 0 1 0
18 0 0 0 0 1
19 0 1 1 0 1
20 0 0 0 1 0

Other users reported a similar case on Stack Overflow: https://stackoverflow.com/questions/49977564/mice-partial-imputation-using-where-argument-failing. The response there only described potential reasons. It did not explain why the default settings can still impute the full data frame as a whole, while partial imputation does not tolerate missing values in the other columns. Any insight into this would be greatly helpful! Thank you.

thomvolker commented 3 years ago

This problem occurs because variable x is used to impute variable y. However, the cases that you want to impute in y have missing values in x as well, so x cannot be used to impute y for them. Because you don't impute the missing x values, part of the information needed to impute y is missing for these cases, and y stays unimputed.

When imputing the whole dataset, the missing values in x are imputed, and these imputations are then used to impute y. The imputations in y are in turn used to impute x in the next iteration, and so on. Hence, if you want to do partial imputation, you have to make sure that the variables used in the imputation model of your target variable do not have missing values for the cases that you want to impute. Additionally, the values that you exclude from the imputation model should not be related to the variables that you want to impute, nor should these variables affect the probability that an observation is missing on one of the variables that you try to impute.
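The condition described above can be checked before calling mice(). The following is a minimal base-R sketch, assuming the example data from the original post; the names `target_rows`, `predictors`, and `blocked` are illustrative and not part of mice. It lists the rows where y is to be imputed but at least one candidate predictor is itself missing.

```r
# Rebuild the example data from the original post (factor levels "0"/"1").
df <- data.frame(
  a = factor(c(0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0)),
  b = factor(c(0,0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0)),
  c = factor(c(0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0)),
  x = factor(c(0,0,0,0,1,0,0,NA,0,1,NA,0,1,NA,NA,0,1,0,0,1)),
  y = factor(c(1,1,1,1,0,1,0,NA,1,0,NA,1,0,NA,NA,1,0,1,1,0))
)

# Rows in which the target variable y has to be imputed.
target_rows <- which(is.na(df$y))

# Candidate predictors: every column except the target
# (this mirrors the default predictor set, as an assumption).
predictors <- setdiff(names(df), "y")

# Target rows in which at least one predictor is missing. These are the
# rows for which y cannot be imputed unless the predictor is imputed too
# or dropped from the model for y.
blocked <- target_rows[!complete.cases(df[target_rows, predictors, drop = FALSE])]
print(blocked)
```

In this example every row with a missing y also has a missing x, which is consistent with none of the y values being imputed in the partial run.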

stefvanbuuren commented 3 years ago

Well explained.

mice propagates missing values in the predictors. Under the default, all missing data are imputed everywhere, so there is no propagation. Use where in combination with the predictorMatrix argument to avoid being left with missing values after running mice().
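One way to read this advice is sketched below, as an illustration under assumptions rather than the maintainers' recommended recipe: keep the `where` matrix restricted to y, and zero out the entry of the predictor matrix that makes x a predictor of y, so that y is imputed from the complete columns a, b and c only. The sketch assumes that make.predictorMatrix() is available in the installed mice version and that the "rf" method's random forest backend is installed. Whether dropping x from the model for y is statistically acceptable depends on the data.

```r
library(mice)

# Example data from the original post.
df <- data.frame(
  a = factor(c(0,0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0)),
  b = factor(c(0,0,1,1,0,0,0,0,0,0,1,0,0,0,1,0,1,0,1,0)),
  c = factor(c(0,0,1,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0,1,0)),
  x = factor(c(0,0,0,0,1,0,0,NA,0,1,NA,0,1,NA,NA,0,1,0,0,1)),
  y = factor(c(1,1,1,1,0,1,0,NA,1,0,NA,1,0,NA,NA,1,0,1,1,0))
)

# Impute only the missing cells in column y.
where_y <- is.na(df)
where_y[, colnames(where_y) != "y"] <- FALSE

# Rows of the predictor matrix are targets, columns are predictors.
# Setting pred["y", "x"] to 0 removes the incomplete x from the model for y.
pred <- make.predictorMatrix(df)
pred["y", "x"] <- 0

imp <- mice(df, where = where_y, predictorMatrix = pred,
            m = 1, method = "rf", maxit = 5, seed = 123, printFlag = FALSE)
completed <- complete(imp)

# y should now be complete, while x keeps its original missing values.
colSums(is.na(completed))
```

If x carries information about y that should not be discarded, the alternative that follows from the explanation above is to include the missing x cells in `where` as well, so that x is imputed alongside y.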