Closed ShaunFChen closed 3 years ago
This problem occurs because variable `x` is used to impute variable `y`. However, the cases that you want to impute in `y` have missing values in `x` as well, so `x` cannot be used to impute `y` for them. Because you don't impute the missing `x` values, part of the information needed to impute `y` is missing for these cases, and `y` stays unimputed there.
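A minimal sketch of the situation described above, using a toy data frame rather than the data from this issue. The variable names, sample size, and missingness pattern are assumptions made for illustration; only `mice()`, `make.where()`, and `complete()` are real `mice` functions.

```r
library(mice)

set.seed(1)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
df <- data.frame(x = x, y = y)

# Toy missingness pattern (an assumption, not the data from this issue):
# rows 1-20 miss y only, rows 21-40 miss both x and y.
df$y[1:40] <- NA
df$x[21:40] <- NA

# Target only the missing cells in y; x is left untouched.
where_y <- make.where(df, keyword = "none")
where_y[, "y"] <- is.na(df$y)

imp_y <- mice(df, where = where_y, m = 1, maxit = 5,
              printFlag = FALSE, seed = 1)
completed <- complete(imp_y, 1)

# Cross-tabulate, for the 40 target rows, whether x was missing
# against whether y is still missing after the partial imputation.
table(x_missing = is.na(df$x[1:40]),
      y_still_missing = is.na(completed$y[1:40]))
```

If the explanation above holds for your `mice` version, the table should show that the rows still missing in `y` are the ones where `x` was missing too.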
When imputing the whole dataset, the missing values in `x` are imputed, and these imputations are subsequently used to impute `y`. The imputations in `y` are then used to impute `x` in the next iteration, and so on. Hence, if you want to do partial imputation, you have to make sure that the variables used in the imputation model of your variable do not have missing values for the cases that you want to impute. Additionally, the values that you exclude from the imputation model should not be related to the variables that you want to impute, nor should these variables affect the probability that an observation is missing on one of the variables that you try to impute.
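For comparison, a short sketch of the default behaviour on a toy data frame (the data and names are assumptions, not taken from this issue). With the default `where`, every missing cell is a target, so each variable is complete by the time it serves as a predictor.

```r
library(mice)

set.seed(2)
n <- 200
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)
df <- data.frame(x = x, y = y)

# Toy missingness pattern (an assumption): overlapping gaps in x and y.
df$y[1:40] <- NA
df$x[21:40] <- NA

# Default where = is.na(df): x and y are imputed in turn over maxit
# iterations, each using the current imputations of the other.
imp_full <- mice(df, m = 1, maxit = 5, printFlag = FALSE, seed = 2)
completed_full <- complete(imp_full, 1)

# Remaining missing values per column.
colSums(is.na(completed_full))
```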
Well explained.
`mice` propagates missing values in the predictors. Under the default, all missing data are imputed everywhere, so there will be no propagation. Use `where` in combination with the `predictorMatrix` argument to avoid leftover missing values after running `mice()`.
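One way to read this suggestion, sketched on toy data: restrict the imputation model for `y` to predictors that are fully observed, and target only `y` with `where`. The fully observed column `z` and the missingness pattern are assumptions for illustration; whether dropping `x` from the model is acceptable depends on your data.

```r
library(mice)

set.seed(3)
n <- 200
z <- rnorm(n)                 # hypothetical fully observed predictor
x <- 0.5 * z + rnorm(n)
y <- 0.5 * x + 0.5 * z + rnorm(n)
df <- data.frame(x = x, y = y, z = z)

# Toy missingness pattern (an assumption): overlapping gaps in x and y.
df$y[1:40] <- NA
df$x[21:40] <- NA

# Impute only the missing cells in y.
where_y <- make.where(df, keyword = "none")
where_y[, "y"] <- is.na(df$y)

# Drop the incomplete x from the model for y, so that y is predicted
# from the complete z only.
pred <- make.predictorMatrix(df)
pred["y", "x"] <- 0

imp <- mice(df, where = where_y, predictorMatrix = pred,
            m = 1, maxit = 5, printFlag = FALSE, seed = 3)
completed <- complete(imp, 1)

# Remaining missing values per column; x is intentionally left as-is.
colSums(is.na(completed))
```

The trade-off is that `y` is imputed without the information in `x`, which may weaken the imputation model if `x` is a strong predictor.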
I was trying to impute a large dataset that exceeds the memory of my machine, so I decided to use the `where` argument for partial imputation. However, `mice` (version 3.13.0) failed to impute the target column, but only when `where` was assigned. Below is the code to reproduce the issue. With `na.df` specifying the "y" column as the only target to be imputed, the target missing values didn't get imputed in the result `df.imp.mice.y`. Instead, using the default value of `where` (`is.na(df)`) gives the expected outcome for all the missing values. Other users reported a similar case on Stack Overflow (https://stackoverflow.com/questions/49977564/mice-partial-imputation-using-where-argument-failing), but the response there only described the potential reasons. It didn't explain why the default settings can still impute the full data frame as a whole, while partial imputation doesn't tolerate missing values in the other columns. Any insight into this would be greatly helpful! Thank you.