amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Missing Data for Imputed Variables Post-Imputation #534

Closed VictorPorcelli closed 1 year ago

VictorPorcelli commented 1 year ago

Dataset: impdata.csv

Edit: Apologies, I forgot to include the data manipulation to get the .csv in the same format as the dataset I used. See below.

library(tidyverse)  # provides read_csv(), the dplyr verbs and pivot_longer() used below

impdata <- read_csv("impdata.csv")
impdata <- impdata %>%
  mutate(across(c(ID, predictor3, predictor8, predictor9, predictor12), ~factor(.)),
         outcome_var2 = as.integer(outcome_var2))

Hello. I have been using mice for a while, and a common use case in my work involves situations where we would like to impute covariates or predictor variables in our dataset, but leave outcome variables unimputed. In these cases, we also tend to want to still use the outcome variables as predictors in the imputation models for the covariates.

It is in such a case, using the data I've attached here, that I had an issue using mice. The short of it is, after simply setting the method vector values for outcome variables to "" so as to not impute them and running mice, multiple predictors still have some missing data remaining. It is my understanding from reading the documentation that this is not intended, and any variables specified to be imputed should not have missing data in the resulting mids object.

Here is my code below:

# The goal: impute all predictors, using all variables as predictors BUT exclude outcomes from being imputed
# select outcomes
outcomes <- impdata %>% select(starts_with("outcome")) %>% names()

# make a predictor matrix and methods vector using mice's default methods
pred <- mice::make.predictorMatrix(impdata)
meth <- mice::make.method(impdata)

# for all outcomes, set their method to "" so they are not imputed
meth[outcomes] <- ""

# impute
test_mids <- mice::mice(impdata, predictorMatrix = pred, method = meth)

# expectation: all predictors will be imputed, there is no missing data for predictors
test_mids %>% complete('long') %>% 
  select(-starts_with("outcome")) %>%
  summarise_all(~sum(is.na(.))) %>% 
  pivot_longer(everything(), names_to = "Variable", values_to = "NumberMissing") %>% 
  arrange(desc(NumberMissing)) %>% 
  print()

# result: there is missing data for predictors

The same occurs when removing outcomes from the rows of the predictor matrix and using the blocks argument:

# make a predictor matrix using mice's default methods
pred <- mice::make.predictorMatrix(impdata)

# for all outcomes, remove them from the predictor matrix
pred <- pred[-c(2,3,4,5),]

imp_blocks <- mice::make.blocks(impdata %>% select(-starts_with("outcome")))

# impute
test_mids <- mice::mice(impdata, predictorMatrix = pred, blocks = imp_blocks)

Yet, if I stop trying to exclude the outcomes from being imputed and simply use this code:

test_mids <- mice::mice(impdata)

I end up with a complete dataset. Of course, this is expected, but I am unsure why there is no longer any missing data for the predictors simply because outcomes are imputed as well -- shouldn't this have no effect on the imputation of the predictors?

gerkovink commented 1 year ago

Removed as markup does not work via e-mail. See below for the edited response.

gerkovink commented 1 year ago

If incomplete model outcomes serve as predictors (i.e. columns in pred) for the imputation of incomplete model predictors, then the cases for which the model outcome is not observed are left unimputed. The reason is the same as why you want to impute in the first place; no information means mathematical problems in estimating the full data matrix and its corresponding estimates. You cannot have your cake and eat it too.
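The point above can be checked on a small example. What follows is only a sketch under stated assumptions, not the maintainers' code: it uses the `nhanes` data shipped with mice as a stand-in for impdata, and treats `chl` as the "outcome" (an arbitrary choice made for illustration). It cross-tabulates the rows where a covariate is still missing after imputation against the rows where the unimputed outcome is missing. The exact counts, and whether any NA remains at all, may depend on the installed mice version.

```r
library(mice)

# Stand-in data (assumption): nhanes ships with mice; `chl` plays the outcome.
dat <- nhanes

# Default methods, but leave the outcome unimputed.
meth <- make.method(dat)
meth["chl"] <- ""

# Default predictor matrix: chl remains a predictor for bmi and hyp.
pred <- make.predictorMatrix(dat)

imp <- mice(dat, method = meth, predictorMatrix = pred,
            m = 2, maxit = 2, seed = 1, printFlag = FALSE)

# First completed dataset.
done <- complete(imp, 1)

# Rows in which a covariate is still missing after imputation.
still_na <- is.na(done$bmi) | is.na(done$hyp)

# Compare against missingness of the unimputed outcome.
print(table(outcome_missing = is.na(dat$chl),
            covariate_still_na = still_na))
```

If the explanation holds, any row counted as still missing should also be a row where `chl` was not observed.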

Personally, I believe that both directions of the outcome and predictor relations should be taken into account during imputation, as you may otherwise render the models uncongenial. You expect a relation to be there during analysis, which may indeed be there in the observations. However, you do not allow the algorithm to produce those relations by restricting the predictor matrix.

There seems to be a persistent thought in some fields that this imputation relation may not occur. However, the imputation and analysis steps in MI with mice are quite distinct. The analysis focuses on drawing inference from a model, given that some cells were originally unobserved. The imputation focuses on drawing values that could have been observed from the posterior predictive distribution. Imputation thus reverse engineers the path by taking into account that the incomplete sample comes - by some mechanism - from a complete sample, which in turn comes - given some mechanism - from a finite or infinite population. The multiple analyses and pooling steps inherit this uncertainty by considering that not all data cells currently analyzed are equally certain.

If you ask me, I'd always impute the model outcome conditional on the data. After all, imputation is not prediction.

stefvanbuuren commented 1 year ago

"... where we would like to impute covariates or predictor variables in our dataset, but leave outcome variables unimputed. In these cases, we also tend to want to still use the outcome variables as predictors in the imputation models for the covariates."

If you use incomplete predictors in the imputation model, the imputed value will be NA. You have two choices: