amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Non-reproducibility and failed imputations between versions #76

Closed dmaltschul closed 6 years ago

dmaltschul commented 6 years ago

Hello,

I recently tried to reproduce results from code I wrote several months ago, and I've run into some issues, primarily that mice isn't imputing any of the missing values I want.

I wanted to raise this as an issue because I had no trouble with these imputations using an earlier version (in the summer of 2017 - I am not sure of the version), and I used them in a prediction analysis to get very reasonable validation scores, so there wasn't anything wrong with the imputations.

The code runs, doesn't give any errors, but none of the imputed datasets now have any of the NAs filled in. This is somehow an issue involving the dataset, as I've tried running examples with mice using other datasets like nhanes, and they work fine.

The only clue I have is from loggedEvents, the results of which I pasted below.

  it im co dep meth out
1   1  1 19  19  pmm   4
2   1  1 26  26  pmm  29
3   1  1 27  27  pmm  30
4   1  2 19  19  pmm   4
5   1  2 26  26  pmm  29
6   1  2 27  27  pmm  30
7   1  3 19  19  pmm   4
8   1  3 26  26  pmm  29
9   1  3 27  27  pmm  30
10  2  1 19  19  pmm   4
11  2  1 26  26  pmm  29
12  2  1 27  27  pmm  30
13  2  2 19  19  pmm   4
14  2  2 26  26  pmm  29
15  2  2 27  27  pmm  30
16  2  3 19  19  pmm   4
17  2  3 26  26  pmm  29
18  2  3 27  27  pmm  30
19  3  1 19  19  pmm   4
20  3  1 26  26  pmm  29
21  3  1 27  27  pmm  30
22  3  2 19  19  pmm   4
23  3  2 26  26  pmm  29
24  3  2 27  27  pmm  30
25  3  3 19  19  pmm   4
26  3  3 26  26  pmm  29
27  3  3 27  27  pmm  30

(The numbers in dep and out are the column names, which I've anonymized as numbers; the dataset itself is attached. Not all columns were imputed: the first 3 and last 6 in particular were left out.) I've read the documentation, and I don't entirely understand what this output means, but I thought it might be informative for the authors and others. There do seem to be a few problem columns, though only a few, not all of them, and again, this wasn't an issue previously: all columns were successfully imputed.

I used the randomForest method originally, though changing to pmm or other methods makes no difference, the values remain NA. I'm no expert in multiple imputation, but I'm quite baffled.

Thanks for any help you can offer!

mice-ex.zip

gerkovink commented 6 years ago

Hi dmaltschul,

It seems that you have entered variables that indicate the missingness on other variables into the imputation model. As a result, these variables take on a single value (say 0) whenever the corresponding cell for another variable is NA. For example, column 4 is always 0 when column 19 is NA. In mice such systems are removed by default as they are computationally unsolvable: there is zero covariance for columns 4 and 19 when column 4 takes on the value 0.

I can think of two reasons why this dependency occurs:

  1. one variable is a missing data indicator for the other variable;
  2. one variable is a contingency variable that depends on the other variable such that its missingness is bona fide (e.g. if no job, then no income out of labour).

In scenario 1, the indicator should be excluded (the same information is captured in the incomplete variable). In scenario 2, the bona fide missingness should not be imputed. We are currently working on new ways of taking bona fide missingness into account when such variables serve as predictors for other incomplete variables.

So, your conclusion that the choice of methods makes no difference is correct. The highly dependent systems for variables 4, 29 and 30 are avoided by removing these variables and keeping their dependent counterparts. This is not an error, but a means of still being able to computationally solve the system as a whole.
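A minimal base R sketch of how one might look for this pattern before calling mice. It checks only the situation described above (a column that takes a single value on every row where another column is NA); it is not the check mice performs internally. The data frame, the column names, and the helper constant_when_missing() are made up for illustration.

```r
# Toy data (hypothetical): `flag` is always 0 on the rows where `x` is NA.
df <- data.frame(
  x    = c(1.2, NA, 3.4, NA, 5.1, 2.2),
  flag = c(1,   0,  1,   0,  1,   1),
  z    = c(10,  12, 9,   15, 11,  13)
)

# Hypothetical helper: for each incomplete column, list the other columns
# that are fully observed and constant on the rows where it is missing.
constant_when_missing <- function(data) {
  out <- list()
  for (target in names(data)) {
    miss <- is.na(data[[target]])
    if (sum(miss) < 2) next  # need at least two missing rows to judge
    others <- setdiff(names(data), target)
    is_const <- vapply(others, function(v) {
      vals <- data[[v]][miss]
      !anyNA(vals) && length(unique(vals)) == 1
    }, logical(1))
    hits <- others[is_const]
    if (length(hits) > 0) out[[target]] <- hits
  }
  out
}

suspects <- constant_when_missing(df)
print(suspects)
```

Any column flagged this way is a candidate for scenario 1 or 2 and is worth inspecting by hand before deciding whether to drop it.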

All the best,

Gerko

dmaltschul commented 6 years ago

Hi Gerko,

Thanks for weighing in. So, I went ahead and eliminated those variables from the dataset entirely, but I'm still having the same problem, and there's nothing in loggedEvents now either. The central problem is that mice isn't filling in the NAs; it is returning imputed datasets that are all the same as the original.

Could it be my syntax? I wouldn't think so, since I'm not changing much:

    mice.t = mice(df,
                  method = c('','','pmm',... ), # rest are all 'pmm' except the last six
                  diagnostics=TRUE, m = 10, maxit = 10)

(I've been keeping the arguments smaller than I would normally since I've been running this over and over.)

gerkovink commented 6 years ago

I am not able to replicate this. See the below example.

    require(mice)
    data <- read.csv(file = "mice-ex.csv", header = FALSE)
    imp <- mice(data[, -c(4, 29, 30)], meth = "pmm", m = 2, maxit = 1)
    imp$loggedEvents
    apply(is.na(complete(imp)), 2, sum)

[1] FALSE

FALSE here indicates that there are no missings anymore in the first imputed data set that is by default returned by mice::complete(). Two reasons I can think of why your data still has missings:

  1. If you set empty imputation methods via method = c('', '', 'pmm', etcetera), you exclude variables from imputation, meaning that the first two and the last six variables are not imputed.
  2. If you use complete(mice.t, "long", include = TRUE): the include = TRUE statement includes the original [i.e. incomplete] data on top of the imputed data sets. So, the first cases are the original data, the second set of cases represent the first imputed datasets, and so on.
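A small sketch of the second point, assuming the nhanes data that ships with mice rather than the attached dataset. In the long format the .imp column identifies each stacked copy, and with include = TRUE the copy with .imp equal to 0 is the original, incomplete data.

```r
library(mice)

# Impute the built-in nhanes data; settings kept small for speed.
imp <- mice(nhanes, m = 2, maxit = 1, seed = 1, printFlag = FALSE)

# Stack the original data (.imp == 0) on top of the imputed sets (.imp >= 1).
long <- complete(imp, action = "long", include = TRUE)

# Count the remaining NAs within each stacked copy.
na_per_set <- tapply(rowSums(is.na(long)), long$.imp, sum)
print(na_per_set)
```

Only the .imp = 0 block should still contain NAs, so filtering on long$.imp > 0 before counting missing values avoids mistaking the original rows for failed imputations.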

All the best,

Gerko

dmaltschul commented 6 years ago

Okay, that works for me too, which is great. I was able to isolate what seems to be causing the problem. When I tell method to include the information from certain columns but not impute them (e.g. the '' entries, see above), then mice is skipping through all of the columns, despite most of them having a 'pmm' or 'cart' or 'rf' assigned for their method.

So when I run your exact code but replace the meth argument 'pmm' with, say,

    meth = c('','','pmm','pmm','pmm','pmm','pmm','pmm','pmm','pmm',
             'pmm','pmm','pmm','pmm','pmm','pmm','pmm','pmm','pmm','pmm',
             'pmm','pmm','pmm','pmm','pmm','pmm','pmm','pmm',
             'pmm','','','','','','')

then I get the problem again. Is there something obvious here I am missing?

gerkovink commented 6 years ago

Your variables are still set to serve as predictors as specified by your predictor matrix. If you exclude the variables that have no imputation method from the predictor matrix, the problem disappears (see code below).

    require(mice)
    data <- read.csv(file = "mice-ex.csv", header = FALSE)
    ini <- mice(data[, -c(4, 29, 30)], maxit = 0)
    exclude <- c(1:2, 30:35)
    meth <- ini$method
    meth[exclude] <- ""
    pred <- ini$predictorMatrix
    pred[, exclude] <- 0
    imp <- mice(data[, -c(4, 29, 30)], meth = meth, pred = pred, m=2, maxit = 1)
    imp$loggedEvents
    any(is.na(complete(imp)[, -exclude]))

[1] FALSE

Best,

Gerko

gerkovink commented 6 years ago

See also Issue #75

dmaltschul commented 6 years ago

Hmm okay, yes, that does seem like the same issue.

So I suppose I can just impute all the variables and not use the ones I don't want in actual analyses, because I definitely do want the information in those vars to be used via the predictor matrix. That would get me to the same place as in earlier versions, I think. I was originally leaving out imputations for some predictor vars to save computational time, but I don't think it ended up making that much of a difference when I just left the imputations to run overnight.
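A rough sketch of that plan, using the nhanes data that ships with mice in place of my dataset. The choice of analysis_vars is made up for illustration. Every incomplete variable is imputed with the default predictor matrix, and only the columns needed for the analysis are kept from each completed dataset.

```r
library(mice)

# Impute all incomplete variables so each can serve as a predictor for the others.
imp <- mice(nhanes, m = 2, maxit = 1, seed = 1, printFlag = FALSE)

# Hypothetical subset of variables wanted for the downstream analysis.
analysis_vars <- c("age", "bmi")

# One completed data frame per imputation, restricted to the analysis variables.
completed <- lapply(seq_len(imp$m), function(i) {
  complete(imp, i)[, analysis_vars]
})

str(completed[[1]])
```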

Thanks for clearing that up.