amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Get at the final model used in the MICE iterations? #32

Closed mmaechler closed 3 years ago

mmaechler commented 7 years ago

Dear Stef (et al), this is not a bug report, but a public "request" for advice.

Context: We use mice on a medium-sized data set of Swiss meteo and bio data (several locations, species, etc.) and mainly need to impute one Y variable (which, however, is also used in lagged form as a predictor) in a linear regression model. Imputations work fine (using "pmm" and the default least-squares regression), though a perfect model would take into account that the errors seem to be more heavy-tailed than the Gaussian, and in an ideal world we would use robust regression (e.g., as in robustbase::lmrob()).

To assess the imputations, we would like to compare the empirical distribution of the several imputed values with a hypothesized Gaussian of "known" $(\mu, \sigma) = (x'\beta, \sigma)$, and hence would want to recover $(\beta, \sigma)$ from the regression model that was used in mice (possibly fitting $\beta, \sigma$ on different data; e.g., in a missingness simulation, fitting them to the full, non-missing data). Our problem is that the mice.impute.xxx() functions which mice() works with do not keep the parameters of the models used, but only return the predicted values, which is perfect for what they are designed to do but leaves us without a clue about what the final model looked like.
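For concreteness, a minimal sketch of the kind of check being asked for, under the assumption (hypothetical names dat, y, x1, x2) that $\beta$ and $\sigma$ are refit on the complete cases with lm(), since mice does not expose its internal fit:

miss  <- is.na(dat$y)                          # rows where the target is missing
fit   <- lm(y ~ x1 + x2, data = dat[!miss, ])  # stand-in for mice's internal model
mu    <- predict(fit, newdata = dat[miss, ])   # x'beta for the cells to be checked
sigma <- summary(fit)$sigma                    # residual standard deviation
# compare the m imputed values for each missing cell against N(mu[i], sigma),
# e.g. with qqnorm() or a per-cell Kolmogorov-Smirnov test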

What do you propose? I assume others have had related wishes in the past, and that there already is a perfect solution?

stefvanbuuren commented 7 years ago

Dear Martin, thanks for your question.

As you've found out, there is no standard facility in mice to save the imputation model, or to apply it to some new data. The primary function of the imputation model is just to produce good imputations. The imputation model itself is of little scientific interest, and the parameters generally have no sensible interpretation, so I've never felt the desire to save and study them.

Applying a fitted imputation model to new data can probably be implemented technically, and the results should give valid inferences, of course assuming that the relations in the fitted and new data are the same. The alternative is to stack the old and new data, and re-impute the stacked data. The first analysis is likely to be less efficient than the second one because it ignores the information in the new data when estimating the imputation model.

Fixing parameters in the imputation model is easily done by writing your own mice.impute.xxx(), and we can save the parameters of the last fitted model on a case-by-case basis by tampering with the relevant mice.impute.xxx() function.

Having said this, there will be scenarios where it can be useful to store and re-use the imputation model for new data. One such scenario is simply to speed up the algorithm by saving the values over the iterations (the only memory in the MICE algorithm is the data). Another might be production environments, where the imputation model is estimated from old data and needs to be controlled and fixed for any new data. Also, the stacked data may simply become too large.

It is technically non-trivial to store and replay the imputation process. Model fitting is only one of the steps, so it is not enough to just store the fitted model object. We also need to codify the procedure that uses the estimated model parameters to calculate and/or draw the imputations. Perhaps the broom package could assist us here, but it is probably not rich enough to codify the imputation model. An additional complication is that some methods (e.g., predictive mean matching) draw from the observed data, so we may also need to store these, depending on the requirements for reproducibility. But perhaps there are brave people out there willing to give it a try?

Stef.

ajsteele commented 7 years ago

Hello!

I too am looking for a way to use mice on new data, in the context of developing the imputation model on a training set and then applying it to an unseen test set, to allow for fair comparisons with models which do not require imputation but are trained and tested on the same data split.

In your answer above, you say that this can 'technically be implemented'. Do you mean with the existing mice package, or with additional code? If the former, could you give example code showing how to use mice to apply the fitted imputation model to new data?

Thanks very much in advance,

Andrew

RianneSchouten commented 6 years ago

@ajsteele we would need some kind of function that stores the parameters of the imputation model (or at least, the parameters of the last model). I am planning on writing this function (but I have many plans, so I don't know when I will do this). Just out of curiosity, what exact kind of comparisons are you doing?

ajsteele commented 6 years ago

@RianneSchouten

I've been comparing different kinds of survival model with different methods of dealing with missing values (imputation, 'missing indicator', discretisation) on medical records data. My concern is that imputing on the full dataset could allow information to 'leak' between the training and test sets, giving an unfair advantage to models using imputation…but there's no direct way to test this!

Thanks for the offer, and there's no particular hurry…this project is winding down a bit now anyway. But I think the tools could be of use to future researchers with similar problems? :)

stephenleo commented 6 years ago

Hi, to add to the above: I'm interested in using MICE imputation on a production data set that is updated every few seconds. This requires the imputation model to be exported so that it can be re-used as new data comes in. Re-running a "cart" imputation model every time a new data point arrives is not realistic, given the volume of the production line. The model is not expected to change significantly over time, but I could plan a periodic imputation-model refresh to ensure the model stays up to date. Any help on this is highly appreciated. Thank you.

stefvanbuuren commented 6 years ago

How to export and re-use the imputation model?

I have done a little thinking about what we can do with the current objects produced by MICE. The idea, as formulated by Martin, is that the imputation model should be fixed at its last values. MICE only stores the imputed data, together with the model specification. All intermediate modelling coefficients used to arrive at the imputations are discarded.

Suppose we wish to fix the following aspects of the imputation model:

  1. We want new imputations to be generated using the fitted coefficients frozen at the last iteration of the training data;
  2. We want to restrict the set of donors to those from the training model;
  3. We want to use the same procedures for finding donors given the observed cells in the new data;
  4. We want the same number of imputations in the training and new data.

I believe that it is possible to meet all four requirements exactly in MICE. We could be tempted to save all regression coefficients from the latest iteration, but that does not help. The problem is that the coefficients of the current model are invalidated the moment one of the predictors is re-imputed, which occurs almost immediately as we move on to impute the next variable. Rather, in order to recreate the model used at iteration t, we would need to know the state of the imputations at iteration t-1. Given these, all coefficients and imputations can be recalculated exactly. So, in order to apply a given imputation model to new data, all we need to do is store the imputations from the previous iteration. Of course, it is much easier to use the current imputations and define the "last iteration" as the "first future iteration", which is as good as, if not better than, the last iteration. Thus, everything we need is already "exported" in the current mids object.

Recreating the next iteration can be done by the mice.mids() function. The only new thing that needs to be made is an additional newdata argument for this function. The test is that the new data do not influence the imputes for the training data. So the imputations should be the same whether we specify newdata or not.
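In call form, the proposal amounts to something like the following sketch; newdata was not an argument of mice.mids() at the time of writing, so the interface shown is the one proposed here, not an existing one (train and test are hypothetical data frames):

library(mice)

imp.train <- mice(train, m = 5, maxit = 10, seed = 1)  # impute the training data
imp.both  <- mice.mids(imp.train, newdata = test)      # proposed: test rows are
                                                       # imputed but never inform
                                                       # the model fits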

The procedure to achieve this is as follows:

  1. Define two streams of random numbers, one for the old data, and one for the new data. Initialize the stream for the old data from the last saved random seed taken from the mids object.
  2. In the training data, fill in the current imputation from your saved mids object;
  3. Initialize the missing data in newdata by random draws from the marginal, as usual;
  4. NOW DO THE TRICK: Before imputing $Y_1$, set all $Y_1$ in the new data temporarily to missing (this will ensure that all new data are ignored by the mice.impute.xxx() imputation procedure);
  5. Impute the missing data in the training data (using the old random sequence), impute the missing data in the new data using the independent random generator, and store both sets of imputations;
  6. Go to variable $Y_2$, and repeat from step 4, until we are at the last column;
  7. Add a few extra iterations to get convergence in the new data.

As described, we need two streams of random generators if we want exact correspondence (which is useful for testing), but in practice that may not be worth the trouble. We obtain a new mids object, with imputations for both the training data and the new data. Discard the training data, and do your analysis on the imputed new data.
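As a side note, "two streams of random numbers" can be mimicked in base R by saving and restoring .Random.seed; a minimal illustration (not something mice exposes):

set.seed(1); stream_old <- .Random.seed   # state for the training-data draws
set.seed(2); stream_new <- .Random.seed   # independent state for the new-data draws

.Random.seed <- stream_old
rnorm(2)                                  # draws for the training data
stream_old <- .Random.seed                # remember where this stream stopped

.Random.seed <- stream_new
rnorm(2)                                  # new-data draws, unaffected by the above
stream_new <- .Random.seed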

I believe this works if

  1. the new data contains any number of new records;
  2. the records are assumed to be exchangeable (e.g., no time-dependencies beyond those abstracted in the imputation model);
  3. the new record contains any number of missing values, including completely empty records;
  4. the new data contains no new variables (but some variables in newdata may be entirely missing and will get imputed);
  5. there are no new levels in the factors (but some levels in newdata may be entirely missing and will get imputed).

Would such be useful to your application?

Stef.

IyarLin commented 6 years ago

Hi,

I second the request for the ability to project the imputation model from a given dataset onto a new dataset with the same variables, for exactly this use case of partitioning into train and test sets. @stefvanbuuren your suggestion sounds like it should do the trick.

Many thanks and kudos for the great package!

Iyar

micdonato commented 5 years ago

I would like to do the same: I have some datasets where I have only some missing values in a single column, and others where that single column is completely missing (no predictors are missing, though). I admit that I tried to understand what @stefvanbuuren was suggesting, but I have no idea how to implement that.

Is "old data" the same as "training data"? What are the "two streams of random numbers" exactly?

Is it possible to get an example? Maybe with nhanes split in two?

DavidBamat commented 4 years ago

Wondering if any examples/vignettes are available for conducting the procedure that @stefvanbuuren describes.

stefvanbuuren commented 4 years ago

Sorry, no examples yet.

prockenschaub commented 4 years ago

I have given this a try, (roughly) following the strategy outlined by Stef.

Code to get imputations for previously unseen test data can be found here, with a minimal example here.

The function creates observations for the test set in the following way:

  1. Take the previously fit mids object (which contains the last set of imputations for the training data)
  2. Create a new mids object for the test data by calling mice(test_data, maxit = 0). This will initialize the missing data in the test set by random draws from observed values in the new data (this is the default when mice() is called). Note: alternatively, these could be initialised with values from the training + test set, particularly if the test set is only one observation.
  3. Mark all values in the test data as missing to ensure that they are ignored by the mice.impute.xxx() imputation procedure. Put differently, we ask mice to draw imputations for every single value in the test data, even if they have been observed.
  4. Combine the two mids objects into one by appending all data items ($data, $imp, $nmis, $where) of the test mids object to the training mids object.
  5. Simply sample via mice.mids() on the combined object. Trick: before the first round of imputations and after each single imputation (i.e., each call to mice:::sampler.univ()), replace those values in $imp that came from the observed/non-missing test data (and were thus unnecessarily imputed because they were marked as missing despite the fact that we observed them) with the actually observed values. This is done via the post-processing functionality of mice.
  6. Run the imputation for multiple iterations so that the imputations in the test data converge
  7. Throw away the training data and keep only the imputations for the test data
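A usage sketch of the helper described above (the function and file name come from the linked repository; the train/test split of nhanes is made up for illustration):

library(mice)
source("mice.reuse.R")   # helper from the linked repository (path assumed)

train <- nhanes[1:20, ]
test  <- nhanes[21:25, ]

imp.train <- mice(train, m = 5, maxit = 10, method = "pmm", seed = 1)
imp.test  <- mice.reuse(imp.train, test, maxit = 5)   # steps 1-7 happen in here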

On the nhanes dataset with PMM this approach seems to work (see example), but I haven't gotten around to doing extensive testing. I also haven't bothered to create two streams of random numbers.

There is a good chance that I have disregarded some intricacies of mice, so please do let me know if something in my approach does not make sense or is obviously wrong.

SrGh31 commented 4 years ago

@prockenschaub Thank you so much, mate! I was facing the same problem and had been searching for a week for how to use mice on test data using the imputed training data. Stef van Buuren kindly advised here and you made the neat function. Thanks from the bottom of my heart, you saved me weeks' worth of precious time.

al-obrien commented 4 years ago

Would the work completed by @prockenschaub be a feature within the scope of a direct addition to mice? I imagine there could be many use cases for this in prediction modelling (e.g., applying the preprocessing during CV to the training and holdout folds).

prediction2020 commented 4 years ago

I am pretty amazed (and shocked to a certain degree) that this is not an implemented functionality of mice, especially since this issue was raised over 3 years ago. One of the most common applications of imputation is machine learning modelling, and here best practice clearly dictates that for each split of the data (and with nested cross-validation there can be many splits) we should fit the imputation model on the training set and apply it to the test or validation set. All this is necessary to avoid leakage. While I do agree that there are other applications for imputation, not providing the means to perform best-practice imputation for machine learning in 2020 is really weird for a package that is generally regarded as the "leading" imputation package in the data science domain...

stefvanbuuren commented 4 years ago

Agree that it would be useful. Nothing prevents you from contributing a pull request that would add it to mice.

prediction2020 commented 4 years ago

Do not get me wrong, Stef, I am deeply grateful for the work that volunteers do, and mice is an amazing package. But not all of us have the time and the ability to contribute to packages.

And, free or not, any software or package is aimed at users. And as a user I can state that the mice package currently lacks a core functionality, given that one of the main use cases of imputation is machine learning modelling. Any software lacking a core functionality is less useful.

What you do with this information is up to you, and - frankly - I understand if you don't care, but you shouldn't be surprised when people complain about it.

stefvanbuuren commented 4 years ago

I do care about users, but - like you - don't have the time. If someone were able to fund this work, that could help.

prockenschaub commented 4 years ago

Since this feature continues to be requested and my solution appears to have helped some of the above posters, I would be more than happy to add it to mice via a pull request over the next couple of months.

@stefvanbuuren: would there be some support from a seasoned mice developer available to help me navigate some of the intricacies of the code base and unit tests?

stefvanbuuren commented 4 years ago

@prockenschaub Yes, a PR is welcome. Please let me know if there are certain parts in your solution that you want me to take a look at.

prediction2020 commented 4 years ago

@prockenschaub First of all, thanks for your work!

Two quick questions, if you don't mind:

Thanks in advance!

stefvanbuuren commented 4 years ago
prediction2020 commented 4 years ago

@stefvanbuuren Yes, I do know that mice supports this in general (we use it exactly for that; see here: https://github.com/prediction2020/missing-value-analysis).

I was specifically asking whether the mice.reuse function of @prockenschaub also covers mixed-type data since, if I read it right, his linked example covers numerical data only.

(and yes, I mean mixes of numerical and categorical :-) )

prockenschaub commented 4 years ago

Nothing should prevent my function from working with mixed data. My function is mostly a wrapper around the built-in functionality of mice; it only adds additional rows to the pretrained mids object and messes with the internal missingness flags to ensure that the new rows are only used in the sampling and don't lead to any further model training.

@prediction2020 if you run into any problems with mixed data, let me know and I can have a look where they might originate.

As for rpy2, I think I have seen mice work with it, and there is nothing obvious that should prevent my solution from working with it, except perhaps namespace issues until it is integrated into mice.

prediction2020 commented 4 years ago

@prockenschaub thank you! I will try to implement your function with rpy2 for mixed data in the upcoming days and will let you know if I run into any problems!

stefvanbuuren commented 4 years ago

@prockenschaub. I looked into your code. It contains many useful elements and goes a long way, but it does not fully implement the algorithm that I suggested above. Your code tweaks the where parameter to obtain imputes. This parameter specifies which cells should be imputed, but it does not make the cells missing. Hence the observed values in the test data will inadvertently leak into the imputation model.

What is needed is to change ry, a logical vector that flags which cells are observed. If we overwrite the part of ry that corresponds to the rows of the test data with FALSE, then the imputation model will ignore the information in those records. Still, mice will produce imputations for any missing values in these records, which is precisely what we want. The problem is that we cannot change ry from the outside (hard-overwriting and resetting NA's in data isn't a real option). We need to go deeper into sampler() to tamper with ry, which is probably a little challenging if you haven't written that code yourself.
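Conceptually (a sketch of the intended mechanics, not mice's actual source), the fix separates "rows that fit the model" from "cells that get imputed":

# Inside a univariate imputation step (sketch; `ignore` flags the test rows):
ry <- !is.na(y)        # TRUE where y is observed: candidate rows for model fitting
ry[ignore] <- FALSE    # observed test rows no longer inform the model ...
wy <- is.na(y)         # ... while only genuinely missing cells receive imputations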

I have now implemented an experimental solution in a separate branch that adds a new ignore argument to mice(), which does the overwriting trick in sampler() and lower layers. In addition, I added a newdata argument to mice.mids(). These functions would support two different ways for users to fit on training data and impute test data. Also, I extended the mids object to include the ignore vector.
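In call form, the two routes would look roughly like this (argument names as described above; exact semantics still experimental at this point, and train/test are hypothetical data frames):

# Route 1: stack train and test, and flag the test rows as ignored
stacked <- rbind(train, test)
imp <- mice(stacked,
            ignore = rep(c(FALSE, TRUE), c(nrow(train), nrow(test))),
            m = 5, seed = 1)

# Route 2: re-use a fitted mids object on new data
imp.train <- mice(train, m = 5, seed = 1)
imp.test  <- mice.mids(imp.train, newdata = test)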

Things are not yet perfect, and there are still a couple of things to consider:

I welcome any feedback on this approach. Would this work for your use cases?

prockenschaub commented 4 years ago

Thanks @stefvanbuuren, this is really helpful! Yes, I think this would cover my personal use cases.

What is needed is to change ry, a logical vector that flags which cells are observed.

The difficulty that you describe in getting hold of ry without going into mice's internals was really one of the main challenges in getting this to work. However, just to give some peace of mind to those who might have used my method: if you run debug(mice:::sampler.univ) in my example and print ry in each iteration, you will see that it is indeed always FALSE for rows of the test set. The way this is achieved is in fact a hard overwrite of data (row 62 in my code), together with a post-processing function (row 81 in my code). I therefore believe that my code was valid (if dirty) and that any results that people might have obtained were valid and not subject to leakage. That being said, I absolutely agree with you that this is a major hack and should be replaced if possible.

I like your solution a lot; it ended up significantly simpler than I anticipated. I will have a thorough test run through your code over the weekend. With regard to the issues that you mention:

Should mice.mids() return a mids object for the combined data (as it is now) or for the newdata only (am leaning towards the latter)?

I tend to agree, with a view to my immediate use case, that returning newdata only is the more obvious choice, but I could imagine cases where returning both might be preferable. For example, if I want to continuously update the imputation with new data, convergence would be achieved much quicker. However, thinking about it more, this is probably easily achieved by simply combining the mids objects of the new and old data via mice:::rbind.mids? (By the way, I completely missed that mice:::rbind.mids existed and shoddily implemented it myself...)
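For example (a sketch; rbind() dispatches to the mids method when its arguments are mids objects built on data with identical columns, and imp.old/imp.new are hypothetical names):

imp.all <- rbind(imp.old, imp.new)   # one mids object, rows and imputations appended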

Plotting convergence with the combined mids object produced by mice.mids() is currently not possible.

Am I right in my interpretation that one could plot the convergence of the entire data (train + test) but one could not get means / sdevs for the test set only?

No tests or vignettes yet.

I will try to create some examples and tests for this over the weekend.

While the approach appears compatible with the existing univariate imputation routines in mice, it doesn't work with multivariate imputation functions like mice.impute.jomoImpute.

I suppose this is because mice simply wraps the functionality of mitml? Extending this to the multivariate imputation functions would therefore require a similar ignore parameter in their interfaces? Since all currently supported multivariate imputation methods come from the same package, I could raise this issue on their GitHub page and see if it can be considered for their package as well.

And, perhaps some other issues might surface

It wouldn't surprise me, but I think the work so far is a great start :)

stefvanbuuren commented 4 years ago

@prockenschaub

Additional point:

EviVal commented 3 years ago

How could we use the mice.reuse function provided by @prockenschaub with only one observation?

prockenschaub commented 3 years ago

I guess you ran into a "'mice' detected constant and/or collinear variables. No predictors were left after their removal." error?

Constant and collinearity removal is turned on by default in mice and excludes your observation in the setup of the mids object for the test data in the mice.reuse function. You can replace line 50 in mice.reuse.R

mids.new <- mice(newdata, mids$m, where = all_miss, maxit = 0)

with

mids.new <- mice(newdata, mids$m, where = all_miss, maxit = 0, remove.collinear = FALSE, remove.constant = FALSE)

and I think that should do the trick. I have also updated my code to include this.

EviVal commented 3 years ago

Yes it did the trick! Many thanks!!!!

stefvanbuuren commented 3 years ago

Commit 46171f9 merges the work on the ignore argument and the filter() method into the main version, so the functionality is now available in mice 3.11.7. Thanks @prockenschaub for leading this work.

stefvanbuuren commented 3 years ago

mice 3.12.0 now includes the ignore argument. Some comments:

mice.impute.norm <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry                 # impute where y is not observed
  x <- cbind(1, as.matrix(x))                # add an intercept column
  parm <- .norm.draw(y, ry, x, ...)          # draw beta and sigma from their posterior
  x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma
}

mice.impute.normdump <- function(y, ry, x, wy = NULL, ...) {
  if (is.null(wy)) wy <- !ry
  x <- cbind(1, as.matrix(x))
  parm <- .norm.draw(y, ry, x, ...)
  betadump <<- c(betadump, parm$beta)        # side effect: append the drawn
                                             # coefficients to a global vector
  x[wy, ] %*% parm$beta + rnorm(sum(wy)) * parm$sigma
}
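A usage sketch for the normdump variant above, which echoes the "tamper with mice.impute.xxx()" route suggested early in this thread (assumptions: the betadump collector must already exist in the workspace, and the incomplete variables are numeric so the normal model applies):

betadump <- numeric(0)   # global collector written to via <<- above
imp <- mice(nhanes, method = "normdump", m = 1, maxit = 2,
            printFlag = FALSE, seed = 1)
betadump                 # drawn regression coefficients, concatenated per
                         # variable and iteration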

Closing now. Thanks all for bringing up and discussing the issue. It makes for a better mice.

EviVal commented 3 years ago

Hey prockenschaub!

I am using mice.reuse and I get this warning: invalid factor level, NA generated

The problem seems to be at mids.append in the loop:

for (i in names(x$imp)) {
  if (i %in% miss_xy) {
    # Imputations
    app_imp <- y$imp[[i]]
    rownames(app_imp) <- y_idx[rownames(app_imp)]
    app$imp[[i]] <- rbind(x$imp[[i]], app_imp)

    # nmis
    app$nmis[[i]] <- x$nmis[[i]] + y$nmis[[i]]
  }
}

However, I really can't understand why. The train and test set have exactly the same variable names and all predictors have the same factor levels. Any ideas?

Many thanks!

prockenschaub commented 3 years ago

@EviVal The error originates from the less than ideal set-up in my first code. I use mice to set up a mids object for both training and test, and then combine both. This implicitly assumes that both the train and test data.frames have the same structure AND contain all factor levels. If the test set is very small and, for example, only contains missing values for a variable, the mids object for train will look different from the mids object for test (the second does not know about the factor levels).

Reproducible example

library(mice)
library(tidyverse)

# Make sure to store `mice.reuse.R` in the same directory or change path
source("mice.reuse.R")

set.seed(42)
data <- data.frame(
  x = rnorm(100),
  z = factor(rep(c("a", "b", "c", "d"), each = 25))
)
data$z[runif(100) < 0.2] <- NA
data[100, "z"] <- NA # set the last row definitely to missing

imp.train <- mice(data[1:99, ], maxit = 5, m = 2, seed = 1)
#> 
#>  iter imp variable
#>   1   1  z
#>   1   2  z
#>   2   1  z
#>   2   2  z
#>   3   1  z
#>   3   2  z
#>   4   1  z
#>   4   2  z
#>   5   1  z
#>   5   2  z
imp.train
#> Class: mids
#> Number of multiple imputations:  2 
#> Imputation methods:
#>         x         z 
#>        "" "polyreg" 
#> PredictorMatrix:
#>   x z
#> x 0 1
#> z 1 0

imp.test <- mice.reuse(imp.train, data[100, ], maxit = 1)
#> Warning in `[<-.factor`(`*tmp*`, ri, value = -0.592225372961588): invalid factor
#> level, NA generated

#> Warning in `[<-.factor`(`*tmp*`, ri, value = -0.592225372961588): invalid factor
#> level, NA generated
#> 
#>  iter imp variable
#>   6   1  z
#>   6   2  z

Solution

Do not use my code, but instead use the ignore argument in the new mice version 3.12.0, which should be able to deal with this just fine.

imp.ignore <- mice(data, ignore = c(rep(FALSE, 99), TRUE), maxit = 5, m = 2, seed = 1)
#> 
#>  iter imp variable
#>   1   1  z
#>   1   2  z
#>   2   1  z
#>   2   2  z
#>   3   1  z
#>   3   2  z
#>   4   1  z
#>   4   2  z
#>   5   1  z
#>   5   2  z
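To keep only the imputed test row afterwards, the filter() method that was merged together with ignore (see the commit note above) can subset the mids object; a sketch, assuming dplyr is loaded for the filter() generic:

library(dplyr)   # provides the filter() generic that mice's mids method extends

imp.test <- filter(imp.ignore, c(rep(FALSE, 99), TRUE))  # keep row 100 only
complete(imp.test, 1)                                    # first completed version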

arjunbazinga commented 3 years ago

Hi! I'm unable to find mice.reuse after importing mice. I checked sessionInfo() and I'm using mice_3.12.0. Is there something I need to do to enable it?

Sorry if it's the wrong place to ask

thanks Arjun

prockenschaub commented 3 years ago

mice.reuse was my own hacked function and is not part of the mice package (you can still find it here, but I wouldn't recommend using it anymore). mice version 3.12.0 contains the ignore parameter, which does the same thing in one go. Simply pass it a vector with FALSE for all rows that should be used during training and TRUE for all rows that should only be imputed (but not used during training). See the proposed solution in my example two comments ago for an idea of how to use it.

Thanks for reminding me, though; I urgently need to write a short vignette on using ignore. This probably won't happen this side of Christmas, but I hope to add it early next year.

Edit (22/03/2021): @AmeBol kindly pointed out that I mixed up the roles of TRUE and FALSE in my description of the ignore parameter above. Fixed now and in line with the help pages of mice.