amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0
446 stars 108 forks source link

Request: make predictions possible after using pool() #82

Open oren0e opened 6 years ago

oren0e commented 6 years ago

When using mice multiple imputation to generate multiple imputed sets and then running a model on each one of them (like elastic-net from the glmnet package) and then pooling the estimates to a single set - its not possible to use predict.glmnet() (or predict()) on the object that comes out of mice::pool(). The prediction functions expect a fitted model object while mice::pool() returns

An object of class mipo, which stands for 'multiple imputation pooled outcome'.

Is it possible to add the ability to make predictions from the return object of mice::pool()?

stefvanbuuren commented 6 years ago

There are actually multiple ways to create predictions from multiply imputed data. See Obtaining Predictions from Models Fit to Multiply Imputed Data for a comparison of two methods. I tend to deviate from Miles' recommendation, and would generally prefer Predict then combine because that corresponds to Rubin's rules and accounts for nonlinearities at the proper place.

Also see the discussion by Wood et al., who highlight the decisions to be made when accounting for model optimism.

It should be relatively straightforward to implement any of these approaches in mice.

mnarayan commented 5 years ago

Just want to add that this feature would be quite useful. I'm a bit surprised by the recommendations in the Miles paper, it doesn't really show the regimes where combine and predict would break down substantially.

SolomonMg commented 4 years ago

Any word on this feature? Would be very helpful!

stefvanbuuren commented 4 years ago

I have done some thinking on mice prediction workflows. Here are two possibilities, one involving only a train datasets, another involving both a train and test dataset.

A

One data set: train, $n$ rows, $m$ imputations:

1. Standard mice: imp <- mice(train, …)
2. Standard complete-data analysis: fit <- with(imp, mymodel, …)
3. Calculate predictions: predlist <- lapply(fit$analysis, broom::augment.xxx, se.fit = TRUE, …) - We may need the data argument for some procedures. Perhaps use map().
4. Apply Rubin's rules to pool predictions: predmat <- pool(predlist, …), which results in a n * 3 matrix: .fitted, .residuals, .se.fit
5. Bind to data: ready <- bind_cols(train, predmat)

B

Two data sets: train ($n_0$ rows) and test ($n_1$ rows), $m$ imputations.

1. Standard mice: imp.train <- mice(train, …)
2. Impute test data: imp.test <_ mice.mids(imp.train, newdata = test)
3. fit <- with(imp.test, mymodel, …)
4. Apply steps A3 and A4
5. Bind to data: ready <- bind_cols(test, predmat)

A and B are extreme scenarios, and obviously, we can mix aspects of both.

What I would like to know is whether A or B as described would be useful for applications. Suggestions for alternatives and tweaks are welcome.

cryanking commented 3 years ago

I came to request exactly option B. My application is to use MICE as part of a processing pipeline for predictive models, where the standard is to have a true hold-out set. In deployment I also don't want to update the estimation - just apply it (including the imputation model) to the example at hand. It's kind of complicated with predictive mean matching because (I think) the complete training dataset is needed at inference time to select the replacements at each step. Exporting the model to something light weight would be huge.

JonNDCN commented 3 years ago

I'd be very happy with A.

angelrodriguez2020 commented 3 years ago

Dear all, I'd like to know the current status of this issue. In fact, I only need the pooled predicted values, not their standard errors, so I suppose averaging the predictions would be enough. If mice does not provide that yet, would this script be sensible (13 imputations)?:

fit <- with(impsppb, glm(compo ~ edad+sexo+fumarw2,family = quasibinomial)) for (i in 1:13) { assign(paste0("fit",i), data.frame(fit[["analyses"]][[i]][["data"]][["hi1"]],fit[["analyses"]][[i]][["fitted.values"]])) assign(paste0("fit",i), names<-(get(paste0("fit",i)), c("hi1", "pred"))) } fittot<-Reduce(function(x,y) merge(x = x, y = y, by = "hi1"), list(fit1,fit2,fit3,fit4,fit5,fit6,fit7,fit8,fit9,fit10,fit11,fit12,fit13)) fittot$predef <- rowMeans(fittot[,c(2:14)], na.rm=TRUE) fittot <- fittot[c(1,15)]

Thank you.

ASKurz commented 3 years ago

Option A would be spectacular. It's exactly what I'd want to teach to my students.

stjeandenise commented 3 years ago

Another vote for Option A!! I came looking for exactly that.

ASKurz commented 3 years ago

For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/

stjeandenise commented 3 years ago

For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/

🤗 you are my favorite person @ASKurz

ASKurz commented 3 years ago

Cheers! 🍻

JonNDCN commented 3 years ago

Thank you so much @ASKurz, just what I needed

stefvanbuuren commented 3 years ago

The script below uses the @ASKurz example in a more micy way. It is essentially Rubin's rules applied to model predictions instead of model parameters.

library(mice)
library(dplyr)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

# impute & analyse as usual
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df

# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
                 dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}

Perhaps we could evolve this into a predict.mira() function.

markdanese commented 2 years ago

FWIW, predm[[1]] doesn't have a df object for glm, so dfcom ends up being NULL. It probably makes sense to get it from getfit(fit)[[1]]$df.null assuming the null df is the correct one. Apologies if I am missing something obvious.

stefvanbuuren commented 2 years ago

@markdanese Thanks for flagging this. The code is an example. It works for lm(), but needs tweaking for other models.

RStellato commented 2 years ago

Perhaps we could evolve this into a predict.mira() function.

Would really appreciate that implementation!

reetakankaanpaa commented 1 year ago

Hi, the option A above looks tempting, I'm trying to plot predicted values from lme model using multilevel multiple imputed longitudinal data.

However, when I run the code on the third step, predlist <- lapply(fit$analysis, broom::augment.xxx, se.fit = TRUE) I repeatedly get an error: 'augment.xxx' is not an exported object from 'namespace:broom', regardless of whether I use broom.mixed::augment.lme.

I tried it with nlme Orthodent data too, using fit <- with(imp, lm(distance~age*Sex, data = Orthodont)) predlist <- lapply(fit$analysis, broom::augment, se.fit = TRUE, data = Orthodont) predmat <- pool(predlist) Here, I get an error: Error in object$analyses[[1]] : subscript out of bounds

Could someone please give advice, how to get fitted values for plotting?

I also found the instructions from Haile, Sarah, 2019, Multiple Imputation using R, and noticed that "However van Buuren appears to disagree with this approach, which is perhaps why it’s not part of mice yet."

stefvanbuuren commented 1 year ago

Your code says fit$analysis. Does it help to say fit$analyses?

I don't know Haile.

ne-jos commented 1 year ago

I am trying to calculate predicted probabilities from a multiply imputed mixed effect logistic regression model:

df_select <- df_collapsed %>% select("op.day","age.admission.nmds", "gender", "ethnicity2", "asa2", "C3.score.categories", "extent.nzcr", "site", "dep13.nmds.quint", "op.year", "mortality.90d", "dhb.nmds", "days.to.surg") imp <- mice(df_select, m=10) fit <- with(imp, glmer(mortality.90d ~ op.day + age.admission.nmds + gender + ethnicity2 + asa2 + C3.score.categories + extent.nzcr + site + dep13.nmds.quint + op.year + days.to.surg + (1 | dhb.nmds), family = binomial()))

I have tried option A but I am running into the same problem as mentioned above, swtiching 'analysis' for 'analyses' does not seem to help. Any suggestions would be greatly appreciated.

thomvolker commented 1 year ago

Perhaps using broom.mixed:::augment.merMod will give you the correct predictions. However, pool() will not work, because pooling currently only works on the level of the analyses (i.e., pooling the estimated parameters), not on the level of the data (i.e., pooling the predictions). If you want to pool the predictions, you will have to average the predictions and calculate the variances manually.

stefvanbuuren commented 1 year ago

Please look at #32, #299 and the ignore argument in mice()

Van: jeandigitale @.> Datum: dinsdag, 22 augustus 2023 om 20:58 Aan: amices/mice @.> CC: Buuren, S. van (Stef) @.>, Comment @.> Onderwerp: Re: [amices/mice] Request: make predictions possible after using pool() (#82) CAUTION: This email originated from outside of Utrecht University. Do not click links or open attachments unless you recognize the sender and know the content is safe.

Like @cryankinghttps://github.com/cryanking, I'm trying to use imputation in a real-time prediction pipeline and am looking for a way to estimate imputation models with training data and then apply those models to test data without having to re-estimate them. I have been searching for a package that has this feature with no luck yet!

— Reply to this email directly, view it on GitHubhttps://github.com/amices/mice/issues/82#issuecomment-1688755576, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AA55ALYXMTH6THPAAJMQ2ZDXWT6OLANCNFSM4EZN5F5A. You are receiving this because you commented.Message ID: @.***>

fthielen commented 5 months ago

For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/

This link has changed and is now: https://solomonkurz.netlify.app/blog/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/

fthielen commented 5 months ago

The script below uses the @ASKurz example in a more micy way. It is essentially Rubin's rules applied to model predictions instead of model parameters.

library(mice)
library(dplyr)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

# impute & analyse as usual
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df

# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
                 dimnames = list(NULL, c("fit", "se.fit", "df")))
for(i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}

Perhaps we could evolve this into a predict.mira() function.

Is this the most recent solution for lm() models? predict.mira() does not seem to be implemented (yet).

gerkovink commented 5 months ago

Voilà. Will need to think about the default choices that are made in predict.lm() before we implement this further.

library(purrr)
library(magrittr)
library(dplyr)
library(mice)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

imp <- b_small %>% 
  mice(seed = 540, m = 10, method = "norm", print = FALSE)

# fit with map() workflow
fit1 <- imp %>% 
  complete("all") %>%
  map(~.x %$% lm(iqv ~ 1 + ses))

# fit with with.mids() workflow
fit2 <- with(imp, lm(iqv ~ 1 + ses))

# function to pool individual values cf. Rubin's rules
predict.mira <- function(fit, tiny = TRUE){
  # preprocess to allow for flexible workflows
  if("mira" %in% class(fit)){
    fit <- fit %>% 
      .$analyses # extract the fitted analyses
  } 
  # create predictions and intervals
  pred <- fit %>% 
    map(~.x %>%  
          predict(se.fit = TRUE) %>% # create predictions with corresponding SE
          as.data.frame() %>% 
          select(fit, se.fit, df)) # purge irrelevant

  # pool predictions
  # convert predictions to array and call pool.scalar over dimensions 1 and 2
  pooled <- array(unlist(pred), dim = c(dim(pred[[1]]), length(pred))) %>% 
    apply(., c(1), function(x) pool.scalar(Q = x[1,], 
                                           U = x[2,]^2, 
                                           n = x[3,1] + 1)) %>% 
    sapply(unlist) %>% 
    t() %>% 
    as_tibble() %>% 
    mutate(se.fit = sqrt(t))

  if(!tiny){
    return(pooled)
  } else 
    return(pooled %>% select(qbar, se.fit, df))
}

#tiny output
predict.mira(fit1)
#> # A tibble: 50 × 3
#>       qbar se.fit    df
#>      <dbl>  <dbl> <dbl>
#>  1 -1.57    0.489 27.2 
#>  2 -1.12    1.29   3.45
#>  3 -0.578   0.335 29.3 
#>  4  2.40    0.719 31.1 
#>  5  0.450   0.865  4.94
#>  6  0.239   1.10   3.36
#>  7  0.0420  0.315 32.0 
#>  8 -0.578   0.335 29.3 
#>  9  0.0420  0.315 32.0 
#> 10  1.03    0.430 32.7 
#> # ℹ 40 more rows
predict.mira(fit2)
#> # A tibble: 50 × 3
#>       qbar se.fit    df
#>      <dbl>  <dbl> <dbl>
#>  1 -1.57    0.489 27.2 
#>  2 -1.12    1.29   3.45
#>  3 -0.578   0.335 29.3 
#>  4  2.40    0.719 31.1 
#>  5  0.450   0.865  4.94
#>  6  0.239   1.10   3.36
#>  7  0.0420  0.315 32.0 
#>  8 -0.578   0.335 29.3 
#>  9  0.0420  0.315 32.0 
#> 10  1.03    0.430 32.7 
#> # ℹ 40 more rows
# full output
predict.mira(fit1, tiny = FALSE)
#> # A tibble: 50 × 29
#>        m    qhat1    qhat2  qhat3   qhat4   qhat5  qhat6  qhat7   qhat8   qhat9
#>    <dbl>    <dbl>    <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
#>  1    10 -1.34    -1.73    -1.90  -1.81   -1.58   -1.46  -1.18  -1.41   -1.47  
#>  2    10  0.135    0.0911  -0.663 -0.291  -2.49   -3.18  -2.03  -1.06   -0.122 
#>  3    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  4    10  1.97     2.54     2.42   2.79    2.05    2.61   2.49   2.14    2.12  
#>  5    10 -0.794   -0.130    1.18   0.263   0.818  -0.448  0.973  1.57    0.275 
#>  6    10 -1.25     1.64     1.14   0.552   0.0164 -1.01   0.966 -0.796   0.149 
#>  7    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#>  8    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  9    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#> 10    10  0.835    1.07     0.933  1.21    0.803   1.21   1.23   0.921   0.886 
#> # ℹ 40 more rows
#> # ℹ 19 more variables: qhat10 <dbl>, u1 <dbl>, u2 <dbl>, u3 <dbl>, u4 <dbl>,
#> #   u5 <dbl>, u6 <dbl>, u7 <dbl>, u8 <dbl>, u9 <dbl>, u10 <dbl>, qbar <dbl>,
#> #   ubar <dbl>, b <dbl>, t <dbl>, df <dbl>, r <dbl>, fmi <dbl>, se.fit <dbl>
predict.mira(fit2, tiny = FALSE)
#> # A tibble: 50 × 29
#>        m    qhat1    qhat2  qhat3   qhat4   qhat5  qhat6  qhat7   qhat8   qhat9
#>    <dbl>    <dbl>    <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl>
#>  1    10 -1.34    -1.73    -1.90  -1.81   -1.58   -1.46  -1.18  -1.41   -1.47  
#>  2    10  0.135    0.0911  -0.663 -0.291  -2.49   -3.18  -2.03  -1.06   -0.122 
#>  3    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  4    10  1.97     2.54     2.42   2.79    2.05    2.61   2.49   2.14    2.12  
#>  5    10 -0.794   -0.130    1.18   0.263   0.818  -0.448  0.973  1.57    0.275 
#>  6    10 -1.25     1.64     1.14   0.552   0.0164 -1.01   0.966 -0.796   0.149 
#>  7    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#>  8    10 -0.512   -0.665   -0.821 -0.661  -0.674  -0.439 -0.266 -0.520  -0.573 
#>  9    10  0.00583  0.00304 -0.146  0.0569 -0.106   0.196  0.308  0.0340 -0.0114
#> 10    10  0.835    1.07     0.933  1.21    0.803   1.21   1.23   0.921   0.886 
#> # ℹ 40 more rows
#> # ℹ 19 more variables: qhat10 <dbl>, u1 <dbl>, u2 <dbl>, u3 <dbl>, u4 <dbl>,
#> #   u5 <dbl>, u6 <dbl>, u7 <dbl>, u8 <dbl>, u9 <dbl>, u10 <dbl>, qbar <dbl>,
#> #   ubar <dbl>, b <dbl>, t <dbl>, df <dbl>, r <dbl>, fmi <dbl>, se.fit <dbl>

Created on 2024-06-10 with reprex v2.1.0

stefvanbuuren commented 5 months ago

Since predict() is pretty well adopted and implemented, I assume this might work more generally, i.e., for any object for which there is a predict method. Right?

Would there be value in using broom::augment for this feature?

gerkovink commented 5 months ago

Yes; should work universally. Don't know if broom::augment() would apply to all of those.

darentsai commented 2 months ago

Dear @stefvanbuuren

Thank you for the example code for model prediction using imputed data. In that example, what you actually calculated were the "fitted values", not the "predicted values", because you didn't input a independent test dataset into predict() function.

If there is a separate test dataset without any missing values, assumed to be collected in the future,

b_small_test <- brandsma %>%
  filter( complete.cases(.) ) %>%  # This line preserves all the complete records
  slice_sample(n = 20) %>%
  select(ses, iqv, iqp)

we can just add newdata = b_small_test into predict(), and the remaining part works perfectly.

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, newdata = b_small_test, se.fit = TRUE)

However, more realistically, if the test dataset also includes missing values,

b_small_test_with_na <- mice::ampute(b_small_test)$amp

then newdata = b_small_test_with_na will lead to the predictions also being missing values. How can the example code be modified to handle this situation?


I think we can use the imputed dataset from the model training phase to fill in missing values in the test set.

imp_test <- mice.mids(imp, newdata = b_small_test_with_na, maxit = 5)

Unfortunately, imp_test is still a multiply imputed object with m=10 imputed results, and I cannot figure out how to accommodate it into lapply(getfit(fit), predict, se.fit = TRUE).