oren0e opened this issue 6 years ago (status: open)
There are actually multiple ways to create predictions from multiply imputed data. See Miles, "Obtaining Predictions from Models Fit to Multiply Imputed Data", for a comparison of two methods. I tend to deviate from Miles' recommendation and would generally prefer "predict, then combine", because that corresponds to Rubin's rules and accounts for nonlinearities at the proper place.
Also see the discussion by Wood et al., who highlight the decisions to be made when accounting for model optimism.
It should be relatively straightforward to implement any of these approaches in mice.
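To make the distinction concrete, here is a minimal sketch of the two routes, assuming a hypothetical mids object imp, outcome y, predictor x, and new data newdat (all placeholder names):
# 'combine then predict': pool the coefficients, then predict once
fit     <- with(imp, lm(y ~ x))
coefs   <- summary(pool(fit))$estimate           # pooled coefficients
pred_cp <- cbind(1, newdat$x) %*% coefs          # linear predictor from pooled coefficients
# 'predict then combine': predict within each imputed analysis, then average
pred_m  <- sapply(getfit(fit), predict, newdata = newdat)
pred_pc <- rowMeans(pred_m)                      # Rubin's rules point estimate
For a linear model the two point estimates coincide; with a nonlinear link (e.g. logistic regression) they do not, which is where the order of the two steps starts to matter.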
Just want to add that this feature would be quite useful. I'm a bit surprised by the recommendations in the Miles paper; it doesn't really show the regimes where "combine then predict" and "predict then combine" would break down substantially.
Any word on this feature? Would be very helpful!
I have done some thinking on mice prediction workflows. Here are two possibilities: one involving only a training dataset, the other involving both a training and a test dataset.
A. One data set: train, $n$ rows, $m$ imputations:
1. Standard mice: imp <- mice(train, …)
2. Standard complete-data analysis: fit <- with(imp, mymodel, …)
3. Calculate predictions: predlist <- lapply(fit$analysis, broom::augment.xxx, se.fit = TRUE, …) - We may need the data argument for some procedures. Perhaps use map().
4. Apply Rubin's rules to pool predictions: predmat <- pool(predlist, …), which results in an n × 3 matrix: .fitted, .residuals, .se.fit
5. Bind to data: ready <- bind_cols(train, predmat)
Steps 3 and 4 could be wrapped into an augment.mira() function, a kind of meta-augment.
—
B. Two data sets: train ($n_0$ rows) and test ($n_1$ rows), $m$ imputations.
1. Standard mice: imp.train <- mice(train, …)
2. Impute test data: imp.test <- mice.mids(imp.train, newdata = test)
3. fit <- with(imp.test, mymodel, …)
4. Apply steps A3 and A4
5. Bind to data: ready <- bind_cols(test, predmat)
—
A and B are extreme scenarios, and obviously, we can mix aspects of both.
What I would like to know is whether A or B as described would be useful for applications. Suggestions for alternatives and tweaks are welcome.
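A rough end-to-end sketch of workflow B, assuming the proposed newdata interface of mice.mids() returns a mids object covering only the test rows, and using predict() in place of broom::augment(); the model formula and variable names are placeholders:
imp.train <- mice(train, m = 5, print = FALSE)              # B1
imp.test  <- mice.mids(imp.train, newdata = test)           # B2 (proposed interface)
fit       <- with(imp.test, lm(y ~ x1 + x2))                # B3 ('mymodel' placeholder)
predlist  <- lapply(getfit(fit), predict, se.fit = TRUE)    # B4, part A3
Q <- sapply(predlist, `[[`, "fit")                          # n1 x m predictions
U <- sapply(predlist, `[[`, "se.fit")^2                     # n1 x m within-imputation variances
predmat <- t(apply(cbind(Q, U), 1, function(z) {            # B4, part A4: Rubin's rules per row
  p <- pool.scalar(z[seq_len(ncol(Q))], z[-seq_len(ncol(Q))])
  c(.fitted = p$qbar, .se.fit = sqrt(p$t))
}))
ready <- cbind(test, predmat)                               # B5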
I came to request exactly option B. My application is to use MICE as part of a processing pipeline for predictive models, where the standard is to have a true hold-out set. In deployment I also don't want to update the estimation, just apply it (including the imputation model) to the example at hand. It's kind of complicated with predictive mean matching because (I think) the complete training dataset is needed at inference time to select the replacements at each step. Exporting the model to something lightweight would be huge.
I'd be very happy with A.
Dear all, I'd like to know the current status of this issue. In fact, I only need the pooled predicted values, not their standard errors, so I suppose averaging the predictions would be enough. If mice does not provide that yet, would this script be sensible (13 imputations)?:
# fit the analysis model in each of the 13 imputed datasets
fit <- with(impsppb, glm(compo ~ edad + sexo + fumarw2, family = quasibinomial))

# collect id ("hi1") and fitted values per imputation into fit1 ... fit13
for (i in 1:13) {
  assign(paste0("fit", i),
         data.frame(fit[["analyses"]][[i]][["data"]][["hi1"]],
                    fit[["analyses"]][[i]][["fitted.values"]]))
  assign(paste0("fit", i),
         `names<-`(get(paste0("fit", i)), c("hi1", "pred")))
}

# merge the 13 sets of fitted values by id and average them
fittot <- Reduce(function(x, y) merge(x = x, y = y, by = "hi1"),
                 list(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8,
                      fit9, fit10, fit11, fit12, fit13))
fittot$predef <- rowMeans(fittot[, c(2:14)], na.rm = TRUE)
fittot <- fittot[c(1, 15)]
Thank you.
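For what it's worth, a more compact sketch that should give the same averaged predictions, assuming hi1 is an id column of the data inside the mids object impsppb and every analysis uses the same rows in the same order:
# average the fitted probabilities across the 13 analyses (point estimates only)
predmat <- sapply(fit$analyses, fitted)                 # n x 13 matrix of fitted values
fittot  <- data.frame(hi1    = complete(impsppb, 1)$hi1,  # ids, same row order as the fits
                      predef = rowMeans(predmat))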
Option A would be spectacular. It's exactly what I'd want to teach to my students.
Another vote for Option A!! I came looking for exactly that.
For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/
🤗 you are my favorite person @ASKurz
Cheers! 🍻
Thank you so much @ASKurz, just what I needed
The script below uses the @ASKurz example in a more micy way. It is essentially Rubin's rules applied to model predictions instead of model parameters.
library(mice)
library(dplyr)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)

# impute & analyse as usual
imp <- mice(b_small, seed = 540, m = 10, method = "norm", print = FALSE)
fit <- with(imp, lm(iqv ~ 1 + ses))

# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, se.fit = TRUE)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
dfcom <- predm[[1]]$df

# pool predictions
pred <- matrix(NA, nrow = nrow(Q), ncol = 3,
               dimnames = list(NULL, c("fit", "se.fit", "df")))
for (i in 1:nrow(Q)) {
  pi <- pool.scalar(Q[i, ], U[i, ], n = dfcom + 1)
  pred[i, 1] <- pi[["qbar"]]
  pred[i, 2] <- sqrt(pi[["t"]])
  pred[i, 3] <- pi[["df"]]
}
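From the pooled fit, se.fit and df columns, t-based interval bounds for plotting can then be formed, for example:
# 95% interval bounds from the pooled predictions (sketch)
lwr <- pred[, "fit"] - qt(0.975, pred[, "df"]) * pred[, "se.fit"]
upr <- pred[, "fit"] + qt(0.975, pred[, "df"]) * pred[, "se.fit"]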
Perhaps we could evolve this into a predict.mira() function.
FWIW, predm[[1]] doesn't have a df object for glm, so dfcom ends up being NULL. It probably makes sense to get it from getfit(fit)[[1]]$df.null, assuming the null df is the correct one. Apologies if I am missing something obvious.
@markdanese Thanks for flagging this. The code is an example. It works for lm(), but needs tweaking for other models.
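For instance, a possible tweak for glm() fits (a sketch, not an official recipe): pool on the link scale, take the complete-data degrees of freedom from df.residual(), and back-transform afterwards.
# glm() variant: predictions on the link scale, dfcom from df.residual()
predm <- lapply(getfit(fit), predict, type = "link", se.fit = TRUE)
Q     <- sapply(predm, `[[`, "fit")
U     <- sapply(predm, `[[`, "se.fit")^2
dfcom <- df.residual(getfit(fit)[[1]])
# ... pool with pool.scalar() as above, then back-transform, e.g. plogis() for logistic models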
Perhaps we could evolve this into a predict.mira() function.
Would really appreciate that implementation!
Hi, option A above looks tempting. I'm trying to plot predicted values from an lme model using multilevel, multiply imputed longitudinal data.
However, when I run the code of the third step, predlist <- lapply(fit$analysis, broom::augment.xxx, se.fit = TRUE), I repeatedly get an error: 'augment.xxx' is not an exported object from 'namespace:broom', regardless of whether I use broom.mixed::augment.lme.
I tried it with the nlme Orthodont data too, using fit <- with(imp, lm(distance ~ age * Sex, data = Orthodont)); predlist <- lapply(fit$analysis, broom::augment, se.fit = TRUE, data = Orthodont); predmat <- pool(predlist). Here, I get an error: Error in object$analyses[[1]] : subscript out of bounds
Could someone please give advice on how to get fitted values for plotting?
I also found the instructions from Haile, Sarah, 2019, Multiple Imputation using R, and noticed that "However van Buuren appears to disagree with this approach, which is perhaps why it’s not part of mice yet."
Your code says fit$analysis. Does it help to say fit$analyses instead?
I don't know Haile.
I am trying to calculate predicted probabilities from a multiply imputed mixed effect logistic regression model:
df_select <- df_collapsed %>%
  select("op.day", "age.admission.nmds", "gender", "ethnicity2", "asa2",
         "C3.score.categories", "extent.nzcr", "site", "dep13.nmds.quint",
         "op.year", "mortality.90d", "dhb.nmds", "days.to.surg")
imp <- mice(df_select, m = 10)
fit <- with(imp, glmer(mortality.90d ~ op.day + age.admission.nmds + gender +
                         ethnicity2 + asa2 + C3.score.categories + extent.nzcr +
                         site + dep13.nmds.quint + op.year + days.to.surg +
                         (1 | dhb.nmds), family = binomial()))
I have tried option A, but I am running into the same problem as mentioned above; switching 'analysis' for 'analyses' does not seem to help. Any suggestions would be greatly appreciated.
Perhaps using broom.mixed:::augment.merMod will give you the correct predictions. However, pool() will not work, because pooling currently only works on the level of the analyses (i.e., pooling the estimated parameters), not on the level of the data (i.e., pooling the predictions). If you want to pool the predictions, you will have to average the predictions and calculate the variances manually.
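A minimal sketch of that manual averaging for the glmer() example above, point estimates only, using the names from the earlier code:
# average predicted probabilities across the imputed analyses (no standard errors)
probs <- sapply(getfit(fit), predict, type = "response")   # n x m matrix of probabilities
pred  <- rowMeans(probs)
Averaging on the link scale first and converting with plogis() afterwards is a possible alternative.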
Like @cryanking (https://github.com/cryanking), I'm trying to use imputation in a real-time prediction pipeline and am looking for a way to estimate imputation models with training data and then apply those models to test data without having to re-estimate them. I have been searching for a package that has this feature with no luck yet!
Please look at #32, #299 and the ignore argument in mice().
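For reference, a minimal sketch of the ignore route for a train/test split (train and test are placeholder data frames):
# stack train and test, flag the test rows as 'ignore'
dat <- rbind(train, test)
ign <- rep(c(FALSE, TRUE), c(nrow(train), nrow(test)))
imp <- mice(dat, m = 5, ignore = ign, print = FALSE)
# completed test rows, imputed from models estimated on the training rows only
test_complete <- complete(imp, 1)[ign, ]
The ignored rows are imputed but do not feed the imputation models, which is the point for hold-out or deployment settings.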
For all other fans of Option A, here's a blog post showing how you can do it by hand: https://solomonkurz.netlify.app/post/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/
This link has changed and is now: https://solomonkurz.netlify.app/blog/2021-10-21-if-you-fit-a-model-with-multiply-imputed-data-you-can-still-plot-the-line/
Is the script above the most recent solution for lm() models? predict.mira() does not seem to be implemented (yet).
Voilà. Will need to think about the default choices that are made in predict.lm() before we implement this further.
library(purrr)
library(magrittr)
library(dplyr)
library(mice)

# generate data
set.seed(201)
b_small <-
  brandsma %>%
  filter(!complete.cases(.)) %>%
  slice_sample(n = 50) %>%
  select(ses, iqv, iqp)
imp <- b_small %>%
  mice(seed = 540, m = 10, method = "norm", print = FALSE)

# fit with map() workflow
fit1 <- imp %>%
  complete("all") %>%
  map(~.x %$% lm(iqv ~ 1 + ses))

# fit with with.mids() workflow
fit2 <- with(imp, lm(iqv ~ 1 + ses))

# function to pool individual values cf. Rubin's rules
predict.mira <- function(fit, tiny = TRUE){
  # preprocess to allow for flexible workflows
  if ("mira" %in% class(fit)) {
    fit <- fit %>%
      .$analyses                       # extract the fitted analyses
  }
  # create predictions and intervals
  pred <- fit %>%
    map(~.x %>%
          predict(se.fit = TRUE) %>%   # create predictions with corresponding SE
          as.data.frame() %>%
          select(fit, se.fit, df))     # purge irrelevant columns
  # pool predictions:
  # convert predictions to an array and apply pool.scalar() over the rows
  pooled <- array(unlist(pred), dim = c(dim(pred[[1]]), length(pred))) %>%
    apply(., c(1), function(x) pool.scalar(Q = x[1, ],
                                           U = x[2, ]^2,
                                           n = x[3, 1] + 1)) %>%
    sapply(unlist) %>%
    t() %>%
    as_tibble() %>%
    mutate(se.fit = sqrt(t))
  if (!tiny) {
    return(pooled)
  } else {
    return(pooled %>% select(qbar, se.fit, df))
  }
}
#tiny output
predict.mira(fit1)
#> # A tibble: 50 × 3
#> qbar se.fit df
#> <dbl> <dbl> <dbl>
#> 1 -1.57 0.489 27.2
#> 2 -1.12 1.29 3.45
#> 3 -0.578 0.335 29.3
#> 4 2.40 0.719 31.1
#> 5 0.450 0.865 4.94
#> 6 0.239 1.10 3.36
#> 7 0.0420 0.315 32.0
#> 8 -0.578 0.335 29.3
#> 9 0.0420 0.315 32.0
#> 10 1.03 0.430 32.7
#> # ℹ 40 more rows
predict.mira(fit2)
#> # A tibble: 50 × 3
#> qbar se.fit df
#> <dbl> <dbl> <dbl>
#> 1 -1.57 0.489 27.2
#> 2 -1.12 1.29 3.45
#> 3 -0.578 0.335 29.3
#> 4 2.40 0.719 31.1
#> 5 0.450 0.865 4.94
#> 6 0.239 1.10 3.36
#> 7 0.0420 0.315 32.0
#> 8 -0.578 0.335 29.3
#> 9 0.0420 0.315 32.0
#> 10 1.03 0.430 32.7
#> # ℹ 40 more rows
# full output
predict.mira(fit1, tiny = FALSE)
#> # A tibble: 50 × 29
#> m qhat1 qhat2 qhat3 qhat4 qhat5 qhat6 qhat7 qhat8 qhat9
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 -1.34 -1.73 -1.90 -1.81 -1.58 -1.46 -1.18 -1.41 -1.47
#> 2 10 0.135 0.0911 -0.663 -0.291 -2.49 -3.18 -2.03 -1.06 -0.122
#> 3 10 -0.512 -0.665 -0.821 -0.661 -0.674 -0.439 -0.266 -0.520 -0.573
#> 4 10 1.97 2.54 2.42 2.79 2.05 2.61 2.49 2.14 2.12
#> 5 10 -0.794 -0.130 1.18 0.263 0.818 -0.448 0.973 1.57 0.275
#> 6 10 -1.25 1.64 1.14 0.552 0.0164 -1.01 0.966 -0.796 0.149
#> 7 10 0.00583 0.00304 -0.146 0.0569 -0.106 0.196 0.308 0.0340 -0.0114
#> 8 10 -0.512 -0.665 -0.821 -0.661 -0.674 -0.439 -0.266 -0.520 -0.573
#> 9 10 0.00583 0.00304 -0.146 0.0569 -0.106 0.196 0.308 0.0340 -0.0114
#> 10 10 0.835 1.07 0.933 1.21 0.803 1.21 1.23 0.921 0.886
#> # ℹ 40 more rows
#> # ℹ 19 more variables: qhat10 <dbl>, u1 <dbl>, u2 <dbl>, u3 <dbl>, u4 <dbl>,
#> # u5 <dbl>, u6 <dbl>, u7 <dbl>, u8 <dbl>, u9 <dbl>, u10 <dbl>, qbar <dbl>,
#> # ubar <dbl>, b <dbl>, t <dbl>, df <dbl>, r <dbl>, fmi <dbl>, se.fit <dbl>
predict.mira(fit2, tiny = FALSE)
#> # A tibble: 50 × 29
#> m qhat1 qhat2 qhat3 qhat4 qhat5 qhat6 qhat7 qhat8 qhat9
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 10 -1.34 -1.73 -1.90 -1.81 -1.58 -1.46 -1.18 -1.41 -1.47
#> 2 10 0.135 0.0911 -0.663 -0.291 -2.49 -3.18 -2.03 -1.06 -0.122
#> 3 10 -0.512 -0.665 -0.821 -0.661 -0.674 -0.439 -0.266 -0.520 -0.573
#> 4 10 1.97 2.54 2.42 2.79 2.05 2.61 2.49 2.14 2.12
#> 5 10 -0.794 -0.130 1.18 0.263 0.818 -0.448 0.973 1.57 0.275
#> 6 10 -1.25 1.64 1.14 0.552 0.0164 -1.01 0.966 -0.796 0.149
#> 7 10 0.00583 0.00304 -0.146 0.0569 -0.106 0.196 0.308 0.0340 -0.0114
#> 8 10 -0.512 -0.665 -0.821 -0.661 -0.674 -0.439 -0.266 -0.520 -0.573
#> 9 10 0.00583 0.00304 -0.146 0.0569 -0.106 0.196 0.308 0.0340 -0.0114
#> 10 10 0.835 1.07 0.933 1.21 0.803 1.21 1.23 0.921 0.886
#> # ℹ 40 more rows
#> # ℹ 19 more variables: qhat10 <dbl>, u1 <dbl>, u2 <dbl>, u3 <dbl>, u4 <dbl>,
#> # u5 <dbl>, u6 <dbl>, u7 <dbl>, u8 <dbl>, u9 <dbl>, u10 <dbl>, qbar <dbl>,
#> # ubar <dbl>, b <dbl>, t <dbl>, df <dbl>, r <dbl>, fmi <dbl>, se.fit <dbl>
Created on 2024-06-10 with reprex v2.1.0
Since predict() is pretty well adopted and implemented, I assume this might work more generally, i.e., for any object for which there is a predict method. Right?
Would there be value in using broom::augment for this feature?
Yes; should work universally. Don't know if broom::augment() would apply to all of those.
Dear @stefvanbuuren, thank you for the example code for model prediction using imputed data. In that example, what you actually calculated were the "fitted values", not the "predicted values", because you didn't feed an independent test dataset into the predict() function.
If there is a separate test dataset without any missing values, assumed to be collected in the future,
b_small_test <- brandsma %>%
  filter(complete.cases(.)) %>%   # this line preserves all the complete records
  slice_sample(n = 20) %>%
  select(ses, iqv, iqp)
we can just add newdata = b_small_test into predict(), and the remaining part works perfectly.
# obtain predictions Q and prediction variance U
predm <- lapply(getfit(fit), predict, newdata = b_small_test, se.fit = TRUE)
However, more realistically, if the test dataset also includes missing values,
b_small_test_with_na <- mice::ampute(b_small_test)$amp
then newdata = b_small_test_with_na will lead to the predictions also being missing values. How can the example code be modified to handle this situation?
I think we can use the imputed dataset from the model training phase to fill in missing values in the test set.
imp_test <- mice.mids(imp, newdata = b_small_test_with_na, maxit = 5)
Unfortunately, imp_test is still a multiply imputed object with m = 10 imputed results, and I cannot figure out how to accommodate it into lapply(getfit(fit), predict, se.fit = TRUE).
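One possible way forward, a sketch under the assumption that imputation m of the test data can simply be paired with analysis m of the training fits, and that the completed test rows can be extracted from imp_test with complete():
# pair completed test set m with fitted model m, then pool as before
test_sets <- complete(imp_test, "all")   # may need subsetting to the test rows only
predm <- Map(function(model, newdat) predict(model, newdata = newdat, se.fit = TRUE),
             getfit(fit), test_sets)
Q <- sapply(predm, `[[`, "fit")
U <- sapply(predm, `[[`, "se.fit")^2
# ... continue with pool.scalar() as in the example above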
When using mice multiple imputation to generate multiple imputed sets, then running a model on each one of them (like elastic net from the glmnet package) and pooling the estimates into a single set, it's not possible to use predict.glmnet() (or predict()) on the object that comes out of mice::pool(). The prediction functions expect a fitted model object, while mice::pool() returns a pooled (mipo) object. Is it possible to add the ability to make predictions from the return object of mice::pool()?