amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Impute then Transform - Using residuals as new variables for further pooled analysis #298

Closed SunshineCheesesauce closed 3 years ago

SunshineCheesesauce commented 3 years ago

Hello, I have spent a while reading through all the issues and trying to get this to work but can't seem to find an answer. First, thank you for all your support and this package.

Background: I am assessing the validity of a new method of measuring resilience which takes the residuals of linear regression models and uses them in further models as outcome and predictor variables.

Problem (shown with a reproducible example using nhanes):

I impute the data set:

imputed <- mice(nhanes, method = meth, predictorMatrix = predM, m=20, maxit = 20)

I then want to find the best fit model for predicting bmi using stepwise selection (I have 15 predictors in my actual dataset):

scope <- list(upper = ~ age + hyp + chl, lower = ~1)
expr <- expression(f1 <- lm(bmi ~ 1), f2 <- step(f1, scope = scope))
fit <- with(imputed, expr)
formulas <- lapply(fit$analyses, formula)
terms <- lapply(formulas, terms)
votes <- unlist(lapply(terms, labels))
table(votes)

I find my final model:

model <- with(imputed, lm(bmi ~ age + hyp + chl))

All fine up to this point. I now try and save the residuals and the predicted bmi based on the model as new variables:

imputed$data$RS1 = NULL
imputed$data$PS1 = NULL

for (i in 1:20) {
  imputed$data$RS1 = residuals(model$analyses[[i]])
  imputed$data$PS1 = predict(model$analyses[[i]])
}

I then want to save my new variable which is the difference between the predicted and actual bmi

imputed$data$new_variable<- imputed$data$PS1 - imputed$data$bmi

The results at this point should in principle be the negative of the residuals, but I get very strange results.

I then want to do further analysis (using additional variables that were also in the original imputation). e.g.

fit1 <- with(imputed, lm(new_variable ~x1 + x2 + x3))

but I get the error: Error in imp[[j]] : subscript out of bounds

I also can't use the complete() function on this once I have added these new variables.

Can you please advise on how I can work around this and also if I am saving the residuals correctly. My dataset is very large and the imputation currently takes over 24 hours so it's difficult for me to keep running mice() to get a workaround. If I passively impute RS1, PS1 and new_variable with them currently being all missing would this work?

Many thanks!

SunshineCheesesauce commented 3 years ago

Also, I should add: the model selected after imputation differs from the model selected in the complete case analysis, so I wanted to derive the new variable after imputation rather than imputing it as just another variable (JAV).

gerkovink commented 3 years ago

I believe the below reprex would suit your purpose.

library(mice)     # Multiple Imputation
library(dplyr)    # Data manipulation
library(tidyr)    # Tidy data
library(magrittr) # Pipes
library(purrr)    # Functional programming - map()
set.seed(123)     # Fix RNG seed

# impute
imp <- mice(nhanes, printFlag = FALSE)

# change completed data and pool analyses
complete(imp, "all") %>% 
  map(~ mutate(., bmipred = lm(bmi ~ hyp + chl + age)$fitted.values) %>% 
        mutate(., diff = bmi - bmipred)) %>% 
  map(lm, formula = diff ~ bmi + bmipred) %>% 
  pool()

#> Class: mipo    m = 5 
#>          term m      estimate         ubar            b            t dfcom
#> 1 (Intercept) 5  2.273737e-15 6.957934e-29 9.047288e-29 1.781468e-28    22
#> 2         bmi 5  1.000000e+00 8.277838e-32 1.016891e-31 2.048053e-31    22
#> 3     bmipred 5 -1.000000e+00 1.797064e-31 1.725633e-31 3.867824e-31    22
#>         df      riv    lambda       fmi
#> 1 4.558938 1.560340 0.6094269 0.7127676
#> 2 4.739556 1.474140 0.5958192 0.7002646
#> 3 5.618065 1.152301 0.5353811 0.6432055

Created on 2020-12-16 by the reprex package (v0.3.0)

If your derived variable should guide imputations, then passive imputation would be needed.
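
For readers unfamiliar with passive imputation, here is a minimal sketch of the general pattern. It does not reproduce the residual-based variable from this thread: the derived variable `bmi_chl` (a simple product of two columns) is a hypothetical stand-in, chosen because passive imputation needs a row-wise deterministic formula. The seed and variable names are assumptions for illustration only.

```r
library(mice)

dat <- nhanes
# Hypothetical derived variable: missing wherever bmi or chl is missing
dat$bmi_chl <- dat$bmi * dat$chl

# Start from the default methods, then make bmi_chl passive
meth <- make.method(dat)
meth["bmi_chl"] <- "~ I(bmi * chl)"

# Do not let the derived variable predict its own components
pred <- make.predictorMatrix(dat)
pred[c("bmi", "chl"), "bmi_chl"] <- 0

imp_passive <- mice(dat, method = meth, predictorMatrix = pred,
                    printFlag = FALSE, seed = 123)
```

With this setup the derived column is recomputed from the current imputations of `bmi` and `chl` in every iteration, so it stays consistent with them while still being available as a predictor for other variables.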

gerkovink commented 3 years ago

Closing as this seems sufficiently addressed in the above reprex

SunshineCheesesauce commented 3 years ago

Hi, thank you for your help, but I am afraid it still isn't working for me. I want to keep using this derived variable in further analyses, but I cannot find a way to store it within the mids object so that it remains available.

gerkovink commented 3 years ago

I believe that the derived variable is independent from the imputation process. Simply inserting your desired analysis at the below location in this pseudo-code pipe would therefore be sufficient:

complete(imp, "all") %>% 
  map(~ mutate(., bmipred = lm(bmi ~ hyp + chl + age)$fitted.values) %>% 
        mutate(., diff = bmi - bmipred)) %>% 
  map(HERE YOUR ANALYSIS) %>% 
  pool()
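
As one possible way to fill in the placeholder above, the sketch below runs a stepwise selection inside each completed dataset, tallies which terms are chosen, and then pools one fixed model. The outcome (`age`), the scope, the seed, and the final pooled formula are illustrative assumptions taken from the nhanes example, not a recommendation for the real data.

```r
library(mice)
library(dplyr)
library(purrr)

imp <- mice(nhanes, printFlag = FALSE, seed = 123)

# Add the derived variables to every completed dataset
completed <- complete(imp, "all") %>%
  map(~ mutate(.x,
               bmipred = as.numeric(fitted(lm(bmi ~ hyp + chl + age, data = .x))),
               diff = bmi - bmipred))

# Stepwise selection within each completed dataset
scope <- list(upper = ~ diff + hyp + chl, lower = ~ 1)
fits <- map(completed, function(d) {
  null_fit <- lm(age ~ 1, data = d)
  step(null_fit, scope = scope, trace = 0)
})

# Count how often each candidate term was selected
votes <- unlist(map(fits, ~ labels(terms(.x))))
vote_table <- table(factor(votes, levels = c("diff", "hyp", "chl")))

# Fit one chosen model to every completed dataset and pool
pooled <- completed %>%
  map(~ lm(age ~ diff + hyp + chl, data = .x)) %>%
  pool()
```

Only the last step is pooled. The stepwise fits can differ between datasets, so they are summarised by votes instead of being passed to `pool()`.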

SunshineCheesesauce commented 3 years ago

Thank you for this. Is there a way to convert a mild object back to a mids? I am using the derived variable in further stepwise model selection (as per the workflow) and not sure how I would do this within the above code.

SunshineCheesesauce commented 3 years ago

For example:

# impute
imp <- mice(nhanes, printFlag = FALSE)

# set scope and expression before analysis
scope <- list(upper = ~ diff + hyp + chl, lower = ~1)
expr <- expression(f1 <- lm(age ~ 1), f2 <- step(f1, scope = scope))

# change completed data and pool analyses
complete(imp, "all") %>%
  map(~ mutate(., bmipred = lm(bmi ~ hyp + chl + age)$fitted.values) %>%
        mutate(., diff = bmi - bmipred)) %>%
  map(~ mutate(., fit = with(imp, expr)) %>%
        mutate(., formulas = lapply(fit$analyses, formula)) %>%
        mutate(., terms = lapply(formulas, terms)) %>%
        mutate(., votes = unlist(lapply(terms, labels))))

This gives the error: Error: Problem with mutate() input fit. x Input fit must be a vector, not a mira/matrix object. i Input fit is with(imp, expr)

gerkovink commented 3 years ago

Have a look at as.mids(). It may suit your purpose.
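
A minimal sketch of the `as.mids()` route, assuming the long format with the original data included as `.imp == 0`. The helper `add_derived()` is hypothetical. It leaves the derived columns entirely missing in the original data, an assumption made here so that every derived value in the rebuilt object comes from the corresponding imputed dataset and not from a complete-case fit. The final analysis formula is illustrative only.

```r
library(mice)

imp <- mice(nhanes, printFlag = FALSE, seed = 123)

# Stacked data: .imp == 0 is the original data, 1..m are the imputations
long <- complete(imp, action = "long", include = TRUE)

# Hypothetical helper: add the derived columns to one stacked block
add_derived <- function(d) {
  if (all(d$.imp == 0)) {
    # Original data: keep the derived columns fully missing
    d$bmipred <- NA_real_
    d$diff <- NA_real_
  } else {
    d$bmipred <- as.numeric(fitted(lm(bmi ~ hyp + chl + age, data = d)))
    d$diff <- d$bmi - d$bmipred
  }
  d
}
long <- do.call(rbind, lapply(split(long, long$.imp), add_derived))

# Rebuild a mids object that now carries bmipred and diff
imp2 <- as.mids(long)

# The usual with() / pool() workflow is available again
fit2 <- with(imp2, lm(chl ~ diff + age))
pooled2 <- pool(fit2)
```

Because `imp2` is a regular mids object, `complete()` and expression-based calls such as `with(imp2, expr)` should work on it as they do on the output of `mice()`.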

SunshineCheesesauce commented 3 years ago

Thank you, and sorry to keep questioning - I can't seem to find a workaround for this. In this section:

# change completed data and pool analyses
NEW <- complete(imp, "all") %>%
  map(~ mutate(., bmipred = lm(bmi ~ hyp + chl + age)$fitted.values) %>%
        mutate(., diff = bmi - bmipred))

NEW is a list. I cannot use as.mids() to convert it back to a mids object because the original data are not included in the "all" output. If I add include = TRUE, an error occurs because of the missing values in 'bmipred'. I would like to modify the completed data as above and then reincorporate it into a mids object for further analyses.

Thanks again

SunshineCheesesauce commented 3 years ago

I have found a way using miceadds::datlist2mids
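
A minimal sketch of what that might look like on the nhanes example, assuming the miceadds package is installed and that `datlist2mids()` accepts the plain list of completed data frames. The seed and the final analysis formula are illustrative assumptions, not the poster's actual code.

```r
library(mice)
library(dplyr)
library(purrr)

imp <- mice(nhanes, printFlag = FALSE, seed = 123)

# List of completed datasets with the derived variables added
completed <- complete(imp, "all") %>%
  map(~ mutate(.x,
               bmipred = as.numeric(fitted(lm(bmi ~ hyp + chl + age, data = .x))),
               diff = bmi - bmipred))

# Convert the list of completed datasets back into a mids object
imp2 <- miceadds::datlist2mids(completed)

# Further analyses can use with() and pool() as usual
fit2 <- with(imp2, lm(chl ~ diff + age))
pooled2 <- pool(fit2)
```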