amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Create new variable after imputation #34

Closed stefvanbuuren closed 7 years ago

stefvanbuuren commented 7 years ago

This is a mail I got from Tobias Rolfes:

Date: 20 May 2017 15:48:53 GMT+5:30 Subject: Mice: Create new variable after imputation

Hello Stef,

Thank you very much for creating such a useful package for multiple imputation.

Currently, I am facing the problem that I want to create a new variable after calculating imputations (e.g., sum scores of items) and calculate regressions with the new variable. However, when I do so (cf. the program code below), the originally missing cases are deleted in the regression because the new variable is still missing for them. Do you have an idea how I can solve the problem?

library(mice)
# Generate Data
A <- c(1, 2, 1, NA, 3, 4, 1, 2, 3)
B <- c(2, 3, 2, 3, 4, 4, 1, 2, NA)
C <- c(3, 4, 2, 3, 4, 4, 1, 3, 4)
Data <- data.frame(A,B,C)
# Imputation
imp <- mice(Data, method = "norm", m = 5, maxit = 1)
# Convert to Long
long <- complete(imp, action='long', include=TRUE)
# Generate new variable
long$newvar <- long$B
# Convert back to Mids
imput.short <- as.mids(long)
# Calculate Regression
RegModell0 <- with(imput.short,lm(C ~ A + newvar))
summary(RegModell0)

Many thanks in advance for your answer.

Best, Tobias

stefvanbuuren commented 7 years ago

Hi Tobias, thanks for your mail.

The as.mids() function calls the mice() function to get an initial mids object, which is then post-processed by as.mids(). Your problem is caused by the default behaviour of mice, which removes any collinear variables at start-up. As a result, ini$imp$newvar gets a NULL value, and any imputations it had are lost.

As this behaviour is confusing in the context of as.mids(), I will make some changes to the as.mids() and check.data() functions that will allow us to bypass removal of collinear variables at startup.

stefvanbuuren commented 7 years ago

You should be able to run your code in mice 2.37.
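For later readers, here is a minimal sketch of the long-format workflow from Tobias's mail, assuming a mice version from 2.37 onward. The variable name sumAB and the seed are illustrative and not part of the original mail.

```r
library(mice)

# Tobias's toy data from the mail above
Data <- data.frame(
  A = c(1, 2, 1, NA, 3, 4, 1, 2, 3),
  B = c(2, 3, 2, 3, 4, 4, 1, 2, NA),
  C = c(3, 4, 2, 3, 4, 4, 1, 3, 4)
)

# Impute (seed and printFlag are illustrative choices)
imp <- mice(Data, method = "norm", m = 5, maxit = 1,
            seed = 1, printFlag = FALSE)

# Stack the original data (.imp == 0) and the completed data sets
long <- complete(imp, action = "long", include = TRUE)

# Derive the new variable; it stays NA only in the .imp == 0 rows
long$sumAB <- long$A + long$B

# Convert back to a mids object and analyse
imp_new <- as.mids(long)
fit <- with(imp_new, lm(C ~ sumAB))
pooled <- pool(fit)
summary(pooled)
```

Keeping include = TRUE matters here, because as.mids() needs the original incomplete data to work out which cells were imputed.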

AlistairTurvill commented 6 years ago

Hi Stef,

To echo Tobias’s original post, thanks very much for creating MICE. I am still developing my R skills (so please excuse any stupid questions) and am using your package to carry out MI on a large repeated-measures data set looking at outcomes for chronic pain patients at a UK hospital.

All of the (108) variables have missing data at various points, and so it presents challenges that Tobias may not have faced…

I am able to run the MI process successfully on the raw data using:

View(slimmed_down_working_data_file_for_R)
library(mice)
library(VIM)
library(lattice)
library(ggplot2)
options(max.print = 100000)
md.pattern(slimmed_down_working_data_file_for_R)
imputation <-
 mice(slimmed_down_working_data_file_for_R,m=1,maxit=50,meth='pmm',seed=500)

But then run in to difficulty…

I need to create a number of subscales from the MICE output that can then be used for further pooled analysis.

I understand that I cannot do this directly from the pooled data set (the MICE output) and instead need to create one or more R objects that the necessary functions can manipulate (e.g. Tobias converted his to long format).

However, because I applied MICE to a large data set that I loaded and ‘viewed’, rather than generating the data in the console as Tobias did, my raw variables do not exist as individual R objects that I could use to compute a subscale score.

In short:

Is it possible to calculate the mean score for a group of variables from the MICE output, and then have this score created as a new variable? If so, what is the best way to approach this?

If I manage this successfully, do I then need to convert back to a MIDS object (using the ‘imput.short <- as.mids’) before carrying out any further analysis?

Many thanks in advance, Alistair

gerkovink commented 6 years ago

Hi Alistair,

The issue you raise is straightforward to solve with passive imputation in mice. Please see the following example code based on the nhanes example data set from package mice:

set.seed(123)
new <- NA
nhanes3 <- cbind(nhanes, new)
ini <- mice(nhanes3, maxit = 0)
meth <- ini$meth
#set new to passive imputation 
meth["new"] <- "~ I(bmi + chl)"
imp <- mice(nhanes3, meth=meth)

This example calculates new as the sum of bmi and chl. If new is visited last in each iteration, passive imputation ensures that it always equals the sum of the observed and/or imputed bmi and chl values. This solution yields exactly what you are looking for.

> head(complete(imp))
  age  bmi hyp chl   new
1   1 26.3   1 118 144.3
2   2 22.7   1 187 209.7
3   1 30.1   1 187 217.1
4   3 25.5   2 204 229.5
5   1 20.4   1 113 133.4
6   3 22.7   1 184 206.7

If you'd like to know more about the specifics and caveats of passive imputation, please have a look at the corresponding vignette in the miceVignettes repository.
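For the mean-score case Alistair asked about, a hedged sketch along the same lines follows. The simulated items (item1 to item3) and the name score are hypothetical stand-ins for real questionnaire data. Two choices here are assumptions rather than part of the example above: the score column is pre-computed from the observed items instead of being all NA, and its column in the predictor matrix is set to zero so that the derived score is never used to impute its own components.

```r
library(mice)

# Hypothetical questionnaire items driven by one latent trait
set.seed(42)
n <- 60
trait <- rnorm(n)
items <- data.frame(
  item1 = trait + rnorm(n),
  item2 = trait + rnorm(n),
  item3 = trait + rnorm(n)
)
items$item1[sample(n, 8)] <- NA
items$item2[sample(n, 8)] <- NA

# Pre-compute the score; it is NA wherever an item is missing
items$score <- (items$item1 + items$item2 + items$item3) / 3

# Passive imputation: score is recomputed from the current item values
meth <- make.method(items)
meth["score"] <- "~ I((item1 + item2 + item3) / 3)"

# Do not let the derived score predict any other variable
pred <- make.predictorMatrix(items)
pred[, "score"] <- 0

# score is the last column, so it is visited last in each iteration
imp <- mice(items, method = meth, predictorMatrix = pred,
            m = 2, maxit = 3, seed = 123, printFlag = FALSE)

done <- complete(imp, 1)
all.equal(as.numeric(done$score),
          (done$item1 + done$item2 + done$item3) / 3)
```

The resulting object is already a mids object, so it can go straight into with() and pool() without a round trip through as.mids().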

All the best,

Gerko

AlistairTurvill commented 6 years ago

Hi Gerko,

Thanks very much for your help, much appreciated! I have tried to implement your code and have had some success. However, I have encountered 25 warnings:

In Ops.factor(Rsf1, Rsf2) : ‘+’ not meaningful for factors

do you know why this is?

many thanks, Alistair

gerkovink commented 6 years ago

Yes, some of the variables you are trying to add together are factors, i.e. categorical variables where categories (labels) are represented by values. R is simply warning you that adding categorical variables may not be what you desire to prevent you from making an accidental error.

All the best,

Gerko
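To illustrate the warning in base R: arithmetic on factors returns NA and raises exactly this message. The sketch below uses a made-up factor f and shows one common way to recover the numeric values, assuming the factor labels really are numbers. Whether that conversion is appropriate for Rsf1 and Rsf2 depends on how those items were coded.

```r
# A made-up factor whose labels happen to be numbers
f <- factor(c("10", "20", "10", "30"))

# Adding factors yields NA plus the "'+' not meaningful" warning
bad <- suppressWarnings(f + f)

# as.integer() returns the internal level codes, not the labels
codes <- as.integer(f)

# Going through character recovers the labelled values
values <- as.numeric(as.character(f))

print(bad)
print(codes)
print(values)
```

If the items are meant to be numeric, converting them before calling mice() avoids the warning in the passive formula.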

MengTingLo commented 6 years ago

Hello Stef and Gerko, I have some questions regarding passive imputation. I would like to create a sum score (variable = New) after the imputation. I used the code Gerko suggested above and modified it.

I added an ID variable to nhanes and created a sum score based on chl and hyp.

nhanes$ID <- seq.int(nrow(nhanes))
New <- NA
Data_new<- cbind(nhanes, New)
ini <- mice(Data_new, max = 0, method = c('','pmm','pmm','pmm','',''))
meth <- ini$meth
meth["New"] <- "~I(chl+hyp)"

Then, I modified the predictor matrix because ID and New variable (sum score of chl+hyp) should not be predicted by other variables and should not be the predictor of other variables.

pred <- ini$predictorMatrix
pred[, "ID"] <- 0
pred["ID",] <- 0
pred["New",] <- 0
pred
    age bmi hyp chl ID New
age   0   1   1   1  0   0
bmi   1   0   1   1  0   0
hyp   1   1   0   1  0   0
chl   1   1   1   0  0   0
ID    0   0   0   0  0   0
New   0   0   0   0  0   0

Then, I run mice.

Test <- mice(Data_new, 
            meth = meth, 
            pred = pred,
            m=5,
            maxit=5,            
            diagnostics=TRUE,
            seed = 123456)
head(complete(Test))

I got a strange result for the "New" variable (New=chl+hyp).

  age  bmi hyp chl ID         New
1   1 29.6   1 187  1 -0.18195807
2   2 22.7   1 187  2  1.22596857
3   1 27.2   1 187  3 -1.55123662
4   3 27.5   1 186  4  0.41716072
5   1 20.4   1 113  5  0.85837692
6   3 20.4   2 184  6 -0.07179878

After I removed this line of code: pred["New",] <- 0, the result seems to be reasonable.

pred <- ini$predictorMatrix
pred[, "ID"] <- 0
pred["ID",] <- 0
#pred["New",] <- 0
PredictorMatrix:
    age bmi hyp chl ID New
age   0   1   1   1  0   0
bmi   1   0   1   1  0   0
hyp   1   1   0   1  0   0
chl   1   1   1   0  0   0
ID    0   0   0   0  0   0
New   1   1   1   1  0   0
> head(complete(Test2))
  age  bmi hyp chl ID New
1   1 29.6   1 187  1 188
2   2 22.7   1 187  2 188
3   1 27.2   1 187  3 188
4   3 27.5   1 186  4 187
5   1 20.4   1 113  5 114
6   3 20.4   2 184  6 186

Here are my questions:

  1. Age contains no missing data, so I thought mice would set all values in the age row of the predictor matrix to 0, but it did not. I am not sure whether that only happens on my computer.

  2. After I removed the line pred["New",] <- 0, the imputation seems to work well. However, the predictor matrix row for "New" does not reflect its actual imputation model. Would that be a problem?
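One way to look into both questions is to inspect what mice() actually set up. This is only a sketch on the plain nhanes data and does not settle how a particular mice version treats the predictor matrix. It relies on the documented convention that a variable with the empty method "" is not imputed.

```r
library(mice)

# Dry run: set up the imputation model without iterating
ini <- mice(nhanes, maxit = 0, printFlag = FALSE)

# Complete variables such as age get the empty method ""
ini$method

# Rows are the variables to be imputed, columns are their predictors
ini$predictorMatrix

# Order in which variables are visited within an iteration
ini$visitSequence

# Records variables that mice dropped or altered during set-up, if any
ini$loggedEvents
```

Running the same inspection on the object returned by the real mice() call shows which method and predictors were in effect for each variable.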

SebVen commented 5 years ago

Hi Gerko, thanks for the useful explanation of passive imputation with mice. Could I add a question to this: would passive imputation also be applicable to change scores (i.e. outcome - baseline)? I can imagine a problem with this, as we would be assuming a correlation between the dependent and independent variable. However, perhaps I'm interpreting this issue incorrectly in the context of imputation. Any opinion from yourself or Stef on this would be very welcome. Thanks, Sebastian
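As a sketch only, and not an answer to the methodological question: one option that keeps the change score out of the imputation model altogether is to impute the two measurements and compute the difference at analysis time inside with(). In the code below chl and bmi from nhanes are stand-ins for an outcome and a baseline measurement. They are not real repeated measures.

```r
library(mice)

# Impute the component variables only
imp <- mice(nhanes, m = 5, maxit = 5, seed = 123, printFlag = FALSE)

# Compute the difference within each completed data set at analysis time
# (chl and bmi stand in for outcome and baseline)
fit <- with(imp, lm(I(chl - bmi) ~ age))

# Pool the estimates across the imputed data sets
pooled <- pool(fit)
summary(pooled)
```

Because the difference is formed from completed data inside each analysis, no derived variable has to be stored in the mids object.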

stefvanbuuren commented 4 years ago

Can’t find it on github…

From: sophar notifications@github.com Reply-To: stefvanbuuren/mice reply@reply.github.com state_change@noreply.github.com Subject: Re: [stefvanbuuren/mice] Create new variable after imputation (#34)

When I'm trying to generate the long dataset to create new variables after imputation, I get the following error message:

# Convert to Long
long <- mice::complete(df2, "long",include = TRUE)
Error: Column "pCare_doc" can't be converted from logical to numeric

I'm sorry, I did not manage to create a reproducible example for this, it just happens with my (large) dataset. But maybe you still have an idea what this could be? So pCare_doc is a logical variable, but why should it be converted?

stefvanbuuren commented 4 years ago

Unable to replicate. mice() converts logicals into 0/1 variables, but the following runs fine.

library(mice)

data <- data.frame(nhanes2, 
                   flags = rep(c(TRUE, FALSE, FALSE, NA, TRUE), 5))
imp <- mice(data, m = 1, print = FALSE)
long <- mice::complete(imp, "long", include = TRUE)
str(long)
imp2 <- as.mids(long)
imp2

# force logical
long2 <- long
long2$flags <- as.logical(long2$flags)
str(long2)
imp3 <- as.mids(long2)

sophar commented 4 years ago

Hello Stef, thanks a lot for your help, really appreciated. I deleted my question when I realized that the error is related to the automatic conversion of logicals (which I did not know before). So I've made a workaround to convert all logicals to factors before mice and complete and converting them back from factor to logical afterwards.

library(mice)

data <- data.frame(nhanes2, 
                   flags = rep(c(TRUE, FALSE, FALSE, NA, TRUE), 5))
data$flags <- factor(data$flags)
imp <- mice(data, m = 1, print = FALSE)
long <- mice::complete(imp, "long", include = TRUE)
long$flags <- as.logical(long$flags)
imp2 <- as.mids(long)
imp2

However, I'm not sure I understood your solution, as the error occurs when using complete.
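The round trip in this workaround can be generalised to every logical column. The sketch below is base R only, with a made-up data frame df. The mice() and complete() calls would go where the placeholder comment sits.

```r
# Made-up data with one logical column
df <- data.frame(
  score = c(1.5, NA, 3.2, 4.1),
  flag  = c(TRUE, FALSE, NA, TRUE)
)

# Names of all logical columns
lgl_cols <- names(df)[vapply(df, is.logical, logical(1))]

# Logical -> factor before imputation
df_fac <- df
df_fac[lgl_cols] <- lapply(df_fac[lgl_cols], factor,
                           levels = c(FALSE, TRUE))

# ... mice(), complete(..., "long", include = TRUE) would run here ...

# Factor -> logical afterwards; as.logical() uses the factor labels
df_back <- df_fac
df_back[lgl_cols] <- lapply(df_back[lgl_cols], as.logical)

str(df_back)
```

Fixing the levels to c(FALSE, TRUE) keeps both levels present even when a column happens to contain only one of them.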

anamgreco commented 4 years ago

Hello,

I am trying to do passive imputation as in @gerkovink's example, but in my case I need to use the ifelse() function. That is, instead of calculating the new variable as the sum of two existing variables, I need the new variable to be "0" if another existing variable is "0" and "1" otherwise. Is it possible to do this? If so, how?

Thanks a lot!

stefvanbuuren commented 4 years ago

Yes sure. The I() is just a function, so you can replace it by something else, e.g.

library(mice)
meth <- make.method(nhanes)
meth["bmi"] <- "~ ifelse(age == 1, 0, 1)"
imp <- mice(nhanes, method = meth, m = 1, maxit = 1, seed = 1)
head(complete(imp))

anamgreco commented 4 years ago

Hello, again and thank you!

I tried this but it keeps the new variable as NA.

library(mice)
x1 <- c(0, 2, 1, NA, 0, 3, 1, 0, 0)
x2 <- c(1, 3, 2, 3, 4, 4, 1, 2, NA)
data <- data.frame(x1, x2)
data$new <- NA
meth <- make.method(data)
meth["new"] <- " ~ ifelse(x1 == 0, 0, 1)"
imp <- mice(data, method = meth, m = 1, maxit = 1, seed = 1)
head(complete(imp))

This is my output:

> head(complete(imp))
  x1 x2 new
1  0  1  NA
2  2  3  NA
3  1  2  NA
4  3  3  NA
5  0  4  NA
6  3  4  NA

Maybe I am typing something wrong? I am sorry! THANKS ONCE MORE!
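An alternative route, sketched here without diagnosing why the passive version returns NA, is the long-format workflow from the start of this thread: impute x1 and x2 first, derive the indicator afterwards, and convert back with as.mids(). The data are the same as in the post above. The printFlag setting and the object name imp_new are illustrative.

```r
library(mice)

x1 <- c(0, 2, 1, NA, 0, 3, 1, 0, 0)
x2 <- c(1, 3, 2, 3, 4, 4, 1, 2, NA)
data <- data.frame(x1, x2)

# Impute the two existing variables only
imp <- mice(data, m = 1, maxit = 1, seed = 1, printFlag = FALSE)

# Stack original and completed data, then derive the indicator
long <- complete(imp, action = "long", include = TRUE)
long$new <- ifelse(long$x1 == 0, 0, 1)

# Back to a mids object that now carries the new variable
imp_new <- as.mids(long)
head(complete(imp_new))
```

In the completed data the indicator is computed from the imputed x1 values, so it has no missing entries.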