amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Conditional imputation (ifdo) related issue #548

Closed yningvu closed 1 year ago

yningvu commented 1 year ago

Hi Stef, I ran into a problem with post-processing (the post argument) when imputing some downstream variables in my dataset.

I have a data frame, my_dt, which is an N x 40 data frame with some missingness, and I would like to impute a variable called "employed" and its downstream variables, e.g. "workinghours", "workingdays" etc.

my_dt = df( ... , employed = c(0, NA, 1, 0, NA, 0, NA, 1,1, NA), workinghours = c(0, NA, NA, 0, NA, 0, NA, 22, 30, NA), ... )

The requirement is that if the imputed "employed" is 0, all the downstream variables should be set to 0 as well (an unemployed person should have 0 workinghours, workingdays, etc.).

I have looked at your discussions in related issues and tried the code from #43, #125, and #258, among others. I am setting post like this:

post["workinghours"] <- "imp[[j]][imp[[j]]$data$employed[!r[,j]]==0, i] <- 0"

It ran without any error, but it did not work as expected: there are non-zero imputations of "workinghours" for observations with "employed"==0.

I'm wondering how to fix it. Thank you in advance for your time.

gerkovink commented 1 year ago

Any chance that the visiting sequence may be the culprit here? i.e. that workinghours is updated before employed is imputed.
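One way to check this, sketched on the built-in nhanes data rather than on your data (hyp and chl are nhanes variables standing in for an upstream and a downstream variable): a dry run with maxit = 0 returns the visit sequence mice would use, and a custom order can be passed by variable name.

```r
library(mice)

# Dry run: no iterations, only the setup of the imputation model
ini <- mice(nhanes, maxit = 0, printFlag = FALSE)
vis_default <- ini$visitSequence
print(vis_default)

# The upstream variable (hyp here) should come before the downstream one (chl here)
print(match("hyp", vis_default) < match("chl", vis_default))

# A custom order can be supplied by variable name
vis_custom <- mice(nhanes, maxit = 0,
                   visitSequence = c("bmi", "hyp", "chl"),
                   printFlag = FALSE)$visitSequence
print(vis_custom)
```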

yningvu commented 1 year ago

Hi Gerko,

Thanks for your quick reply and advice. I tried fully specifying the visitSequence argument, listing (as I understand it) the blocks, i.e. the variables that need to be imputed, in the intended order. Unfortunately, the issue is still there.

gerkovink commented 1 year ago

Can you create a reprex?

yningvu commented 1 year ago

Sure, see below.

library(mice)
#> Warning: package 'mice' was built under R version 4.2.3
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
library(data.table)

my_dt = readRDS("~/testdt.rds")

dummy_col = c('employed')

my_dt[, (dummy_col) := lapply(.SD, factor), .SDcols = dummy_col]

ini <- mice(my_dt,predictorMatrix=quickpred(my_dt,minpuc=0.9,mincor=0.1),maxit=0,seed=101)

post <- ini$post

pred <- ini$pred

## ASSIGN METHODS
method <- ini$meth
method["age"] <- ""

cols_to_pmm= c("edu", "height", "weight", "workinghours")

for(i in cols_to_pmm){
  method[i] <- "pmm"
}

post["workinghours"] <- "imp[[j]][imp[[j]]$data$employed[!r[,j]]==0, i] <- 0"
pred[c("employed"),c("workinghours")] <-0

visit_seq = c("edu","height","weight","employed","workinghours")
imp <- mice(data= my_dt, maxit = 5, predictorMatrix = pred, post=post, method=method, m=2, visitSequence = visit_seq)
#> 
#>  iter imp variable
#>   1   1  edu  height  weight  employed  workinghours
#>   1   2  edu  height  weight  employed  workinghours
#>   2   1  edu  height  weight  employed  workinghours
#>   2   2  edu  height  weight  employed  workinghours
#>   3   1  edu  height  weight  employed  workinghours
#>   3   2  edu  height  weight  employed  workinghours
#>   4   1  edu  height  weight  employed  workinghours
#>   4   2  edu  height  weight  employed  workinghours
#>   5   1  edu  height  weight  employed  workinghours
#>   5   2  edu  height  weight  employed  workinghours

completeimp <- complete(imp)

Created on 2023-04-17 with reprex v2.0.2

I just found that the imp object seems to satisfy the post-processing condition, but the completeimp object does not. Could the problem be in my use of complete()?

thomvolker commented 1 year ago

Could the problem be that complete only sets values to zero that are actually included as 1 (or TRUE, for that matter) in the where() matrix. That is, if you have actually observed data for those variables, whereas the variable employed is NA, these values might not be overwritten (which might be what should happen, if people with an NA for employed have actual values for the downstream variables)?
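A small sketch of that behaviour on the built-in nhanes data, assuming the default where matrix (the object names imp_demo, done and obs are only for illustration): cells that were observed pass through complete() unchanged, and only cells flagged in where receive imputations.

```r
library(mice)

imp_demo <- mice(nhanes, m = 1, maxit = 1, seed = 1, printFlag = FALSE)
done <- complete(imp_demo, 1)

# By default, where flags exactly the missing cells
where_is_missing <- all(imp_demo$where == is.na(nhanes))

# Observed cells are identical before and after complete()
obs <- !is.na(nhanes)
observed_untouched <- all(as.matrix(done)[obs] == as.matrix(nhanes)[obs])

print(c(where_is_missing, observed_untouched))
```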

yningvu commented 1 year ago

> Could the problem be that complete only sets values to zero that are actually included as 1 (or TRUE, for that matter) in the where() matrix. That is, if you have actually observed data for those variables, whereas the variable employed is NA, these values might not be overwritten (which might be what should happen, if people with an NA for employed have actual values for the downstream variables)?

I think I'd already ruled out the possibility of this scenario while cleaning and recoding the data. As you can see, I ran this line of code:

 summary(my_dt[is.na(my_dt$employed),'workinghours'])

Created on 2023-04-17 with reprex v2.0.2

The result is this:

  workinghours
 Min.   : NA
 1st Qu.: NA
 Median : NA
 Mean   :NaN
 3rd Qu.: NA
 Max.   : NA
 NA's   :3669

thomvolker commented 1 year ago

I spent some more time looking into this, and I think that the error is in your specification of the post-processing arguments. As far as I know, there is no data variable within imp, so I'm not sure what you are overwriting exactly. In the toy example below, the following code works, and seems to be close to what you want to do. The ifelse() statement says that if your variable of interest is missing (i.e., r[,j] == 0) and bmi > 30, impute the value 10 for hyp (which is outside the range of the data, to not make things confusing), and otherwise, retain the original imputation.

Note that using ifelse() is safer, because it will not throw an error if there are no imputed values greater than 30 (or equal to 0 in your case), but it is potentially a bit slower.

In your case, I think you could modify it to something like post["workinghours"] <- "imp[[j]][, i] <- ifelse(data[r[,j] == 0, 'employed'] == 0, 0, imp[[j]][,i])"

I hope this solves the problem!

library(mice)
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
post <- make.post(nhanes)
post["hyp"] <- "imp[[j]][, i] <- ifelse(data[r[,j] == 0, 'bmi'] > 30, 10, imp[[j]][,i])"
imp <- mice(nhanes, post = post, seed = 1, printFlag = FALSE)
imp$imp
#> $age
#> [1] 1 2 3 4 5
#> <0 rows> (or 0-length row.names)
#> 
#> $bmi
#>       1    2    3    4    5
#> 1  27.4 35.3 27.2 29.6 24.9
#> 3  28.7 27.2 35.3 30.1 29.6
#> 4  25.5 20.4 22.0 20.4 25.5
#> 6  24.9 21.7 22.7 24.9 27.4
#> 10 28.7 22.7 22.0 20.4 22.7
#> 11 30.1 29.6 35.3 22.7 33.2
#> 12 26.3 27.2 33.2 27.4 22.5
#> 16 27.2 27.2 35.3 22.7 24.9
#> 21 26.3 29.6 27.2 29.6 27.2
#> 
#> $hyp
#>     1  2  3 4  5
#> 1   1 10  1 1  1
#> 4   1  1  1 1  2
#> 6   1  2  2 1  1
#> 10  1  1  1 1  1
#> 11 10  1 10 1 10
#> 12  1  1 10 2  1
#> 16  1  1 10 1  1
#> 21  1  1  1 1  1
#> 
#> $chl
#>      1   2   3   4   5
#> 1  199 284 187 187 238
#> 4  218 118 184 187 238
#> 10 206 187 186 238 187
#> 11 218 206 186 131 187
#> 12 199 199 187 184 187
#> 15 206 186 229 229 206
#> 16 218 184 184 113 238
#> 20 199 218 206 184 187
#> 21 229 204 131 206 187
#> 24 206 238 284 199 218
complete(imp, "all")
#> $`1`
#>    age  bmi hyp chl
#> 1    1 27.4   1 199
#> 2    2 22.7   1 187
#> 3    1 28.7   1 187
#> 4    3 25.5   1 218
#> 5    1 20.4   1 113
#> 6    3 24.9   1 184
#> 7    1 22.5   1 118
#> 8    1 30.1   1 187
#> 9    2 22.0   1 238
#> 10   2 28.7   1 206
#> 11   1 30.1  10 218
#> 12   2 26.3   1 199
#> 13   3 21.7   1 206
#> 14   2 28.7   2 204
#> 15   1 29.6   1 206
#> 16   1 27.2   1 218
#> 17   3 27.2   2 284
#> 18   2 26.3   2 199
#> 19   1 35.3   1 218
#> 20   3 25.5   2 199
#> 21   1 26.3   1 229
#> 22   1 33.2   1 229
#> 23   1 27.5   1 131
#> 24   3 24.9   1 206
#> 25   2 27.4   1 186
#> 
#> $`2`
#>    age  bmi hyp chl
#> 1    1 35.3  10 284
#> 2    2 22.7   1 187
#> 3    1 27.2   1 187
#> 4    3 20.4   1 118
#> 5    1 20.4   1 113
#> 6    3 21.7   2 184
#> 7    1 22.5   1 118
#> 8    1 30.1   1 187
#> 9    2 22.0   1 238
#> 10   2 22.7   1 187
#> 11   1 29.6   1 206
#> 12   2 27.2   1 199
#> 13   3 21.7   1 206
#> 14   2 28.7   2 204
#> 15   1 29.6   1 186
#> 16   1 27.2   1 184
#> 17   3 27.2   2 284
#> 18   2 26.3   2 199
#> 19   1 35.3   1 218
#> 20   3 25.5   2 218
#> 21   1 29.6   1 204
#> 22   1 33.2   1 229
#> 23   1 27.5   1 131
#> 24   3 24.9   1 238
#> 25   2 27.4   1 186
#> 
#> $`3`
#>    age  bmi hyp chl
#> 1    1 27.2   1 187
#> 2    2 22.7   1 187
#> 3    1 35.3   1 187
#> 4    3 22.0   1 184
#> 5    1 20.4   1 113
#> 6    3 22.7   2 184
#> 7    1 22.5   1 118
#> 8    1 30.1   1 187
#> 9    2 22.0   1 238
#> 10   2 22.0   1 186
#> 11   1 35.3  10 186
#> 12   2 33.2  10 187
#> 13   3 21.7   1 206
#> 14   2 28.7   2 204
#> 15   1 29.6   1 229
#> 16   1 35.3  10 184
#> 17   3 27.2   2 284
#> 18   2 26.3   2 199
#> 19   1 35.3   1 218
#> 20   3 25.5   2 206
#> 21   1 27.2   1 131
#> 22   1 33.2   1 229
#> 23   1 27.5   1 131
#> 24   3 24.9   1 284
#> 25   2 27.4   1 186
#> 
#> $`4`
#>    age  bmi hyp chl
#> 1    1 29.6   1 187
#> 2    2 22.7   1 187
#> 3    1 30.1   1 187
#> 4    3 20.4   1 187
#> 5    1 20.4   1 113
#> 6    3 24.9   1 184
#> 7    1 22.5   1 118
#> 8    1 30.1   1 187
#> 9    2 22.0   1 238
#> 10   2 20.4   1 238
#> 11   1 22.7   1 131
#> 12   2 27.4   2 184
#> 13   3 21.7   1 206
#> 14   2 28.7   2 204
#> 15   1 29.6   1 229
#> 16   1 22.7   1 113
#> 17   3 27.2   2 284
#> 18   2 26.3   2 199
#> 19   1 35.3   1 218
#> 20   3 25.5   2 184
#> 21   1 29.6   1 206
#> 22   1 33.2   1 229
#> 23   1 27.5   1 131
#> 24   3 24.9   1 199
#> 25   2 27.4   1 186
#> 
#> $`5`
#>    age  bmi hyp chl
#> 1    1 24.9   1 238
#> 2    2 22.7   1 187
#> 3    1 29.6   1 187
#> 4    3 25.5   2 238
#> 5    1 20.4   1 113
#> 6    3 27.4   1 184
#> 7    1 22.5   1 118
#> 8    1 30.1   1 187
#> 9    2 22.0   1 238
#> 10   2 22.7   1 187
#> 11   1 33.2  10 187
#> 12   2 22.5   1 187
#> 13   3 21.7   1 206
#> 14   2 28.7   2 204
#> 15   1 29.6   1 206
#> 16   1 24.9   1 238
#> 17   3 27.2   2 284
#> 18   2 26.3   2 199
#> 19   1 35.3   1 218
#> 20   3 25.5   2 187
#> 21   1 27.2   1 187
#> 22   1 33.2   1 229
#> 23   1 27.5   1 131
#> 24   3 24.9   1 218
#> 25   2 27.4   1 186
#> 
#> attr(,"class")
#> [1] "mild" "list"

Created on 2023-04-18 with reprex v2.0.2
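To illustrate why the original post string could run without error and still change nothing, here is a base R sketch. The data frame imp_j and the vector r_j are made-up stand-ins for imp[[j]] and r[,j], not objects taken from mice. Because the data frame has no column called data, the selector collapses to NULL, the comparison yields a zero-length logical, and the assignment touches no rows.

```r
# Stand-in for imp[[j]]: one row per missing cell, one column per imputation
# (values and row names are made up for illustration)
imp_j <- data.frame(`1` = c(12, 30, 8), `2` = c(40, 5, 22),
                    row.names = c("2", "5", "7"), check.names = FALSE)
before <- imp_j
r_j <- c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)  # FALSE = missing (hypothetical)

sel <- imp_j$data$employed[!r_j]  # imp_j has no "data" column, so this is NULL
cond <- sel == 0                  # comparing NULL with 0 gives logical(0)
imp_j[cond, 1] <- 0               # zero rows selected, so nothing is assigned

print(is.null(sel))
print(length(cond))
print(all(imp_j == before))
```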

yningvu commented 1 year ago

Thank you for your suggestion! Unfortunately, it did not work on my data. I think my description may have caused some confusion. However, your solution did give me a hint.

If I understood correctly, the code line from you

`post["workinghours"] <- "imp[[j]][, i] <- ifelse(data[r[,j] == 0, 'employed'] == 0, 0, imp[[j]][,i])"`

sets workinghours to 0 for those observations (NA in variable j) that have employed==0, and otherwise keeps the value as imputed. I had actually already done this during data cleaning, before running mice, because employed==0 directly implies workinghours==0 in my setup.

What I actually need is to set the imputed workinghours to 0 for those observations whose imputed employed is 0 in the i-th imputation. I think I have resolved the issue with the following code. It works in two steps. First, it collects the row names of the observations whose imputed employed is 0 in the current i-th imputation. Second, it sets the imputed workinghours of these observations to 0.

`post["workinghours"] <- paste(sep = ";",
                          'idx <- row.names(imp[["employed"]][imp[["employed"]][,i]==0,])',
                          'imp[[j]][idx, i] <- 0')`

complete() now returns the expected imputations. Thank you all for the help! This issue can be closed.
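For anyone landing here later, a minimal sketch of the same two-step idea on the built-in nhanes data. The rule is hypothetical and only for illustration: where the imputed hyp is 2, the imputed chl is overwritten with 0, which is not a meaningful chl value but is easy to spot. The intersect() and length() guards are my additions; they are meant to keep the assignment restricted to rows that actually exist in imp[[j]] and to skip it when nothing matches.

```r
library(mice)

post <- make.post(nhanes)
# Hypothetical rule: imputed hyp == 2 implies imputed chl is set to 0
post["chl"] <- paste(
  'idx <- row.names(imp[["hyp"]])[imp[["hyp"]][, i] == 2]',
  'idx <- intersect(idx, row.names(imp[[j]]))',
  'if (length(idx) > 0) imp[[j]][idx, i] <- 0',
  sep = "; "
)

# hyp is visited before chl, so its current imputation is available to the rule
imp_nh <- mice(nhanes, post = post, m = 2, maxit = 2, seed = 1,
               visitSequence = c("bmi", "hyp", "chl"), printFlag = FALSE)

# Check each completed data set against the rule
rule_holds <- sapply(1:2, function(k) {
  done <- complete(imp_nh, k)
  hyp_imp <- imp_nh$imp$hyp
  rows <- intersect(row.names(hyp_imp)[hyp_imp[, k] == 2],
                    row.names(imp_nh$imp$chl))
  all(done[rows, "chl"] == 0)
})
print(rule_holds)
```

The check relies on hyp being visited before chl within each iteration, which is why the visit sequence is spelled out explicitly.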