amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

imputed values, when row has all NAs #499

Closed donaldRwilliams closed 2 years ago

donaldRwilliams commented 2 years ago

Hi, I am trying to wrap my head around what mice is doing when an entire row is NA but its values are imputed anyway.

Here is an example

library(mice)

dat <- nhanes

dat[2,] <- NA

imps <- complete(mice(dat), action = "long")

subset(imps, .imp == 1)[2, ]

which returns 1 2 1 35.3 1 199.

I am curious how the imputation is being done, and how I can stop it from imputing for those rows.

Thanks !

thomvolker commented 2 years ago

Dear Donald,

By default, mice first replaces each NA in the data with a random draw from the observed values of the corresponding variable (the so-called starting values). This is done so that the models used to generate imputations do not fail, which they would if the predictors contained NAs. If a row consists exclusively of NAs, each of its cells is replaced with a randomly sampled observed value from its variable. Subsequently, these values are updated in every iteration, just as they are when only some of an observation's values are missing.
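The starting-value step described above can be sketched in base R. This is only an illustration of the idea, not mice's internal code: the toy data frame and the helper `fill_start()` are hypothetical names made up for this example.

```r
# Illustration only: mimics the idea of starting values, NOT mice's internals.
set.seed(1)

# Toy data (hypothetical); row 3 is entirely missing.
df <- data.frame(
  x = c(1, 2, NA, 4),
  y = c(10, NA, NA, 40)
)

# Hypothetical helper: replace each NA in a column with a random draw
# from the observed values of that same column.
fill_start <- function(col) {
  miss <- is.na(col)
  obs <- col[!miss]
  # Sample indices rather than values, so a single observed value
  # is not misread by sample() as the range 1:value.
  col[miss] <- obs[sample.int(length(obs), sum(miss), replace = TRUE)]
  col
}

start <- as.data.frame(lapply(df, fill_start))

anyNA(start)  # FALSE: every cell now holds a value
start[3, ]    # the fully missing row was filled like any other row
```

Because the draw is made per cell, a fully missing row is treated no differently from a partially missing one, which is why it ends up with imputed values.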

If you want to prevent this from happening, you can specify the where argument in mice, such as in the following example.

library(mice) # load mice
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind

df <- nhanes  # make df
df[2, ] <- NA # set second row to NA

where <- make.where(df) # specify where matrix

head(where) # by default, the second row is imputed
#>     age   bmi   hyp   chl
#> 1 FALSE  TRUE  TRUE  TRUE
#> 2  TRUE  TRUE  TRUE  TRUE
#> 3 FALSE  TRUE FALSE FALSE
#> 4 FALSE  TRUE  TRUE  TRUE
#> 5 FALSE FALSE FALSE FALSE
#> 6 FALSE  TRUE  TRUE FALSE

where[rowSums(where) == ncol(where), ] <- FALSE # do not impute rows that are entirely missing

head(where) # now, the second row won't be imputed
#>     age   bmi   hyp   chl
#> 1 FALSE  TRUE  TRUE  TRUE
#> 2 FALSE FALSE FALSE FALSE
#> 3 FALSE  TRUE FALSE FALSE
#> 4 FALSE  TRUE  TRUE  TRUE
#> 5 FALSE FALSE FALSE FALSE
#> 6 FALSE  TRUE  TRUE FALSE

imp <- mice(df, m = 1, maxit = 1, where = where)
#> 
#>  iter imp variable
#>   1   1  bmi  hyp  chl

head(complete(imp)) # second row is now not imputed
#>   age  bmi hyp chl
#> 1   1 35.3   1 218
#> 2  NA   NA  NA  NA
#> 3   1 30.1   1 187
#> 4   3 27.4   2 204
#> 5   1 20.4   1 113
#> 6   3 22.5   1 184

Created on 2022-09-03 by the reprex package (v2.0.1)
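The masking step itself does not depend on mice, so it can be tried in isolation. The sketch below assumes that the default `make.where(df)` is equivalent to `is.na(df)`; the toy data frame and the name `all_missing` are hypothetical and used only for illustration.

```r
# Toy data (hypothetical); row 2 is entirely missing.
df <- data.frame(
  a = c(1, NA, 3),
  b = c(NA, NA, 6)
)

# Assumption: this matches what make.where(df) returns by default.
where <- is.na(df)

# Flag rows in which every cell is missing ...
all_missing <- rowSums(where) == ncol(where)

# ... and tell the imputation routine to leave those rows alone.
where[all_missing, ] <- FALSE

where
```

Only the missing cell in row 1 remains marked for imputation. The resulting matrix would then be passed as `where = where` in the call to `mice()`, as in the reprex above.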

I hope this helps, but let us know if you have any further questions or concerns.

Best, Thom

stefvanbuuren commented 2 years ago

Thanks, Thom, for answering.