amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0
422 stars 106 forks source link

Different results within mice compared to directly using method functions #645

Open jsepin opened 1 month ago

jsepin commented 1 month ago

Describe the bug

I wanted to conduct a simulation study with one missing variable using mice.impute.lasso.select.norm. However, this method consistently failed to select the correct variables when used within the mice, despite working correctly when used directly. To address this issue, I created my own simplified wrapper function (called it short_mice), which should be sufficient for this situation. I then compared its performance to the mice wrapper through a short simulation using several methods (not justlasso.select.norm) and observed quite different results.

To Reproduce

Nsim = 10 and m = 5 should be increased!

rm(list = ls())
require(tidyverse)
require(mice)
require(mvtnorm)

#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# Data generation ----
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

dgp <- function(n=50, p = 100, b_trt = 5, b_1 = 2, sig = 1.5, perc_missing = 0.3, miss_form = "~VOI*X1"){
  X <- mvtnorm::rmvnorm(n= n, sigma = diag(p))
  x1 <- X[,1]
  colnames(X) <- paste0("X", 1:ncol(X))
  noise <- rnorm(n=n)
  VOI <- rep(c(0,1),ceiling(length(x1)/2) )[1:length(x1)]
  y <- 0 + b_1*x1 + b_trt*VOI + noise*sig
  data <- data.frame(y,VOI,X)

  # Missing generation (MAR) only in the outcome y
  m_mis <- glm(family = "binomial",
               data = data,
               formula = formula(paste0("I(y < quantile(data$y,perc_missing))",miss_form))
               )
  mi <- predict(m_mis, newdata = data, type = "response")# 1 = Missing
  mi_bin <- rbinom(n = nrow(data), size = 1, prob = mi)
  amp <- data
  amp$y[mi_bin==1] <- NA 
  return(list("data" = data, "amp" = amp))
}

#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# Shortcut mice ----
# can only impute one missing variable
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

# data = data$amp
# m = 3
# method = "norm"

short_mice <- function(data, m, method,printFlag=FALSE, ...){

  state <- list(it = 0, im = 0, dep = "", meth = "", log = FALSE)
  loggedEvents <- data.frame(it = 0, im = 0, dep = "", meth = "", out = "")

  y <- data$y
  x <- as.matrix(data %>% dplyr::select(-y))
  ry <- !is.na(y) 
  wy = NULL

  mi <- lapply(0:m,function(i){
    runif(1)
    if(i==0){
      mi <- data.frame(.imp = i, .id = 1:nrow(x) , y=y, x)
    }else{
      eval(parse(
        text = paste0("ymis <- mice.impute.",method, "(y,ry,x,wy, ...)")
      ))
      mi <- data.frame(.imp = i, .id = 1:nrow(x) , y=y, x)
      mi$y[!ry] <- ymis
    }
    return(mi)
  })

  mi <- do.call("rbind", mi) %>% as.mids()
  return(mi)
}

#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# Run small simulation ----
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

# run with higher numbers!
Nsim = 10 #100 ## INCREASE!!
m = 5 # 30 ## INCREASE!!
maxit = 1
analysis_form = "y~VOI"
method = c("lasso.select.norm", "norm")
wrapper = c("mice", "short_mice")

method_fact <- expand.grid(method = c("norm","lasso.select.norm","lasso.norm","rf")
                          ,wrapper = c("mice","short_mice")
                          ,stringsAsFactors = FALSE
                          )

set.seed(32325329)
res <- lapply(1:Nsim,function(s){
  # print(s)
  data <- dgp()
  res <- lapply(1:nrow(method_fact), function(i){

    eval(parse(
      text = paste0("mice_m <- ",method_fact$wrapper[i]
                    ,"(data = data$amp, m = m, maxit = maxit, method = '"
                    ,method_fact$method[i]
                    ,"',ls.meth = 'ridge', printFlag = FALSE)"
                   )
              ))
    imp_info <- mice_m %>% with(data = ., lm(formula = formula(analysis_form) )) %>% pool()
    mice_res <- imp_info %>% summary() %>%
      subset(., term=="VOI", c("estimate","std.error")) %>%
      data.frame()

    return(mice_res)

  })
  res <- do.call("rbind", res)

  # complete case
  mod <- lm(data = data$amp, formula(analysis_form))
  cc_res <- summary(mod)$coef["VOI",1:2]
  cc_res <- data.frame(t(cc_res), "cc", NA)
  names(cc_res) <- c("estimate","std.error", "method", "wrapper")

  # imputing true model
  mod <- lm(data = data$data, formula(analysis_form))
  tc_res <- summary(mod)$coef["VOI",1:2]
  tc_res <- data.frame(t(tc_res) )
  names(tc_res) <- c("true_estimate","true_std.error")

  # combine
  res <- cbind(rbind(cbind(res,method_fact),cc_res), tc_res)
  res$id <- s

  # return
  return(res)
})
res <- do.call("rbind", res)

#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
# Visualization ----
#%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

res %>%
  ggplot(aes( y =paste0(method,"+",wrapper )
              ,x = (estimate -true_estimate)/true_estimate
              ,fill = method
  ) )+
  geom_boxplot()+
  geom_point()+
  theme_bw()+
  geom_vline(xintercept = 0)

Created on 2024-05-28 with reprex v2.1.0

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.3.2 (2023-10-31 ucrt) #> os Windows 10 x64 (build 19045) #> system x86_64, mingw32 #> ui RTerm #> language (EN) #> collate German_Switzerland.utf8 #> ctype German_Switzerland.utf8 #> tz Europe/Zurich #> date 2024-05-28 #> pandoc 3.1.1 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> backports 1.4.1 2021-12-13 [1] CRAN (R 4.3.1) #> boot 1.3-28.1 2022-11-22 [2] CRAN (R 4.3.2) #> broom 1.0.5 2023-06-09 [1] CRAN (R 4.3.2) #> cli 3.6.1 2023-03-23 [1] CRAN (R 4.3.2) #> codetools 0.2-19 2023-02-01 [2] CRAN (R 4.3.2) #> colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.3.2) #> curl 5.1.0 2023-10-02 [1] CRAN (R 4.3.2) #> digest 0.6.35 2024-03-11 [1] CRAN (R 4.3.3) #> dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.3.2) #> evaluate 0.23 2023-11-01 [1] CRAN (R 4.3.2) #> fansi 1.0.6 2023-12-08 [1] CRAN (R 4.3.3) #> farver 2.1.1 2022-07-06 [1] CRAN (R 4.3.2) #> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.3.2) #> forcats * 1.0.0 2023-01-29 [1] CRAN (R 4.3.2) #> foreach 1.5.2 2022-02-02 [1] CRAN (R 4.3.2) #> fs 1.6.3 2023-07-20 [1] CRAN (R 4.3.2) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.3.2) #> ggplot2 * 3.5.0 2024-02-23 [1] CRAN (R 4.3.3) #> glmnet 4.1-8 2023-08-22 [1] CRAN (R 4.3.2) #> glue 1.7.0 2024-01-09 [1] CRAN (R 4.3.3) #> gtable 0.3.4 2023-08-21 [1] CRAN (R 4.3.2) #> highr 0.10 2022-12-22 [1] CRAN (R 4.3.2) #> hms 1.1.3 2023-03-21 [1] CRAN (R 4.3.2) #> htmltools 0.5.7 2023-11-03 [1] CRAN (R 4.3.2) #> iterators 1.0.14 2022-02-05 [1] CRAN (R 4.3.2) #> jomo 2.7-6 2023-04-15 [1] CRAN (R 4.3.2) #> knitr 1.45 2023-10-30 [1] CRAN (R 4.3.2) #> labeling 0.4.3 2023-08-29 [1] CRAN (R 4.3.1) #> lattice 0.21-9 2023-10-01 [2] CRAN (R 4.3.2) #> lifecycle 1.0.4 2023-11-07 [1] CRAN (R 4.3.3) #> lme4 1.1-35.1 2023-11-05 [1] CRAN (R 4.3.2) #> lubridate * 1.9.3 2023-09-27 [1] CRAN (R 4.3.2) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.3.2) #> MASS 7.3-60 2023-05-04 [2] CRAN (R 4.3.2) #> Matrix 1.6-1.1 2023-09-18 [2] CRAN (R 4.3.2) #> mice * 3.16.0 2023-06-05 [1] CRAN (R 4.3.2) #> minqa 1.2.6 2023-09-11 [1] CRAN (R 4.3.2) #> mitml 0.4-5 2023-03-08 [1] CRAN (R 4.3.2) #> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.3.2) #> mvtnorm * 1.2-3 2023-08-25 [1] CRAN (R 4.3.2) #> nlme 3.1-163 2023-08-09 [2] CRAN (R 4.3.2) #> nloptr 2.0.3 2022-05-26 [1] CRAN (R 4.3.2) #> nnet 7.3-19 2023-05-03 [2] CRAN (R 4.3.2) #> pan 1.9 2023-08-21 [1] CRAN (R 4.3.2) #> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.3.2) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.3.2) #> purrr * 1.0.2 2023-08-10 [1] CRAN (R 4.3.2) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.3.3) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.3.3) #> R.oo 1.26.0 2024-01-24 [1] CRAN (R 4.3.3) #> R.utils 2.12.3 2023-11-18 [1] CRAN (R 4.3.3) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.3.2) #> ranger 0.16.0 2023-11-12 [1] CRAN (R 4.3.2) #> Rcpp 1.0.12 2024-01-09 [1] CRAN (R 4.3.3) #> readr * 2.1.4 2023-02-10 [1] CRAN (R 4.3.2) #> reprex 2.1.0 2024-01-11 [1] CRAN (R 4.3.3) #> rlang 1.1.2 2023-11-04 [1] CRAN (R 4.3.2) #> rmarkdown 2.25 2023-09-18 [1] CRAN (R 4.3.2) #> rpart 4.1.21 2023-10-09 [2] CRAN (R 4.3.2) #> rstudioapi 0.15.0 2023-07-07 [1] CRAN (R 4.3.2) #> scales 1.3.0 2023-11-28 [1] CRAN (R 4.3.3) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.3.2) #> shape 1.4.6.1 2024-02-23 [1] CRAN (R 4.3.2) #> stringi 1.8.3 2023-12-11 [1] CRAN (R 4.3.2) #> stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.3.2) #> styler 1.10.3 2024-04-07 [1] CRAN (R 4.3.3) #> survival 3.5-7 2023-08-14 [2] CRAN (R 4.3.2) #> tibble * 3.2.1 2023-03-20 [1] CRAN (R 4.3.2) #> tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.3.2) #> tidyselect 1.2.1 2024-03-11 [1] CRAN (R 4.3.3) #> tidyverse * 2.0.0 2023-02-22 [1] CRAN (R 4.3.2) #> timechange 0.2.0 2023-01-11 [1] CRAN (R 4.3.2) #> tzdb 0.4.0 2023-05-12 [1] CRAN (R 4.3.2) #> utf8 1.2.4 2023-10-22 [1] CRAN (R 4.3.2) #> vctrs 0.6.5 2023-12-01 [1] CRAN (R 4.3.3) #> withr 3.0.0 2024-01-16 [1] CRAN (R 4.3.3) #> xfun 0.41 2023-11-01 [1] CRAN (R 4.3.2) #> xml2 1.3.5 2023-07-06 [1] CRAN (R 4.3.2) #> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.3.2) #> #> [1] C:/Users/SepinJ/AppData/Local/R/win-library/4.3 #> [2] C:/Program Files/R/R-4.3.2/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

Expected behavior

I would expect the outputs to be more similar.

Thank you very much for any clarification!

EdoardoCostantini commented 1 month ago

The difference comes from the pre-processing done by the mice() function. In particular, the remove.lindep() function is run by default in mice to try to stabilize the imputation models. When trying to impute the data you are generating, many variables are dropped out of the imputation models before the mice.impute. univariate imputation methods are called. You can check this by running the following:

# Set seed
set.seed(20240731)

# Generate data using your dgp function
data <- dgp()

# Run mice
mice_m <- mice(
    data = data$amp, 
    m = 2, maxit = 2, 
    method = "lasso.select.norm", 
    ls.meth = "ridge", 
    printFlag = FALSE
)

# Check logged events
mice_m$loggedEvents

In the logged events, you should see that many variables have been "kicked out," meaning they never made it to the mice.impute.lasso.select.norm() function. With the wrapper you wrote, the linear dependency checks are not performed, and the lasso is the only way you handle variable selection. To avoid this pre-processing in mice() you can set the eps argument in the mice() call to 0. In your code, change the eval statement to:

        eval(parse(
          text = paste0(
            "mice_m <- ", method_fact$wrapper[i],
            "(data = data$amp, m = m, maxit = maxit, method = '",
            method_fact$method[i],
            "',ls.meth = 'ridge', printFlag = FALSE, eps = 0)"
          )
        ))

I ran your code with this change, and the resulting plot is the following:

There are no more differences between your wrapper and the standard mice().

I think I should clarify that eps = 0 is recommended when running lasso-based imputation methods. I will make a pull request to update the documentation soon.

Let me know if this solves your issue entirely or if there is anything else I can help with.