amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Imputation in high dimensional data by mice package #477

Closed · bugravarol closed this 2 years ago

bugravarol commented 2 years ago

Hello everybody. Using the mice package, I simulated a high-dimensional dataset and generated missingness in the first 50 variables (25% per variable). When I then try to impute these missing values with the methods in mice, I get the warnings below. I searched for this but couldn't figure it out. I added the argument (ls.meth = "ridge") inside the function, but I keep getting the warning.

I couldn't find where I made a mistake. How can I make correct imputations with the mice package for a high-dimensional dataset simulated in this way?

out
1 df set to 1. # observed cases: 63 # predictors: 120
2 V2, V3, V4, V6, V9, V14, V15, V17, V19, V23, V25, V26, V28, V30, V31, V34, V36, V38, V39, V46, V49, V52, V53, V54, V64, V66, V67, V68, V70, V73, V79, V82, V84, V89, V91, V93, V94, V95, V99, V100, V101, V106, V107, V108, V109, V111, V117, V120
3 mice detected that your data are (nearly) multi-collinear.
  It applied a ridge penalty to continue calculations, but the results can be unstable.
  Does your dataset contain duplicates, linear transformation, or factors with unique respondent names?
4 df set to 1. # observed cases: 69 # predictors: 120
5 V1, V3, V4, V5, V14, V21, V22, V25, V27, V30, V35, V37, V41, V43, V44, V47, V51, V53, V55, V58, V59, V60, V61, V68, V71, V72, V81, V82, V84, V90, V93, V94, V97, V101, V105, V108, V111, V117
6 mice detected that your data are (nearly) multi-collinear.
  It applied a ridge penalty to continue calculations, but the results can be unstable.
  Does your dataset contain duplicates, linear transformation, or factors with unique respondent names?

## My code

install.packages("MASS")
install.packages("stats")
install.packages("mice")

library(MASS)
library(stats)
library(mice)

######################################################
# DATA FUNCTION
######################################################
#  rm(list = ls())
generateData <- function(n, p) {
  # Toeplitz correlation matrix, correlations decaying linearly from 0.80 to 0.40
  pr <- seq(0.80, 0.40, length.out = p)
  pr[1] <- 1                                   # unit variances on the diagonal
  covmat <- toeplitz(pr)
  mu <- rep(0, p)
  X_ <- data.frame(mvrnorm(n, mu = mu, Sigma = covmat))
  X <- unname(as.matrix(sample(X_)))           # shuffle the column order
  vCoef <- rnorm(ncol(X))
  vProb <- exp(X %*% vCoef) / (1 + exp(X %*% vCoef))  # logistic probabilities
  Y <- rbinom(nrow(X), 1, vProb)               # binary outcome
  mydata <- data.frame(cbind(X, Y))
  return(mydata)
}

######################################################
# SIMULATED DATA
######################################################
n <- 100
p <- 120
data <- generateData(n, p)
# table(data[ncol(data)])
X <- data[-ncol(data)]   # the 120 predictors
Y <- data[ncol(data)]    # the binary outcome

###### PATTERN #######
myfreq <- 0.25
pstar <- 50    # missingness is confined to the first 50 variables
npat <- 120    # number of distinct missing-data patterns
mypatterns <- matrix(1, nrow = npat, ncol = p)
for (i in 1:npat) {
  # each pattern marks myfreq * n = 25 of the first pstar columns as missing
  idx <- sample(x = 1:pstar, size = myfreq * n, replace = FALSE)
  mypatterns[i, idx] <- 0
}

####### MCAR MECHANISM #########
my_bycases <- TRUE
my_prop <- 0.5   # proportion of rows that receive a missing-data pattern
data_mcar <- ampute(X, prop = my_prop, patterns = mypatterns, mech = "MCAR",
                    type = "RIGHT", bycases = my_bycases)  # type is only used for MAR/MNAR weighting
data_mcar_missing <- data_mcar$amp

####### IMPUTATION #########
complete_datasets <- mice(data_mcar_missing, m = 2, defaultMethod = "pmm")
complete_datasets$loggedEvents

# all_imp_data <- mice::complete(complete_datasets, "all")
stefvanbuuren commented 2 years ago

By default, mice relies on linear regression for imputation. The message df set to 1. # observed cases: 63 # predictors: 120 tells you that the imputation model has more free parameters than observed cases. Since we cannot work with negative degrees of freedom (df), mice sets the df to the minimal value of 1. In order to move on with the calculations, mice removes predictors one by one. In your case it takes out about 60 variables.
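For intuition, here is a minimal sketch (not from this thread; the dimensions are arbitrary) that triggers the same logged event on a toy set with more columns than cases:

library(mice)
set.seed(1)

# 20 cases, 30 numeric columns: more predictors than observed cases
toy <- as.data.frame(matrix(rnorm(20 * 30), nrow = 20))
toy[sample(20, 5), 1] <- NA      # make the first column incomplete

imp <- mice(toy, m = 1, maxit = 1, printFlag = FALSE)
imp$loggedEvents                 # shows "df set to 1 ..." plus removed predictors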

Subsequently, the message mice detected that your data are (nearly) multi-collinear signals that the remaining model is still over-parameterized, so mice takes rescue measures (the ridge penalty) in order not to crash.
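If you do want to stay with the linear default, the penalty can be made explicit: named arguments to mice() are passed down to the univariate imputation functions, so you can supply a ridge value yourself (1e-03 below is an arbitrary illustration, not a recommendation):

# Sketch: pass a larger explicit ridge penalty through to the imputation
# functions, then inspect which rescue measures were still needed.
imp_ridge <- mice(data_mcar_missing, m = 2, ridge = 1e-03, printFlag = FALSE)
imp_ridge$loggedEvents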

Then the story repeats for the next variable, and so on...

In cases like these, use quickpred() to quickly trim down the imputation model. Alternatively, try the methods cart or rf, which are less sensitive to over-parameterized systems, or lasso.norm for regression with an L1 penalty; see the sketch below.
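A sketch of these remedies on the data from the question (the mincor threshold, m, and other settings are illustrative choices, not recommendations):

# 1. Trim the predictor matrix: keep only predictors correlating at least
#    0.30 with each incomplete target, so every imputation model stays
#    well below the number of observed cases.
pred <- quickpred(data_mcar_missing, mincor = 0.30)
imp_pmm <- mice(data_mcar_missing, m = 2, method = "pmm",
                predictorMatrix = pred, printFlag = FALSE)

# 2. Tree-based imputation, which copes better with many predictors
#    (method "rf" needs the ranger or randomForest package installed):
imp_cart <- mice(data_mcar_missing, m = 2, method = "cart", printFlag = FALSE)
imp_rf   <- mice(data_mcar_missing, m = 2, method = "rf",   printFlag = FALSE)

# 3. Linear regression with an L1 (lasso) penalty (needs glmnet):
imp_lasso <- mice(data_mcar_missing, m = 2, method = "lasso.norm", printFlag = FALSE)

imp_pmm$loggedEvents   # check whether any rescue measures were still logged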