amices / mice

Multivariate Imputation by Chained Equations
https://amices.org/mice/
GNU General Public License v2.0

Multilevel imputation does not accept character or factor variable as the cluster variable; must be integer #657

Open isaactpetersen opened 1 month ago

isaactpetersen commented 1 month ago

Multilevel imputation does not appear to accept a character or factor variable as the cluster variable; the cluster variable apparently must be an integer. Note that when using 2l.pmm (from miceadds), I receive the same error as documented in the MICE discussion here, so the reproducible example below may explain why those users were experiencing the issue.

Here is a reprex (adapted from the MICE vignette here):

library("mice")
#> 
#> Attaching package: 'mice'
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind
library("miceadds")
#> * miceadds 3.17-44 (2024-01-08 19:08:24)

# Data
con <- url("https://www.gerkovink.com/mimp/popular.RData")
load(con)

dataToImpute <- popNCR2

# Specify variables to impute
Y <- "popular"

# Imputation method
meth <- make.method(dataToImpute)
meth[1:length(meth)] <- ""

# Specify predictor matrix
pred <- make.predictorMatrix(dataToImpute)
pred[1:nrow(pred), 1:ncol(pred)] <- 0
pred[Y, "class"] <- (-2) # cluster variable
pred[Y, "extrav"] <- 1 # fixed effect predictor
diag(pred) <- 0

pred
#>          pupil class extrav sex texp popular popteach
#> pupil        0     0      0   0    0       0        0
#> class        0     0      0   0    0       0        0
#> extrav       0     0      0   0    0       0        0
#> sex          0     0      0   0    0       0        0
#> texp         0     0      0   0    0       0        0
#> popular      0    -2      1   0    0       0        0
#> popteach     0     0      0   0    0       0        0

# Character
dataToImpute$class <- as.character(dataToImpute$class)

meth[Y] <- "2l.norm"
imp1 <- mice(dataToImpute, pred = pred, meth = meth, maxit = 5, print = FALSE)
#> Error in mice.impute.2l.norm(y = c(6.3, 4.9, 5.3, 4.7, 4.5, 4.7, 5.9, : No class variable

meth[Y] <- "2l.pmm"
imp2 <- mice(dataToImpute, pred = pred, meth = meth, maxit = 5, print = FALSE)
#> Error in str2lang(x): <text>:1:24: unexpected ')'
#> 1: dv._lmer ~ 1+extrav+(1|)
#>                            ^

# Factor
dataToImpute$class <- as.factor(dataToImpute$class)

meth[Y] <- "2l.norm"
imp3 <- mice(dataToImpute, pred = pred, meth = meth, maxit = 5, print = FALSE)
#> Error in check.cluster(data, predictorMatrix): Convert cluster variable class to integer by as.integer()

meth[Y] <- "2l.pmm"
imp4 <- mice(dataToImpute, pred = pred, meth = meth, maxit = 5, print = FALSE)
#> Error in check.cluster(data, predictorMatrix): Convert cluster variable class to integer by as.integer()

# Integer
dataToImpute$class <- as.integer(dataToImpute$class)

meth[Y] <- "2l.norm"
imp5 <- mice(dataToImpute, pred = pred, meth = meth, maxit = 5, print = FALSE)

meth[Y] <- "2l.pmm"
imp6 <- mice(dataToImpute, pred = pred, meth = meth, maxit = 5, print = FALSE)

sessionInfo()
#> R version 4.3.1 (2023-06-16 ucrt)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 11 x64 (build 22631)
#> 
#> Matrix products: default
#> 
#> 
#> Random number generation:
#>  RNG:     Mersenne-Twister 
#>  Normal:  Inversion 
#>  Sample:  Rounding 
#>  
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 
#> [2] LC_CTYPE=English_United States.utf8   
#> [3] LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.utf8    
#> 
#> time zone: America/Chicago
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] miceadds_3.17-44 mice_3.16.0     
#> 
#> loaded via a namespace (and not attached):
#>  [1] utf8_1.2.4        generics_0.1.3    tidyr_1.3.1       shape_1.4.6.1    
#>  [5] lattice_0.22-6    lme4_1.1-35.5     digest_0.6.36     magrittr_2.0.3   
#>  [9] mitml_0.4-5       evaluate_0.24.0   grid_4.3.1        iterators_1.0.14 
#> [13] fastmap_1.2.0     foreach_1.5.2     jomo_2.7-6        glmnet_4.1-8     
#> [17] Matrix_1.6-5      nnet_7.3-19       backports_1.5.0   DBI_1.2.3        
#> [21] survival_3.7-0    purrr_1.0.2       fansi_1.0.6       codetools_0.2-20 
#> [25] cli_3.6.3         mitools_2.4       rlang_1.1.4       splines_4.3.1    
#> [29] reprex_2.1.1      withr_3.0.0       yaml_2.3.10       pan_1.9          
#> [33] tools_4.3.1       nloptr_2.1.1      minqa_1.2.7       dplyr_1.1.4      
#> [37] boot_1.3-30       broom_1.0.6       vctrs_0.6.5       R6_2.5.1         
#> [41] rpart_4.1.23      lifecycle_1.0.4   fs_1.6.4          MASS_7.3-60.0.1  
#> [45] pkgconfig_2.0.3   pillar_1.9.0      glue_1.7.0        Rcpp_1.0.13      
#> [49] xfun_0.46         tibble_3.2.1      tidyselect_1.2.1  rstudioapi_0.16.0
#> [53] knitr_1.48        htmltools_0.5.8.1 nlme_3.1-165      rmarkdown_2.27   
#> [57] compiler_4.3.1

Created on 2024-07-31 with reprex v2.1.1
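
Until character and factor cluster variables are supported, one possible workaround is to recode the cluster identifier to integer codes before imputation and keep a lookup table so the original labels can be restored afterwards. The sketch below uses base R only. `encode_cluster()` is a hypothetical helper, not a function from mice or miceadds, and the toy data frame merely stands in for `popNCR2`.

```r
# Hypothetical helper (not part of mice or miceadds): recode a cluster
# identifier of any atomic type to integer codes and keep a lookup table.
encode_cluster <- function(x) {
  f <- factor(x)  # accepts character, factor, integer, or numeric input
  list(
    codes  = as.integer(f),  # integer codes 1..K, NA stays NA
    lookup = data.frame(
      code  = seq_along(levels(f)),
      label = levels(f),
      stringsAsFactors = FALSE
    )
  )
}

# Toy data (assumption: any data frame with a character cluster column)
toy <- data.frame(
  class   = c("a12", "a12", "b07", "b07", "c03"),
  popular = c(6.3, NA, 5.3, 4.7, NA),
  stringsAsFactors = FALSE
)

enc <- encode_cluster(toy$class)
toy$class <- enc$codes  # integer cluster variable, as in the "Integer" case above

# After imputation, the labels can be restored from the lookup table, e.g.:
# completed$class <- enc$lookup$label[match(completed$class, enc$lookup$code)]
```

Going through `factor()` first gives every input type the same treatment, and the lookup table avoids losing the original labels, which a bare `as.integer()` on a factor would discard.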

stefvanbuuren commented 3 weeks ago

Thanks for your note. This is indeed a problem case that is not correctly caught.

The problem is caused by the automatic removal of character variables at initialization. mice records each such removal in loggedEvents. However, we never see these messages because the program crashes before it can return a mids object.
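
To see the logged removals, one can run mice on data where a character column is present but is not needed as a cluster variable, so that the run completes and returns a mids object. The following is a minimal sketch under that assumption, using the `nhanes` data shipped with mice; the added `id` column is purely illustrative, and the exact content of the log may differ between mice versions.

```r
library(mice)

# Add an illustrative character column to the nhanes example data
dat <- nhanes
dat$id <- as.character(seq_len(nrow(dat)))

# A single short run; no multilevel method is involved, so mice does not crash
imp <- mice(dat, m = 1, maxit = 1, printFlag = FALSE, seed = 1)

# If the removal described above applies, the character column should be
# listed here
imp$loggedEvents
```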

More generally, the handling of cluster variables could be improved, and better support could be provided for factor, character, integer and numeric cluster variables.

Something for the wish list. Not a priority for me right now, but I'd be happy to take any pull requests.
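
As a possible starting point for the improved handling of cluster variables mentioned above, the sketch below checks the type of every column flagged with -2 in the predictor matrix and stops with an explicit message before any variable is removed. `check_cluster_types()` is a hypothetical function written for illustration and is not the check that mice currently performs. It assumes that any numeric (integer or double) cluster column is acceptable and that everything else should be rejected.

```r
# Hypothetical pre-flight check (not a mice function): stop early when a
# column flagged as cluster variable (-2) is not numeric.
check_cluster_types <- function(data, predictorMatrix) {
  is_cluster <- apply(predictorMatrix == -2, 2, any)
  cluster_vars <- colnames(predictorMatrix)[is_cluster]
  for (j in cluster_vars) {
    if (!is.numeric(data[[j]])) {  # FALSE for character and factor columns
      stop("Cluster variable '", j, "' has class '", class(data[[j]])[1],
           "'; convert it to integer codes before calling mice()",
           call. = FALSE)
    }
  }
  invisible(TRUE)
}

# Toy example (assumption: 'class' is the cluster variable for 'y')
toy <- data.frame(class = c("a", "a", "b"), x = c(1, 2, 3), y = c(NA, 2, 1),
                  stringsAsFactors = FALSE)
pred <- matrix(0, 3, 3, dimnames = list(names(toy), names(toy)))
pred["y", "class"] <- -2
pred["y", "x"] <- 1

# With a character cluster variable the check fails with a clear message
res <- tryCatch(check_cluster_types(toy, pred),
                error = function(e) conditionMessage(e))

# After recoding to integer codes the check passes
toy$class <- as.integer(factor(toy$class))
ok <- check_cluster_types(toy, pred)
```

Running a check like this on the data as supplied by the user, before character variables are dropped, would turn the "No class variable" and `str2lang` errors from the reprex into a single actionable message.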