FarrellDay / miceRanger

miceRanger: Fast Imputation with Random Forests in R

Imputation of new data fails when valueSelector = "meanMatch" (on diamonds dataset) #17

Closed by sibipx 2 years ago

sibipx commented 2 years ago

This problem happens on the diamonds dataset, and I am unsure why. It only happens when valueSelector = "meanMatch". It works fine with valueSelector = "value".

If there is any way to work around this problem (other than not using PMM), please let me know.

See example below.

Thanks!

> library(miceRanger)
> library(ggplot2)
> 
> data(diamonds)
> 
> diamonds$cut <- factor(diamonds$cut, ordered = FALSE)
> diamonds$color <- factor(diamonds$color, ordered = FALSE)
> diamonds$clarity <- factor(diamonds$clarity, ordered = FALSE)
> 
> # split train / test
> N <- nrow(diamonds)
> n_test <- floor(N/3)
> 
> set.seed(2022)
> id_test <- sample(1:N, n_test)
> 
> data_train <- diamonds[-id_test,]
> data_test <- diamonds[id_test,]
> 
> data_train_miss <- amputeData(data_train, perc = 0.3)
> data_test_miss <- amputeData(data_test, perc = 0.3)
> miceRanger_imp_model <- miceRanger::miceRanger(data_train_miss, m = 2, maxiter = 2,
+                                                valueSelector = "meanMatch",
+                                                returnModels = TRUE,
+                                                verbose = TRUE)

Process started at 2022-05-19 17:51:40 

dataset 1 
iteration 1      | carat | cut | color | clarity | depth | table | price | x | y | z
iteration 2      | carat | cut | color | clarity | depth | table | price | x | y | z

dataset 2 
iteration 1      | carat | cut | color | clarity | depth | table | price | x | y | z
iteration 2      | carat | cut | color | clarity | depth | table | price | x | y | z
> 
> data_test_imp_miceRanger <- miceRanger::impute(data_test_miss,
+                                                miceRanger_imp_model, verbose = TRUE)
Error in miceObj$finalImps[[x]] : no such index at level 1

samFarrellDay commented 2 years ago

Looks like this is a data.table scoping problem that occurs in completeData. Oddly, it occurs for the carat column and not for any others.
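For readers unfamiliar with the term, here is a generic illustration of how data.table scoping can surprise you. It is only a sketch of the general pitfall, not the actual code in completeData or the fix; the toy table and the variable names are made up for the example. Inside `[.data.table`, a bare name in `j` is resolved against the columns first, so a variable in the calling scope that shares a name with a column is shadowed.

```r
library(data.table)

# Toy table (hypothetical data, not the diamonds dataset)
dt <- data.table(carat = c(0.2, 0.3), price = c(300, 400))

# A variable in the calling scope that happens to share a column name
carat <- "price"

# Inside j, `carat` resolves to the COLUMN, not the variable above
by_column <- dt[, carat]

# The `..` prefix tells data.table to look the name up in the calling scope,
# so this selects the column named by the variable, i.e. "price"
by_variable <- dt[, ..carat]

# Base list-style indexing is evaluated outside data.table's scoping rules
# and also uses the variable
by_base <- dt[[carat]]
```

Here `by_column` holds the carat values, while `by_variable` and `by_base` both refer to the price column.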

samFarrellDay commented 2 years ago

I've pushed a fix, version 1.5.1. It's not on CRAN yet. @sibipx can you install from GitHub and confirm that it is fixed for you too?
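For anyone else who needs the fix before it reaches CRAN, one way to install from GitHub is sketched below. It assumes the remotes package is acceptable in your setup and that the repository path is FarrellDay/miceRanger, as shown at the top of this page; the version check simply compares against the 1.5.1 mentioned above.

```r
# Install the remotes helper if it is not already available
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# Install the development version of miceRanger from GitHub
remotes::install_github("FarrellDay/miceRanger")

# Check that the installed version is at least the one containing the fix
packageVersion("miceRanger") >= "1.5.1"
```

Restart the R session after installing so the new version is the one that gets loaded.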

sibipx commented 2 years ago

It looks fine now, thanks!

> miceRanger_imp_model <- miceRanger::miceRanger(data_train_miss, m = 2, maxiter = 2,
+                                                valueSelector = "meanMatch",
+                                                returnModels = TRUE,
+                                                verbose = TRUE)

Process started at 2022-05-20 11:02:43 

dataset 1 
iteration 1      | carat | cut | color | clarity | depth | table | price | x | y | z
iteration 2      | carat | cut | color | clarity | depth | table | price | x | y | z

dataset 2 
iteration 1      | carat | cut | color | clarity | depth | table | price | x | y | z
iteration 2      | carat | cut | color | clarity | depth | table | price | x | y | z
> data_test_imp_miceRanger <- miceRanger::impute(data_test_miss,
+                                                miceRanger_imp_model, verbose = TRUE)

dataset 1 
iteration 1      | carat | cut | color | clarity | depth | table | price | x | y | z
iteration 2      | carat | cut | color | clarity | depth | table | price | x | y | z

dataset 2 
iteration 1      | carat | cut | color | clarity | depth | table | price | x | y | z
iteration 2      | carat | cut | color | clarity | depth | table | price | x | y | z