egenn / rtemis

Advanced Machine Learning and Visualization
https://rtemis.org
GNU General Public License v3.0

preprocess impute missing cases: Error when using missRanger #25

Closed jonas-sk closed 4 years ago

jonas-sk commented 4 years ago

When using the preprocess command with impute = TRUE and otherwise default values (i.e. impute.type = "missRanger"), the following error occurs:

Error in `[.data.frame`(data, , relevantVars[[1]], drop = FALSE) : undefined columns selected

The error does not appear when using missForest.

egenn commented 4 years ago

Hi, this is also likely caused by missing dependencies. I added checks for each imputation method in commit 08f006727cb4f78266172c68bdc94ae9dfafa6b5. Let me know if that helps. Thanks.

jonas-sk commented 4 years ago

Thank you for the quick answer! Dependencies are installed and I didn't get an error after updating rtemis and re-running the code.

The error appears when you recode any of the -88, -77, -99 columns in the data set uploaded in issue #24 to NA_character_.

egenn commented 4 years ago

You need to provide a minimal reproducible example. The replacement should be NA, not NA_character_; the latter converts your numeric column to character. Even then, impute works for me.
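To illustrate the point about NA types, here is a minimal base R sketch. The vector `x` and the name `cleaned` are hypothetical stand-ins for one column of the data set, not taken from the uploaded file; the idea is only to show how the choice of NA constant affects the column type.

```r
# Hypothetical numeric column containing the sentinel codes -99, -88, -77.
x <- c(1, -99, 3, -88, -77)

class(c(x, NA))             # "numeric": plain NA adapts to the vector type
class(c(x, NA_real_))       # "numeric"
class(c(x, NA_character_))  # "character": the whole vector is coerced

# Base R recode of the sentinel codes that keeps the column numeric.
cleaned <- replace(x, x %in% c(-99, -88, -77), NA)
cleaned
```

Applied to a data frame, the same `replace()` call could be run over each column with `lapply()`.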

egenn commented 4 years ago

Looking at the data, I guess these should all be converted to factors, right?

jonas-sk commented 4 years ago

Apologies, I totally forgot to provide a reproducible example. The reason why I initially used NA_character_ is that my original table still had labels behind the values, which I removed in a previous step, so I was just respecting the column variable types. However, even using the table I have sent you, I still get the same error. The following reproduces the error for me:

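library(readr)   # read_csv()
library(dplyr)   # %>%, mutate_all(), recode()
library(rtemis)  # preprocess()

read_csv("cases_test.csv") %>% 
  mutate_all(list(~ dplyr::recode(.,`-99` = NA_real_,
                                  `-88` = NA_real_,
                                  `-77` = NA_real_
                                  ))) %>% 
  preprocess(impute = TRUE, numeric2factor = TRUE)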

This is the full output:

Parsed with column specification:
cols(
  .default = col_double()
)
See spec(...) for full column specifications.
[2020-06-28 13:47:01 preprocess] Converting numeric to factor 
[2020-06-28 13:47:01 preprocess] Imputing missing values using missRanger... 

Missing value imputation by random forests
Error in `[.data.frame`(data, , relevantVars[[1]], drop = FALSE) : 
  undefined columns selected

Removing numeric2factor = TRUE leads to the same error.

Sidenote: If you use NA instead of NA_real_ (or NA_character_), the whole table, at least in my case, will consist of NAs.

egenn commented 4 years ago

This is a readr + missRanger issue: missRanger cannot handle column names beginning with numbers, and such names are generally best avoided in R. Base R read.csv adds an X in front of the column name in those cases; read_csv does not.
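The naming difference can be seen without readr. The sketch below uses a small inline CSV with made-up column names (`1a`, `2b`, `ok`), not the actual columns of cases_test.csv. It assumes that `check.names = FALSE` approximates what read_csv returns, and it shows one possible way to repair the names afterwards with `make.names()`:

```r
# Hypothetical CSV whose first two column names begin with a digit.
csv_text <- "1a,2b,ok\n1,-99,3\n4,5,-88"

# Default read.csv() (check.names = TRUE) repairs the names via make.names().
base_df <- read.csv(text = csv_text)
names(base_df)  # "X1a" "X2b" "ok"

# check.names = FALSE keeps the names as written in the file, similar to
# what readr::read_csv() returns.
raw_df <- read.csv(text = csv_text, check.names = FALSE)
names(raw_df)   # "1a" "2b" "ok"

# Repair the names after the fact, e.g. on a tibble that came from read_csv().
names(raw_df) <- make.names(names(raw_df), unique = TRUE)
names(raw_df)   # "X1a" "X2b" "ok"
```

With syntactic names in place, a readr-based pipeline should in principle behave like the read.csv one, although that has not been checked here against missRanger itself.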

This works:

read.csv("cases_test.csv") %>% 
  mutate_all(list(~ dplyr::recode(.,`-99` = NA_integer_,
                                  `-88` = NA_integer_,
                                  `-77` = NA_integer_
                                  ))) %>% 
  preprocess(impute = TRUE, numeric2factor = TRUE) -> dat

and in base:

dat <- read.csv("cases_test.csv")
dat[dat == "-99"] <- dat[dat == "-88"] <- dat[dat == "-77"] <- NA
dat <- preprocess(dat, numeric2factor = TRUE, impute = TRUE)