HealthCatalyst / healthcareai-r

R tools for healthcare machine learning
https://docs.healthcare.ai

machine_learn ignores ordered factors #1296

Open ecorreig opened 3 years ago

ecorreig commented 3 years ago

The function machine_learn outputs models with the wrong reference levels for ordered factors. This code:

library(dplyr)
library(healthcareai)

sino <- c("No", "Sí")
# x and y become ordered factors; z (the outcome) is a plain factor
df <- tibble(
  x = sample(sino, 100, replace = TRUE),
  y = sample(sino, 100, replace = TRUE),
  z = sample(sino, 100, replace = TRUE),
  a = 1:100
) %>%
  mutate(
    across(c(x, y), function(x) factor(x, ordered = TRUE)),
    z = as.factor(z)
  )
# Random forest only
mod <- machine_learn(df, outcome = z, models = "rf")

get_variable_importance(mod) %>%
  plot()

gives me:

[image: variable importance plot]
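
For anyone trying to narrow this down, here is a minimal base-R sketch of how ordered and unordered factors are encoded differently by default. It does not call healthcareai at all: the names x_chr, x_ord, x_unord and drop_order are made up for illustration, and whether machine_learn goes through the same contrast handling internally is an assumption on my part, not something I have confirmed. The last part shows a possible workaround (dropping the ordering before modelling), which I have not verified against machine_learn.

```r
# Base R only; illustrative names, not healthcareai internals.
sino <- c("No", "Sí")
x_chr <- c("No", "Sí", "Sí", "No")

x_ord   <- factor(x_chr, levels = sino, ordered = TRUE)
x_unord <- factor(x_chr, levels = sino)

# By default R uses treatment contrasts for unordered factors
# and polynomial contrasts for ordered ones.
getOption("contrasts")

# The ordered factor yields a ".L" (linear) column rather than a
# dummy column named after the non-reference level.
cols_ord   <- colnames(model.matrix(~ x_ord))
cols_unord <- colnames(model.matrix(~ x_unord))
print(cols_ord)
print(cols_unord)

# Hypothetical helper: drop the ordering but keep the level order,
# so the first level ("No") acts as the reference level.
drop_order <- function(f) factor(as.character(f), levels = levels(f))

x_fixed <- drop_order(x_ord)
is.ordered(x_fixed)   # FALSE
levels(x_fixed)
```

In the example from this issue, the same helper could be applied to x and y with across() before calling machine_learn, to check whether the plot then shows the expected levels.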

sessionInfo():

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=Catalan_Spain.1252  LC_CTYPE=Catalan_Spain.1252    LC_MONETARY=Catalan_Spain.1252
[4] LC_NUMERIC=C                   LC_TIME=Catalan_Spain.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] forcats_0.5.0       caret_6.0-86        lattice_0.20-41     ggplot2_3.3.2       cvAUC_1.1.0
 [6] data.table_1.13.0   ROCR_1.0-11         healthcareai_2.5.0  compareGroups_4.4.5 missForest_1.4
[11] itertools_0.1-3     iterators_1.0.12    foreach_1.5.0       randomForest_4.6-14 dplyr_1.0.2