cimentadaj / tidyflow

A simplified and fresh workflow for doing machine learning with tidymodels
https://cimentadaj.github.io/tidyflow/

Finalizing tuning grid with rbf_sigma fails when data is not numeric #2

Open cimentadaj opened 4 years ago

cimentadaj commented 4 years ago

There is a small error coming in from dials::finalize when attempting to finalize rbf_sigma with the code below:

library(tidymodels)
devtools::load_all()
library(mlbench)
data(Ionosphere)
Ionosphere <- Ionosphere %>% select(-V2) %>% mutate(cont = 1:nrow(.)) %>% as_tibble()

svm_mod <-
  svm_rbf(cost = tune(), rbf_sigma = tune()) %>%
  set_mode("classification") %>%
  set_engine("kernlab")

iono_rec <-
  ~ recipe(Class ~ ., data = .)  %>%
    step_zv(all_predictors()) %>%
    step_mutate(V1 = factor(V1), Class = factor(Class)) %>% 
    step_dummy(V1) %>% 
    step_range(matches("V1_")) %>% 
    step_ns(cont, deg_free = tune())

tflow <-
  Ionosphere %>%
  tidyflow(seed = 4943) %>%
  plug_split(initial_split) %>% 
  plug_recipe(iono_rec) %>%
  plug_resample(bootstraps, times = 30) %>%
  plug_model(svm_mod) %>% 
  plug_grid(grid_latin_hypercube,
            cost = cost(c(-10, 10)),
            size = 1)

t1 <- tflow %>% fit()
## Error: The matrix version of the initialization data is not numeric.
## Run `rlang::last_error()` to see where the error occurred.

I've identified the problem but can't come up with an elegant solution right now. The problem is this:

dials::finalize requires the data to be entirely numeric to estimate values of rbf_sigma. Because the column V1 is a factor, it raises the error. Converting V1 to numeric inside the recipe won't fix this, because tidyflow cannot prep/juice a recipe that contains a tune() placeholder, so dials::finalize is applied to the data without the recipe. On the other hand, if the user converts V1 to numeric outside the recipe, the recipe raises another error because step_dummy requires V1 to be a factor.
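To make the cause concrete, here is a minimal base R sketch of the kind of check the error message points to ("The matrix version of the initialization data is not numeric"). The toy data frame and its values are made up for illustration, and the check is inferred from the error text, not copied from dials:

```r
# Toy predictors (hypothetical values): one factor column, one numeric column
x_mixed <- data.frame(
  V1 = factor(c("0", "1", "1", "0")),
  V3 = c(0.99, 1.00, 0.85, 0.50)
)

# A single factor column makes as.matrix() return a character matrix,
# so a numeric check on the matrix fails
mixed_is_numeric <- is.numeric(as.matrix(x_mixed))

# Converting the factor to numeric makes the whole matrix numeric
x_num <- transform(x_mixed, V1 = as.numeric(as.character(V1)))
num_is_numeric <- is.numeric(as.matrix(x_num))

print(c(mixed = mixed_is_numeric, numeric_only = num_is_numeric))
```

One non-numeric column is enough to trip the check, which is why V1 alone breaks the finalization of rbf_sigma.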

Now that I understand the problem, there is a fix, but I don't like it because it's not intuitive at all. The solution is to convert V1 to numeric outside the recipe (this is the data that will be passed to dials::finalize) and then convert V1 back to a factor with step_mutate before passing it to step_dummy.
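A rough sketch of that workaround, using dplyr and recipes only. The `toy` data is a hypothetical stand-in for Ionosphere, and the recipe leaves out the tune() placeholder so that it can be prepped here for inspection:

```r
library(dplyr)
library(recipes)

# Hypothetical stand-in for Ionosphere (made-up values)
toy <- tibble(
  Class = factor(c("good", "bad", "good", "bad", "good", "bad")),
  V1    = factor(c("1", "0", "1", "1", "0", "1")),
  V3    = c(0.99, 1.00, 0.85, 0.50, 0.20, 0.75)
)

# Step 1: make V1 numeric BEFORE the data enters the workflow, so the
# predictors handed to the finalizer are all numeric
toy_num <- toy %>% mutate(V1 = as.numeric(as.character(V1)))
predictors_numeric <- is.numeric(as.matrix(select(toy_num, -Class)))

# Step 2: inside the recipe, turn V1 back into a factor so that
# step_dummy() accepts it
rec <- recipe(Class ~ ., data = toy_num) %>%
  step_mutate(V1 = factor(V1)) %>%
  step_dummy(V1)

baked <- rec %>% prep() %>% bake(new_data = NULL)
print(names(baked))
```

In the reproducible example above, the recipe already contains `step_mutate(V1 = factor(V1), ...)`, so under this approach only the data preparation line would need the extra `mutate(V1 = as.numeric(as.character(V1)))`.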

Possible solution that I've thought about but that I've discarded:

At this point, I'm putting it here so I can organize my ideas, but I don't know how to fix this elegantly without adding an exception for finalize.

cimentadaj commented 4 years ago

Thinking about it even more, the current implementation is error-prone. For example, mtry is estimated from the number of columns in the data. If the user specifies tune() in the recipe and also removes some columns in the recipe, dials::finalize will be applied to the data without the recipe. This means that mtry will be finalized using the original data rather than the prepped data.

Note that dials::finalize should work fine as long as there are no tune values in the recipe because then the mold will have the prepped data. The only problem arises when the recipe cannot be prepped due to having a tune specification.
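The mtry concern can be illustrated with dials alone. This is a small sketch with made-up data: `raw` stands in for the data before the recipe, and `prepped` for what the recipe would produce after dropping two columns:

```r
library(dials)

# Hypothetical predictors before the recipe (four columns)
raw <- data.frame(a = 1:5, b = 1:5, c = 1:5, d = 1:5)

# Hypothetical predictors after a recipe that removes two columns
prepped <- raw[, c("a", "b")]

# finalize() fills in the unknown upper bound of mtry() from the number
# of columns in whatever data it is given
upper_raw     <- range_get(finalize(mtry(), raw))$upper
upper_prepped <- range_get(finalize(mtry(), prepped))$upper

print(c(raw = upper_raw, prepped = upper_prepped))
```

If the finalizer only ever sees `raw`, the grid can propose mtry values larger than the number of predictors the model actually receives.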