cimentadaj / tidyflow

A simplified and fresh workflow for doing machine learning with tidymodels
https://cimentadaj.github.io/tidyflow/

`fit` doesn't work with a formula and `rsample` because of character expansion #12

Closed cimentadaj closed 4 years ago

cimentadaj commented 4 years ago

When passing a formula to a tidyflow whose data has character variables, hardhat::mold expands each character column into N columns, one per category of the variable. When the result is passed to fit_resamples, it raises an error because these new columns weren't specified in the formula in the first place:

library(tidyflow)
library(parsnip)
library(rsample)
mtcars$gear <- as.character(mtcars$gear)

# Happens because the formula expands `gear` to have three columns
# and then when passed to `fit_resamples`, it says it can't find
# these new columns (as expected, since they weren't defined in
# the formula).
tflow <-
  mtcars %>%
  tidyflow(seed = 23151) %>%
  plug_formula(mpg ~ .) %>%
  plug_resample(vfold_cv) %>%
  plug_model(set_engine(linear_reg(), "lm"))

fit(tflow)
#> 
#> Attaching package: 'tune'
#> The following object is masked from 'package:tidyflow':
#> 
#>     parameters
#> x Fold01: model (predictions): Error: Can't subset columns that don't exist.
#> ✖ Col...
#> x Fold02: model (predictions): Error: Can't subset columns that don't exist.
#> ✖ Col...
#> x Fold03: model (predictions): Error: Can't subset columns that don't exist.
#> ✖ Col...
#> x Fold04: model (predictions): Error in `contrasts<-`(`*tmp*`, value = contr.funs[...
#> x Fold05: model (predictions): Error: Can't subset columns that don't exist.
#> ✖ Col...
#> x Fold06: model (predictions): Error in `contrasts<-`(`*tmp*`, value = contr.funs[...
#> x Fold07: model (predictions): Error: Can't subset columns that don't exist.
#> ✖ Col...
#> x Fold08: model (predictions): Error: Can't subset columns that don't exist.
#> ✖ Col...
#> ! Fold09: model (predictions): prediction from a rank-deficient fit may be misleading
#> x Fold10: model (predictions): Error: Can't subset columns that don't exist.
#> ✖ Col...
#> ══ Tidyflow [tuned] ════════════════════════════════════════════════════════════
#> Data: 32 rows x 11 columns
#> Split: None
#> Formula: mpg ~ .
#> Resample: vfold_cv w/ default args
#> Grid: None
#> Model:
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm 
#> 
#> ══ Results ═════════════════════════════════════════════════════════════════════
#> 
#> Tuning results: 
#> 
#> # A tibble: 5 x 4
#>   splits         id     .metrics         .notes          
#>   <list>         <chr>  <list>           <list>          
#> 1 <split [28/4]> Fold01 <tibble [0 × 3]> <tibble [1 × 1]>
#> 2 <split [28/4]> Fold02 <tibble [0 × 3]> <tibble [1 × 1]>
#> 3 <split [29/3]> Fold03 <tibble [0 × 3]> <tibble [1 × 1]>
#> 4 <split [29/3]> Fold04 <tibble [0 × 3]> <tibble [1 × 1]>
#> 5 <split [29/3]> Fold05 <tibble [0 × 3]> <tibble [1 × 1]>
#> 
#> ... and 5 more lines.
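
The expansion itself can be reproduced without tidyflow. The following is a minimal base R sketch, assuming `model.matrix()` as a stand-in for the formula preprocessing step (the `cars` data frame and the choice of columns are only illustrative, not taken from the issue). It shows how one character column turns into several indicator columns whose names never appeared in the original data:

```r
# Illustrative data: keep a few columns and make `gear` a character column.
cars <- mtcars[, c("mpg", "wt", "gear")]
cars$gear <- as.character(cars$gear)

# Expand the predictors without an intercept; `gear` becomes one
# indicator column per category.
expanded <- model.matrix(mpg ~ 0 + wt + gear, data = cars)

# Column names created by the expansion that do not exist in `cars`.
new_cols <- setdiff(colnames(expanded), names(cars))
new_cols
```

Any later step that looks up `new_cols` in the original data frame will not find them, which is the kind of mismatch the "Can't subset columns that don't exist" messages point to.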

This doesn't happen with a recipe, because a recipe doesn't convert character columns to one-hot encodings:

# Doesn't happen with recipe
tflow %>%
  drop_formula() %>%
  plug_recipe(~ recipes::recipe(mpg ~ ., data = .)) %>%
  fit()
#> ══ Tidyflow [tuned] ════════════════════════════════════════════════════════════
#> Data: 32 rows x 11 columns
#> Split: None
#> Recipe: available
#> Resample: vfold_cv w/ default args
#> Grid: None
#> Model:
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: lm 
#> 
#> ══ Results ═════════════════════════════════════════════════════════════════════
#> 
#> Tuning results: 
#> 
#> # A tibble: 5 x 4
#>   splits         id     .metrics         .notes          
#>   <list>         <chr>  <list>           <list>          
#> 1 <split [28/4]> Fold01 <tibble [2 × 3]> <tibble [0 × 1]>
#> 2 <split [28/4]> Fold02 <tibble [2 × 3]> <tibble [0 × 1]>
#> 3 <split [29/3]> Fold03 <tibble [2 × 3]> <tibble [0 × 1]>
#> 4 <split [29/3]> Fold04 <tibble [2 × 3]> <tibble [0 × 1]>
#> 5 <split [29/3]> Fold05 <tibble [2 × 3]> <tibble [0 × 1]>
#> 
#> ... and 5 more lines.

Reported at https://github.com/tidymodels/hardhat/issues/139. One solution would be to convert character columns to factors before passing the data to mold, and then convert them back to characters in the result of mold. However, before doing that, I want to make sure mold is working correctly. Perhaps this is just an easy fix on their side.
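
The conversion half of that workaround could look roughly like the sketch below. The helpers `chr_to_fct()` and `fct_to_chr()` are hypothetical names introduced here for illustration (they are not part of tidyflow or hardhat), and the sketch deliberately leaves out the call to mold itself:

```r
# Hypothetical helper: turn character columns into factors and remember
# which columns were converted.
chr_to_fct <- function(data) {
  chr_cols <- names(data)[vapply(data, is.character, logical(1))]
  if (length(chr_cols) > 0) {
    data[chr_cols] <- lapply(data[chr_cols], factor)
  }
  list(data = data, chr_cols = chr_cols)
}

# Hypothetical helper: turn the remembered columns back into characters.
fct_to_chr <- function(data, chr_cols) {
  if (length(chr_cols) > 0) {
    data[chr_cols] <- lapply(data[chr_cols], as.character)
  }
  data
}

cars <- mtcars
cars$gear <- as.character(cars$gear)

converted <- chr_to_fct(cars)
restored <- fct_to_chr(converted$data, converted$chr_cols)
```

Keeping the list of converted column names is what makes the round trip safe: columns that were factors from the start are left alone on the way back.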

cimentadaj commented 4 years ago

The same problem, with some odd behavior, can be seen here without a resample, when trying to fit a bagged tree model with the pisa data. Without specifying a resample, the character column VER_DAT gets expanded and the fit fails with an error. I believe this is because bag_tree uses bootstraps behind the scenes, making this the same error as above.

However, this intuition doesn't fit the results: I can convert a column of mtcars to character and fit the same model, and it fits correctly after expanding the columns. Even more surprisingly, removing the VER_DAT column from pisa lets the model run without a problem:

library(baguette)
#> Loading required package: parsnip
library(tidyflow)
library(rsample)

data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)

mod1 <- set_engine(bag_tree(mode = "regression"),
                   "rpart",
                   times = 3)

tflow <-
  pisa %>%
  tidyflow(seed = 23151) %>%
  plug_split(initial_split) %>%
  plug_formula(math_score ~ .) %>%
  plug_model(mod1)

# The error is undefined columns. After inspecting what's happening,
# it is because the column VER_DAT gets expanded because it's a character
fit(tflow)
#> Error: All of the models failed. An example message was:
#>   Error in `[.data.frame`(m, labs) : undefined columns selected
#> Timing stopped at: 0.844 0.02 0.864

# Works now
pisa$VER_DAT <- NULL
tflow %>%
  replace_data(pisa) %>%
  fit()
#> ══ Tidyflow [trained] ══════════════════════════════════════════════════════════
#> Data: 4.84K rows x 501 columns
#> Split: initial_split w/ default args
#> Formula: math_score ~ .
#> Resample: None
#> Grid: None
#> Model:
#> Bagged Decision Tree Model Specification (regression)
#> 
#> Main Arguments:
#>   cost_complexity = 0
#>   min_n = 2
#> 
#> Engine-Specific Arguments:
#>   times = 3
#> 
#> Computational engine: rpart 
#> 
#> ══ Results ═════════════════════════════════════════════════════════════════════
#> 
#> 
#> Fitted model:
#> Bagged CART (regression with 3 members)
#> 
#> Variable importance scores include:
#> 
#> # A tibble: 500 x 4
#> 
#> ...
#> and 14 more lines.

# This is the same problem as the first comment in the issue.
# However, why doesn't this fail?
mtcars$gear <- as.character(mtcars$gear)
tflow %>%
  replace_data(mtcars) %>%
  replace_formula(mpg ~ .) %>%
  fit() %>%
  pull_tflow_fit()
#> parsnip model object
#> 
#> Fit time:  507ms 
#> Bagged CART (regression with 3 members)
#> 
#> Variable importance scores include:
#> 
#> # A tibble: 12 x 4
#>    term    value std.error  used
#>    <chr>   <dbl>     <dbl> <int>
#>  1 disp  748.       178.       3
#>  2 wt    672.       192.       3
#>  3 hp    636.        68.5      3
#>  4 drat  563.       121.       3
#>  5 cyl   218.        51.9      3
#>  6 qsec  122.        30.4      3
#>  7 carb  106.        99.1      2
#>  8 vs     65.3       23.0      3
#>  9 gear4  44.5       38.8      3
#> 10 gear3  22.9       21.6      3
#> 11 am      5.25       4.88     3
#> 12 gear5   0.016      0        1

# In fact, you can see that the column gear gets expanded and
# fitted correctly

cimentadaj commented 4 years ago

tidymodels has worked quite a bit on the one-hot encoding. Here's the summary of what I've read and how it impacts tidyflow.

library(hardhat)
mtcars$gear <- as.character(mtcars$gear)
res <- mold(formula = mpg ~ .,
            data = mtcars,
            blueprint = hardhat::default_formula_blueprint())

res
#> $predictors
#> # A tibble: 32 x 12
#>      cyl  disp    hp  drat    wt  qsec    vs    am gear3 gear4 gear5  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     6  160    110  3.9   2.62  16.5     0     1     0     1     0     4
#>  2     6  160    110  3.9   2.88  17.0     0     1     0     1     0     4
#>  3     4  108     93  3.85  2.32  18.6     1     1     0     1     0     1
#>  4     6  258    110  3.08  3.22  19.4     1     0     1     0     0     1
#>  5     8  360    175  3.15  3.44  17.0     0     0     1     0     0     2
#>  6     6  225    105  2.76  3.46  20.2     1     0     1     0     0     1
#>  7     8  360    245  3.21  3.57  15.8     0     0     1     0     0     4
#>  8     4  147.    62  3.69  3.19  20       1     0     0     1     0     2
#>  9     4  141.    95  3.92  3.15  22.9     1     0     0     1     0     2
#> 10     6  168.   123  3.92  3.44  18.3     1     0     0     1     0     4
#> # … with 22 more rows
#> 
#> $outcomes
#> # A tibble: 32 x 1
#>      mpg
#>    <dbl>
#>  1  21  
#>  2  21  
#>  3  22.8
#>  4  21.4
#>  5  18.7
#>  6  18.1
#>  7  14.3
#>  8  24.4
#>  9  22.8
#> 10  19.2
#> # … with 22 more rows
#> 
#> $blueprint
#> Formula blueprint: 
#>  
#> # Predictors: 10 
#>   # Outcomes: 1 
#>    Intercept: FALSE 
#> Novel Levels: FALSE 
#>   Indicators: traditional 
#> 
#> $extras
#> $extras$offset
#> NULL

It seems that at this moment, this is not entirely stable (https://github.com/tidymodels/tune/issues/262) but it's very close.

How does this affect tidyflow? Well, once https://github.com/tidymodels/tune/issues/262 is fixed, there should be no problem. Users can specify character columns and they will not get expanded.

res
#> $predictors
#> # A tibble: 32 x 10
#>      cyl  disp    hp  drat    wt  qsec    vs    am  carb gear
#>
#>  1     6  160    110  3.9   2.62  16.5     0     1     4 4
#>  2     6  160    110  3.9   2.88  17.0     0     1     4 4
#>  3     4  108     93  3.85  2.32  18.6     1     1     1 4
#>  4     6  258    110  3.08  3.22  19.4     1     0     1 3
#>  5     8  360    175  3.15  3.44  17.0     0     0     2 3
#>  6     6  225    105  2.76  3.46  20.2     1     0     1 3
#>  7     8  360    245  3.21  3.57  15.8     0     0     4 3
#>  8     4  147.    62  3.69  3.19  20       1     0     2 4
#>  9     4  141.    95  3.92  3.15  22.9     1     0     2 4
#> 10     6  168.   123  3.92  3.44  18.3     1     0     4 4
#> # … with 22 more rows
#>
#> $outcomes
#> # A tibble: 32 x 1
#>      mpg
#>
#>  1  21
#>  2  21
#>  3  22.8
#>  4  21.4
#>  5  18.7
#>  6  18.1
#>  7  14.3
#>  8  24.4
#>  9  22.8
#> 10  19.2
#> # … with 22 more rows
#>
#> $blueprint
#> Formula blueprint:
#>
#> # Predictors: 10
#>   # Outcomes: 1
#>    Intercept: FALSE
#> Novel Levels: FALSE
#>   Indicators: none
#>
#> $extras
#> $extras$offset
#> NULL
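
The printout above reports `Indicators: none`, but the call that produced it is not shown. A plausible sketch, assuming a hardhat version in which `default_formula_blueprint()` accepts `indicators` as a string (the object names `cars`, `res_trad` and `res_none` are introduced here for illustration), is:

```r
library(hardhat)

cars <- mtcars
cars$gear <- as.character(cars$gear)

# Default blueprint: `gear` is expanded into indicator columns.
res_trad <- mold(
  mpg ~ ., cars,
  blueprint = default_formula_blueprint()
)

# indicators = "none": `gear` is passed through as a single column.
res_none <- mold(
  mpg ~ ., cars,
  blueprint = default_formula_blueprint(indicators = "none")
)

names(res_trad$predictors)
names(res_none$predictors)
```

Comparing the two sets of predictor names makes the difference visible without printing the full mold result.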



* What if a user wants to expand factor columns but not character columns? It is not clear to me how `tidymodels` handles this. It seems that the `indicators` argument lets us suppress or allow one-hot encoding, but not choose selectively whether factors and characters each get special treatment.

cimentadaj commented 4 years ago

I was confused in the previous comment. It seems that both characters and factors get expanded into one-hot encodings within models. However, there is no error now, since the expansion happens later, inside tune_* and fit_*. The previous error was coming from hardhat.

Now that the error is gone, for tidyflow to be user friendly we would ideally need an argument that signals whether to use one-hot encoding for characters, since we assume one-hot encoding for factors to be the default. Currently, fit_resamples automatically turns character columns into dummies:

library(tidyflow)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(parsnip)
library(hardhat)
library(dials)
#> Loading required package: scales
#> 
#> Attaching package: 'dials'
#> The following object is masked from 'package:tidyflow':
#> 
#>     parameters
library(tune)
#> 
#> Attaching package: 'tune'
#> 
#> The following object is masked from 'package:tidyflow':
#> 
#>     parameters
library(rsample)
library(modeldata)
data(stackoverflow)

tflow <-
  stackoverflow %>%
  select(Salary, Country) %>%
  mutate(Country = as.character(Country)) %>%
  tidyflow(seed = 23151) %>%
  plug_formula(Salary ~ Country) %>%
  plug_resample(vfold_cv) %>%
  plug_model(set_engine(linear_reg(), "lm"))

ctrl <- control_tidyflow(
  control_resamples = control_resamples(extract = identity)
)

res <- tflow %>% fit(control = ctrl)
head(pull_tflow_fit_tuning(res)$.extracts[[1]][[1]][[1]]$fit$fit$fit$model)
#>          ..y CountryGermany CountryIndia CountryUnited Kingdom
#> 1 100000.000              0            0                     1
#> 2 130000.000              0            0                     0
#> 3 175000.000              0            0                     0
#> 4  64516.129              1            0                     0
#> 5   6636.324              0            1                     0
#> 6  65000.000              0            0                     0
#>   CountryUnited States
#> 1                    0
#> 2                    1
#> 3                    1
#> 4                    0
#> 5                    0
#> 6                    1

cimentadaj commented 4 years ago

Now that fitting is done through workflows, this is very simple. If using a formula, depending on the model, character/factor expansion might take place automatically. The user can now override this with a blueprint passed to plug_formula. For details and examples, read the docs of [plug_formula](https://cimentadaj.github.io/tidyflow/reference/plug_formula.html). However, for formulas there is no distinction between characters and factors: both get the same transformation.

For greater control, the user can just supply a recipe and use functions like step_dummy to select which columns get expanded.
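
As a rough illustration of that last point, here is a minimal sketch using recipes directly, outside of tidyflow. The data frame `cars` and the choice of `gear` and `cyl` as categorical columns are assumptions made for the example. Only `gear` is selected in `step_dummy()`, so `cyl` is left as a single factor column:

```r
library(recipes)

cars <- mtcars
cars$gear <- factor(cars$gear)
cars$cyl <- factor(cars$cyl)

# Expand only `gear` into dummy columns; `cyl` is not selected.
rec <- recipe(mpg ~ ., data = cars)
rec <- step_dummy(rec, gear)

# Estimate the recipe and return the processed training data.
baked <- bake(prep(rec), new_data = NULL)
names(baked)
```

A recipe like this could presumably be supplied through plug_recipe in the same way as the plain recipe shown earlier in the thread.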