Closed cimentadaj closed 4 years ago
Odd behavior: the same problem can be seen here without a resample. I'm trying to fit a bagged tree model with the `pisa` data. Without specifying a resample, the character column `VER_DAT` gets expanded and the fit fails with an error. I believe this is because `bag_tree` uses bootstraps behind the scenes, making this the same error as above.
However, this intuition doesn't fit the results, since I can convert a column of `mtcars` to character and the same model fits, expanding the column first. Even more surprising, removing the `VER_DAT` column from `pisa` allows the model to run without a problem:
library(baguette)
#> Loading required package: parsnip
library(tidyflow)
library(rsample)
data_link <- "https://raw.githubusercontent.com/cimentadaj/ml_socsci/master/data/pisa_us_2018.csv"
pisa <- read.csv(data_link)
mod1 <- set_engine(bag_tree(mode = "regression"),
"rpart",
times = 3)
tflow <-
pisa %>%
tidyflow(seed = 23151) %>%
plug_split(initial_split) %>%
plug_formula(math_score ~ .) %>%
plug_model(mod1)
# The error is undefined columns. After inspecting what's happening,
# it is because the column VER_DAT gets expanded because it's a character
fit(tflow)
#> Error: All of the models failed. An example message was:
#> Error in `[.data.frame`(m, labs) : undefined columns selected
#> Timing stopped at: 0.844 0.02 0.864
# Works now
pisa$VER_DAT <- NULL
tflow %>%
replace_data(pisa) %>%
fit()
#> ══ Tidyflow [trained] ══════════════════════════════════════════════════════════
#> Data: 4.84K rows x 501 columns
#> Split: initial_split w/ default args
#> Formula: math_score ~ .
#> Resample: None
#> Grid: None
#> Model:
#> Bagged Decision Tree Model Specification (regression)
#>
#> Main Arguments:
#> cost_complexity = 0
#> min_n = 2
#>
#> Engine-Specific Arguments:
#> times = 3
#>
#> Computational engine: rpart
#>
#> ══ Results ═════════════════════════════════════════════════════════════════════
#>
#>
#> Fitted model:
#> Bagged CART (regression with 3 members)
#>
#> Variable importance scores include:
#>
#> # A tibble: 500 x 4
#>
#> ...
#> and 14 more lines.
# This is the same problem as the first comment in the issue.
# However, why doesn't this fail?
mtcars$gear <- as.character(mtcars$gear)
tflow %>%
replace_data(mtcars) %>%
replace_formula(mpg ~ .) %>%
fit() %>%
pull_tflow_fit()
#> parsnip model object
#>
#> Fit time: 507ms
#> Bagged CART (regression with 3 members)
#>
#> Variable importance scores include:
#>
#> # A tibble: 12 x 4
#> term value std.error used
#> <chr> <dbl> <dbl> <int>
#> 1 disp 748. 178. 3
#> 2 wt 672. 192. 3
#> 3 hp 636. 68.5 3
#> 4 drat 563. 121. 3
#> 5 cyl 218. 51.9 3
#> 6 qsec 122. 30.4 3
#> 7 carb 106. 99.1 2
#> 8 vs 65.3 23.0 3
#> 9 gear4 44.5 38.8 3
#> 10 gear3 22.9 21.6 3
#> 11 am 5.25 4.88 3
#> 12 gear5 0.016 0 1
# In fact, you can see that the column gear gets expanded and
# fitted correctly
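To look at the expansion in isolation, here is a minimal base R sketch that uses neither `tidyflow` nor `baguette`. It only shows how `model.matrix()` turns a character column into indicator columns; I'm not claiming this is exactly what happens internally in the failing `pisa` fit. The object names (`cars_chr`, `mm_intercept`, `mm_no_intercept`) are made up for the example.

```r
# Copy mtcars and turn `gear` into a character column
cars_chr <- mtcars
cars_chr$gear <- as.character(cars_chr$gear)

# With an intercept, the character column is treated as a factor and
# expanded into K - 1 indicator columns (the first level is the baseline)
mm_intercept <- model.matrix(mpg ~ ., data = cars_chr)
colnames(mm_intercept)

# Without an intercept, the factor gets one indicator column per level
mm_no_intercept <- model.matrix(mpg ~ . - 1, data = cars_chr)
colnames(mm_no_intercept)
```

The second call yields `gear3`, `gear4` and `gear5`, which matches the terms listed in the variable importance table above.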
`tidymodels` has worked quite a bit on one-hot encoding. Here's a summary of what I've read and how it impacts `tidyflow`.
library(hardhat)
mtcars$gear <- as.character(mtcars$gear)
res <- mold(formula = mpg ~ .,
data = mtcars,
blueprint = hardhat::default_formula_blueprint())
res
#> $predictors
#> # A tibble: 32 x 12
#> cyl disp hp drat wt qsec vs am gear3 gear4 gear5 carb
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 6 160 110 3.9 2.62 16.5 0 1 0 1 0 4
#> 2 6 160 110 3.9 2.88 17.0 0 1 0 1 0 4
#> 3 4 108 93 3.85 2.32 18.6 1 1 0 1 0 1
#> 4 6 258 110 3.08 3.22 19.4 1 0 1 0 0 1
#> 5 8 360 175 3.15 3.44 17.0 0 0 1 0 0 2
#> 6 6 225 105 2.76 3.46 20.2 1 0 1 0 0 1
#> 7 8 360 245 3.21 3.57 15.8 0 0 1 0 0 4
#> 8 4 147. 62 3.69 3.19 20 1 0 0 1 0 2
#> 9 4 141. 95 3.92 3.15 22.9 1 0 0 1 0 2
#> 10 6 168. 123 3.92 3.44 18.3 1 0 0 1 0 4
#> # … with 22 more rows
#>
#> $outcomes
#> # A tibble: 32 x 1
#> mpg
#> <dbl>
#> 1 21
#> 2 21
#> 3 22.8
#> 4 21.4
#> 5 18.7
#> 6 18.1
#> 7 14.3
#> 8 24.4
#> 9 22.8
#> 10 19.2
#> # … with 22 more rows
#>
#> $blueprint
#> Formula blueprint:
#>
#> # Predictors: 10
#> # Outcomes: 1
#> Intercept: FALSE
#> Novel Levels: FALSE
#> Indicators: traditional
#>
#> $extras
#> $extras$offset
#> NULL
It seems that at this moment, this is not entirely stable (https://github.com/tidymodels/tune/issues/262) but it's very close.
How does this affect `tidyflow`? Well, once https://github.com/tidymodels/tune/issues/262 is fixed, there should be no problem. Users can specify character columns and they will not get expanded.
`indicators` set to `none`:
library(hardhat)
mtcars$gear <- as.factor(mtcars$gear)
res <- mold(formula = mpg ~ .,
data = mtcars,
blueprint = hardhat::default_formula_blueprint(indicators = "none"))
res
* What if a user wants to expand factor columns but not character columns? It is not clear to me how `tidymodels` handles this. It seems that with the `indicators` argument we can suppress or allow one-hot encoding, but we cannot choose selectively whether factors and characters each get their own treatment.
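As a rough check of the factor case, the sketch below compares the predictor columns returned by `mold()` under the two `indicators` settings. It assumes a `hardhat` version where `indicators` takes the strings `"traditional"` and `"none"`; the object names (`cars_fct`, `res_traditional`, `res_none`) are only for illustration.

```r
library(hardhat)

# Copy mtcars and make `gear` a factor
cars_fct <- mtcars
cars_fct$gear <- as.factor(cars_fct$gear)

# Default blueprint: the factor is expanded into indicator columns
res_traditional <- mold(
  mpg ~ .,
  data = cars_fct,
  blueprint = default_formula_blueprint(indicators = "traditional")
)

# indicators = "none": the factor is passed through unchanged
res_none <- mold(
  mpg ~ .,
  data = cars_fct,
  blueprint = default_formula_blueprint(indicators = "none")
)

# Compare which `gear` columns each blueprint produces
grep("^gear", names(res_traditional$predictors), value = TRUE)
grep("^gear", names(res_none$predictors), value = TRUE)
```

This only covers a factor column; whether a character column is handled the same way under `"none"` is the open question in the bullet above.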
I was confused in the previous comment. It seems that, regardless of whether a column is a character or a factor, both classes get expanded with one-hot encoding within models. However, there is no error now, since this happens later, inside `tune_*` and `fit_*`. The previous error was coming from `hardhat`.
Now that the error is gone, for `tidyflow` to be user friendly we would ideally need an argument that signals whether to use one-hot encoding for characters, since we assume one-hot encoding for factors is the default. Currently, `fit_resamples` automatically turns character columns into dummies:
library(tidyflow)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(parsnip)
library(hardhat)
library(dials)
#> Loading required package: scales
#>
#> Attaching package: 'dials'
#> The following object is masked from 'package:tidyflow':
#>
#> parameters
library(tune)
#>
#> Attaching package: 'tune'
#>
#> The following object is masked from 'package:tidyflow':
#>
#> parameters
library(rsample)
library(modeldata)
data(stackoverflow)
tflow <-
stackoverflow %>%
select(Salary, Country) %>%
mutate(Country = as.character(Country)) %>%
tidyflow(seed = 23151) %>%
plug_formula(Salary ~ Country) %>%
plug_resample(vfold_cv) %>%
plug_model(set_engine(linear_reg(), "lm"))
ctrl <- control_tidyflow(
control_resamples = control_resamples(extract = identity)
)
res <- tflow %>% fit(control = ctrl)
head(pull_tflow_fit_tuning(res)$.extracts[[1]][[1]][[1]]$fit$fit$fit$model)
#> ..y CountryGermany CountryIndia CountryUnited Kingdom
#> 1 100000.000 0 0 1
#> 2 130000.000 0 0 0
#> 3 175000.000 0 0 0
#> 4 64516.129 1 0 0
#> 5 6636.324 0 1 0
#> 6 65000.000 0 0 0
#> CountryUnited States
#> 1 0
#> 2 1
#> 3 1
#> 4 0
#> 5 0
#> 6 1
[x] Check how `workflows` implemented the flag for switching one-hot encoding on and off, to see how one-hot encoding can be suppressed for character columns. Ideally, we can have a single argument, something like `chr_one_hot = FALSE` or `TRUE`. This would mean that the data being used for fitting is not expanded to dummies.
[x] Add a test that checks that character and factor columns are expanded
[x] Add a test that checks that, when signalling `chr_one_hot`, character columns are not expanded
[x] Update the docs of `plug_formula` to specify that characters and factors are always expanded except when `chr_one_hot` is specified
[x] How does `recipes` play into all of this? Do characters get expanded? Do factors get expanded? Can we change this with `chr_one_hot`?
[x] Since all of the errors produced in this issue are now fixed, we need to update the DESCRIPTION to the latest GitHub versions of the dependencies
Now that fitting is done through `workflows`, this is very simple. When using a formula, character/factor expansion might take place automatically, depending on the model. The user can now override this with a blueprint passed to `plug_formula`. For details and examples, read the docs of [plug_formula](https://cimentadaj.github.io/tidyflow/reference/plug_formula.html). However, for formulas there is no distinction between characters and factors: both get the same transformation.
For greater control, the user can supply a recipe and use functions like `step_dummy` to select which columns get expanded.
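A minimal sketch of the recipe route, using `recipes` on its own rather than plugged into a `tidyflow`. The data preparation and object names (`cars_df`, `rec`, `baked`) are assumptions for the example: two columns are made factors and only one of them is passed to `step_dummy()`.

```r
library(recipes)

# Two categorical predictors; only `gear` should be expanded
cars_df <- mtcars
cars_df$gear <- factor(cars_df$gear)
cars_df$cyl <- factor(cars_df$cyl)

# Select the column to expand explicitly in step_dummy()
rec <- recipe(mpg ~ ., data = cars_df)
rec <- step_dummy(rec, gear)

# Estimate the recipe and return the processed training data
baked <- bake(prep(rec, training = cars_df), new_data = NULL)
names(baked)
```

Here `gear` is replaced by its dummy columns while `cyl` stays a single factor column, which is the selective control that the formula interface does not offer.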
When passing a formula to a `tidyflow` that has character variables, `hardhat::mold` expands each character column into `N` columns, one per category in the variable. When this is passed to `fit_resamples`, it raises an error because these new columns weren't specified in the formula in the first place:
This doesn't happen with a recipe because the recipe doesn't convert character columns to one-hot encodings:
Reported at https://github.com/tidymodels/hardhat/issues/139. One solution would be to convert characters to factors before passing the data to `mold` and then convert them back to characters in the result of `mold`. However, before doing that, I want to make sure `mold` is working correctly. Perhaps this is just an easy fix on their side.
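The conversion workaround could look roughly like the base R sketch below. Both helpers (`chr_to_factor()` and `factor_to_chr()`) are hypothetical and not part of `tidyflow` or `hardhat`; the call to `mold` that would sit between them is left out on purpose.

```r
# Hypothetical helper: convert character columns to factors and
# remember which columns were converted
chr_to_factor <- function(data) {
  chr_cols <- names(data)[vapply(data, is.character, logical(1))]
  data[chr_cols] <- lapply(data[chr_cols], factor)
  list(data = data, chr_cols = chr_cols)
}

# Hypothetical helper: convert the remembered columns back to character,
# skipping any that are no longer present
factor_to_chr <- function(data, chr_cols) {
  present <- intersect(chr_cols, names(data))
  data[present] <- lapply(data[present], as.character)
  data
}

df <- data.frame(
  y = c(1, 2, 3),
  g = c("a", "b", "a"),
  stringsAsFactors = FALSE
)

converted <- chr_to_factor(df)
# ... the data would go through mold() here ...
restored <- factor_to_chr(converted$data, converted$chr_cols)
```

The round trip only restores the original columns if the step in between leaves them unexpanded, so this would not help on its own when `mold` creates indicator columns from the factors.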