OscarKjell / text

Using Transformers from HuggingFace in R
https://r-text.org

non-standard roles #21

Closed topepo closed 2 years ago

topepo commented 2 years ago

New versions of recipes and workflows are going to CRAN soon (maybe this week). There are some breaking changes that affect text.

In the new versions, if you are using non-standard roles, we advise declaring those roles via a blueprint. For example:

library(tidymodels)

# data_train and id_nr are placeholders; "id variable" is the non-standard role
xy_recipe <- data_train %>%
  recipe(y ~ .) %>%
  update_role(id_nr, new_role = "id variable") %>%
  update_role(-id_nr, new_role = "predictor") %>%
  update_role(y, new_role = "outcome")

bp <- hardhat::default_recipe_blueprint(bake_dependent_roles = "id variable")

workflow() %>%
  add_recipe(xy_recipe, blueprint = bp)

I forked text and was going to submit a PR, but it's not as trivial as I thought it would be. This is mostly due to the code where the recipe is prepared and the number of predictors is counted; that count is used to set the model type. The issue is that prep() will error (since it calls bake()) when there are non-standard roles.

There are two options at this point (as far as I can tell). I'll be happy to put in a PR and would prefer option 2 but wanted to see what your thoughts would be.

option 1

First, we could put the initial recipe used to determine the number of predictors into a workflow with a blueprint, and then use workflows::.fit_pre() to prep() the recipe. You can then get the number of predictors, as sketched below.
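
A minimal sketch of that idea, assuming the xy_recipe and bp objects from the example above (data_train, id_nr, and y are placeholders, and linear_reg() just stands in for whatever model is ultimately fit):

library(tidymodels)

wf <- workflow() %>%
  add_recipe(xy_recipe, blueprint = bp) %>%
  add_model(linear_reg())

# .fit_pre() preps the recipe inside the workflow; the blueprint takes care of
# the non-standard role, and the model is not fit yet
wf_prepped <- workflows::.fit_pre(wf, data = data_train)

# count the predictors from the prepped recipe
n_predictors <- wf_prepped %>%
  extract_recipe() %>%
  bake(new_data = data_train, all_predictors()) %>%
  ncol()

That count could then drive the existing model-type logic in text.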

option 2

Second is to not count the predictors and instead tune both the recipe and the model. I haven't completely examined the code, but it looks like a glmnet model is being tuned over penalty and mixture (for the non-random-forest bits). I think you currently need to count the predictors so that glmnet won't fail on a single predictor (I know, weird). Alternatively, you could tag num_comp for tuning, adjust the parameter range so that at least 2 PCs are chosen, and tune it at the same time as glmnet (a sketch follows below). That might yield better results too.
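
A rough sketch of what option 2 might look like, again assuming the xy_recipe and bp objects from above; the PCA step, the num_comp range, the fold count, and the grid size are all placeholders:

library(tidymodels)

tune_rec <- xy_recipe %>%
  step_pca(all_numeric_predictors(), num_comp = tune())

glmnet_spec <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

wf <- workflow() %>%
  add_recipe(tune_rec, blueprint = bp) %>%
  add_model(glmnet_spec)

# restrict num_comp so that at least two components are retained
param_info <- extract_parameter_set_dials(wf) %>%
  update(num_comp = num_comp(c(2, 20)))

set.seed(1)
folds <- vfold_cv(data_train, v = 5)
tuned <- tune_grid(wf, resamples = folds, grid = 20, param_info = param_info)

This way num_comp, penalty, and mixture are all tuned in one pass and the predictor count never has to be computed up front.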

Short aside: I'm not sure that you even need the PCs for the generalized linear models. If the PCs are generated to avoid collinear predictors, glmnet is very effective at handling that with the original data (that's what it was built to do). Decomposing into PCs might lose some performance and is a bit redundant since you are using glmnet. Of course, I don't know a lot about what your goals are, so I might be missing something.

Arguably, the same process could be used for the other model(s). There's not much computational overhead in tuning the model parameters along with the recipe.

topepo commented 2 years ago

One other thing that we thought of... how do you use the non-standard roles? If you want them around for specific data sets, you might avoid adding them to the recipe and then get them back using the augment() method on the workflow.

We're not sure of the context.

OscarKjell commented 2 years ago

I'm excited to see hardhat being updated :)

I do not have a strong opinion regarding option 1 versus 2 – I do not know enough about blueprints, but I like the idea. I'm also up for updating this code. Some thoughts:

  1. There is literature finding that using PCA before ridge regression can improve predictive accuracy (but selecting at least two PCA components sounds good).
  2. It is rather important to know how many components were used, which is part of the result output.
  3. Not sure I fully understand the last comment – but at the moment the roles are mostly for keeping the ID numbers in there; they could potentially be added back later (as long as that works when removing NAs – and can be combined with the original data).
  4. I reckon (some of) these changes are needed for the textTrainRandomForest function as well?

Thanks a lot for the help!

topepo commented 2 years ago

(I'm writing a second edition of our modeling book - that literature would be good to read!)

So I suggest either using .fit_pre() with a blueprint, or not including the id columns in the recipe and getting them back later.

Here's an example:

library(tidymodels)
tidymodels_prefer()
theme_set(theme_bw())

data(cells)

# just to make the output smaller:
cells <- cells[, 1:10]

# .fit_pre() with non-standard roles

rec <- 
  recipe(class ~ ., data = cells) %>% 
  update_role(case, new_role = "case") %>% 
  step_pca(all_numeric_predictors())

# with the new version of workflows (1.0.0) targeted for release next week ...
custom_bp <- hardhat::default_recipe_blueprint(bake_dependent_roles = "case")

wflow <- 
  workflow() %>% 
  add_recipe(rec, blueprint = custom_bp) %>% 
  add_model(logistic_reg())

wflow_fit <- wflow %>% .fit_pre(data = cells)
wflow_fit %>% 
  extract_recipe() %>% 
  bake(new_data = cells, all_predictors()) %>% 
  ncol()
#> [1] 5

wflow_fit <- wflow_fit %>% .fit_model(control = control_workflow())

# Don't include case (= non-standard role in recipe) and get it later:

rec <- 
  recipe(class ~ ., data = cells %>% select(-case)) %>% 
  step_pca(all_numeric_predictors())

wflow <- 
  workflow() %>% 
  add_recipe(rec) %>% 
  add_model(logistic_reg())

wflow_fit <- wflow %>% .fit_pre(data = cells)
wflow_fit %>% 
  extract_recipe() %>% 
  bake(new_data = cells, all_predictors()) %>% 
  ncol()
#> [1] 5

# get model fit
wflow_fit <- wflow_fit %>% .fit_model(control = control_workflow())

# get the id's values back with augment

pred <- wflow_fit %>% augment(new_data = cells)
names(pred)
#>  [1] "case"                         "class"                       
#>  [3] "angle_ch_1"                   "area_ch_1"                   
#>  [5] "avg_inten_ch_1"               "avg_inten_ch_2"              
#>  [7] "avg_inten_ch_3"               "avg_inten_ch_4"              
#>  [9] "convex_hull_area_ratio_ch_1"  "convex_hull_perim_ratio_ch_1"
#> [11] ".pred_class"                  ".pred_PS"                    
#> [13] ".pred_WS"

Created on 2022-06-17 by the reprex package (v2.0.1)

This would affect any recipes with non-standard roles (so random forest too, I think).
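
On Oscar's point 4: a minimal sketch of the same blueprint pattern with a random forest model, reusing the cells data and custom_bp from the example above (the ranger engine is just an assumption for illustration):

rf_rec <-
  recipe(class ~ ., data = cells) %>%
  update_role(case, new_role = "case") %>%
  step_pca(all_numeric_predictors())

rf_wflow <-
  workflow() %>%
  add_recipe(rf_rec, blueprint = custom_bp) %>%
  add_model(rand_forest(mode = "classification") %>% set_engine("ranger"))

# prep the recipe, then fit the model (requires the ranger package)
rf_fit <- rf_wflow %>%
  .fit_pre(data = cells) %>%
  .fit_model(control = control_workflow())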

OscarKjell commented 2 years ago

I'm happy to read the second edition of the book – when will it be out?

I am not sure what the best way forward is here.

It will take me some time to implement this, and I would probably need the new version of hardhat to test it against.

Should I perhaps submit a new version of the text package to CRAN where I just force hardhat == 1.0.0, to give time to update the package for the newer hardhat version?

topepo commented 2 years ago

We made some changes on the recipes side that remove the failure for your package, so we should be good (without you having to do anything else).

I'm happy to read the second edition of the book – when will it be out?

I think it'll take us a year or two, plus the 6 months or so for production. Hopefully less than that, but we'll see!