Closed topepo closed 2 years ago
One other thing that we thought of: how do you use the non-standard roles? If you want them around for specific data sets, you might avoid adding them to the recipe and then get them back using the augment() method on the workflow.
We're not sure of the context.
I'm excited to see hardhat being updated :)
I do not have a strong opinion regarding option 1 versus 2 – I do not know enough about blueprints, but I like the idea. I'm also up for updating this code. Some thoughts:
Thanks a lot for the help!
(I'm writing a second edition of our modeling book - that literature would be good to read!)
So I suggest using .fit_pre(), but not including the id columns in the recipe (and getting them back later).
Here's an example:
library(tidymodels)
tidymodels_prefer()
theme_set(theme_bw())
data(cells)
# just to make the output smaller:
cells <- cells[, 1:10]
# .fit_pre() with non-standard roles
rec <-
recipe(class ~ ., data = cells) %>%
update_role(case, new_role = "case") %>%
step_pca(all_numeric_predictors())
# with the new version of workflows (1.0.0) targeted for release next week ...
custom_bp <- hardhat::default_recipe_blueprint(bake_dependent_roles = "case")
wflow <-
workflow() %>%
add_recipe(rec, blueprint = custom_bp) %>%
add_model(logistic_reg())
wflow_fit <- wflow %>% .fit_pre(data = cells)
wflow_fit %>%
extract_recipe() %>%
bake(new_data = cells, all_predictors()) %>%
ncol()
#> [1] 5
wflow_fit <- wflow_fit %>% .fit_model(control = control_workflow())
# Don't include case (the non-standard role) in the recipe and get it back later:
rec <-
recipe(class ~ ., data = cells %>% select(-case)) %>%
step_pca(all_numeric_predictors())
wflow <-
workflow() %>%
add_recipe(rec) %>%
add_model(logistic_reg())
wflow_fit <- wflow %>% .fit_pre(data = cells)
wflow_fit %>%
extract_recipe() %>%
bake(new_data = cells, all_predictors()) %>%
ncol()
#> [1] 5
# get model fit
wflow_fit <- wflow_fit %>% .fit_model(control = control_workflow())
# get the id's values back with augment
pred <- wflow_fit %>% augment(new_data = cells)
names(pred)
#> [1] "case" "class"
#> [3] "angle_ch_1" "area_ch_1"
#> [5] "avg_inten_ch_1" "avg_inten_ch_2"
#> [7] "avg_inten_ch_3" "avg_inten_ch_4"
#> [9] "convex_hull_area_ratio_ch_1" "convex_hull_perim_ratio_ch_1"
#> [11] ".pred_class" ".pred_PS"
#> [13] ".pred_WS"
Created on 2022-06-17 by the reprex package (v2.0.1)
This would affect any recipes with the non-standard role (so random forest too, I think).
I'm happy to read the second edition of the book – when will it be out?
I am not sure what the best way forward is here.
It will take me some time to implement this, and I would probably need the new version of hardhat to test it out on.
Should I perhaps submit a new version of the text package to CRAN where I just force hardhat == 1.0.0 (to give time to update the package for the newer hardhat version)?
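If that route were taken, the pin would be a version requirement in DESCRIPTION, something like this (a sketch; whether to use == or >= depends on whether later hardhat versions should be allowed):

```
Imports:
    hardhat (== 1.0.0)
```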
We made some changes on the recipes side that remove the failure for your package, so we should be good (without you having to do anything else).
I'm happy to read the second edition of the book – when will it be out?
I think it'll take us a year or two, plus the 6 months or so for production. Hopefully less than that, but we'll see!
New versions of recipes and workflows are going to CRAN soon (maybe this week). There are some breaking changes that affect text.
In the new versions, if you are using non-standard roles, we advise declaring them via a blueprint. For example:
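A condensed sketch, using the cells data and the bake_dependent_roles argument from the reprex earlier in the thread (new in hardhat 1.0.0):

```r
library(tidymodels)
data(cells)

# A blueprint declaring that the "case" role is needed at bake() time
# (bake_dependent_roles is new in hardhat 1.0.0):
custom_bp <- hardhat::default_recipe_blueprint(bake_dependent_roles = "case")

rec <-
  recipe(class ~ ., data = cells) %>%
  update_role(case, new_role = "case") %>%
  step_pca(all_numeric_predictors())

# Pass the blueprint when adding the recipe to the workflow:
wflow <-
  workflow() %>%
  add_recipe(rec, blueprint = custom_bp) %>%
  add_model(logistic_reg())
```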
I forked text and was going to submit a PR, but it's not as trivial as I thought it would be. This is mostly due to the code where the recipe is prepared and the number of predictors is counted. That data is used to set the model type. The issue is that prep() will error (since it calls bake()) and there are non-standard roles.
There are two options at this point (as far as I can tell). I'll be happy to put in a PR and would prefer option 2, but wanted to see what your thoughts would be.
option 1
First, we could stuff the initial recipe used to determine the number of predictors into a workflow with a blueprint, then use workflows::.fit_pre() to prep() the recipe. You can then get the number of predictors.
option 2
Second is to not count the predictors and tune both the recipe and model. I haven't completely examined the code, but it looks like a glmnet model is being tuned over penalty and mixture (for the non-random forest bits). I think that you need to count the predictors so that glmnet won't fail on a single predictor (I know, weird). Alternatively, you could tag num_comp for tuning, adjust the parameter range so at least 2 PCs are chosen, and tune it at the same time as glmnet. That might yield better results too.
Short aside: I'm not sure that you even need the PCs for the generalized linear models. If the PCs are generated to avoid collinear predictors, glmnet is very effective at handling that with the original data (that's what it was built to do). Decomposing into PCs might lose some performance and is a bit redundant since you are using glmnet. Of course, I don't know a lot about what your goals are, so I might be missing something.
Arguably, the same process could be used for the other model(s). There's not much computational overhead for tuning the model parameters along with the recipe.