Closed simonpcouch closed 1 year ago
Hi Simon, thanks for this report.
A quick workaround is to state variable_splits explicitly:
# ok
ingredients::ceteris_paribus(
explainer_rf,
explainer_rf$data,
variable_splits = list(Year_Built=unique(vip_train$Year_Built))
)
The error is caused by the default calculate_variable_split():
# error
ingredients::ceteris_paribus(
explainer_rf,
explainer_rf$data,
variable_splits = ingredients:::calculate_variable_split.default(explainer_rf$data, variables=c("Year_Built"))
)
# float, not an integer
ingredients:::calculate_variable_split.default(explainer_rf$data, variables=c("Year_Built"))
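To illustrate why the default split comes back as a float: base R's quantile() interpolates between observations and returns doubles even when the input is integer. The snippet below is a minimal sketch with made-up years. It assumes the default split is quantile-based and does not reproduce the package's code.

```r
# Hypothetical integer column, standing in for Year_Built
year_built <- c(1990L, 1995L, 2001L, 2010L)

# Quantile-based grid, analogous to a split with grid_points = 5
probs <- seq(0, 1, length.out = 5)
splits <- unname(stats::quantile(year_built, probs = probs))

typeof(year_built)          # the input is stored as integer
typeof(splits)              # the split points are stored as double
any(splits %% 1 != 0)       # TRUE when interpolation produced fractional years
```

By contrast, unique(year_built) keeps the integer type, which is why the explicit variable_splits workaround avoids the problem.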
Fixing this issue requires adding !is.integer(selected_column) to
https://github.com/ModelOriented/ingredients/blob/a44ad390cf07c4a9d520ce1213e4c57ae9164586/R/calculate_variable_profile.R#L85
which would lead to treating integer features like categorical features with unique().
# ok
ingredients::ceteris_paribus(
explainer_rf,
explainer_rf$data,
variable_splits = list(Year_Built=unique(vip_train$Year_Built))
)
@pbiecek what do you think?
Thanks for tracking down this tricky error!
Treating an integer as a categorical variable is a good idea, as long as it doesn't have too many different levels (e.g. someone has a column of IDs with 10000 different values in it; that would kill our profile calculation). So maybe add an extra condition to the if statement: if the variable is an integer and the number of different values is under 100, then it is treated as categorical (i.e. unique()), but if there are a lot of values it is converted to float?
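The condition suggested above could look roughly like the sketch below. The function name sketch_variable_split and the max_levels argument are hypothetical, chosen only for illustration; this is not the code in ingredients, and the quantile branch is an assumption about how numeric splits are built.

```r
# Hypothetical helper: choose split points for a single column.
sketch_variable_split <- function(column, grid_points = 101, max_levels = 100) {
  column <- column[!is.na(column)]
  n_unique <- length(unique(column))

  # Integer columns with few distinct values are handled like categoricals
  treat_as_categorical <- !is.numeric(column) ||
    (is.integer(column) && n_unique < max_levels)

  if (treat_as_categorical) {
    # Keeps the original type (integer stays integer)
    sort(unique(column))
  } else {
    # Quantile grid; the result is double even for integer input
    probs <- seq(0, 1, length.out = grid_points)
    unique(unname(stats::quantile(column, probs = probs)))
  }
}

sketch_variable_split(c(3L, 1L, 2L, 1L))   # few levels: integer split points
sketch_variable_split(1:10000)             # ID-like column: quantile grid
```

With this rule, an integer column with 100 or more distinct values still takes the quantile path, so the threshold itself decides whether a given column keeps its integer type.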
I implemented the fix, and it actually still fails ungracefully in the above scenario, because there are 113 unique year values.
This got me thinking that with categorical variables, we don't have a threshold on how many unique values there should be.
We can either:
1. [...] grid_points (=101 by default):
2. [...] grid_points when the threshold is reached. This breaks some previous code but improves the quality of the user's experience interacting with our API.
Great idea, let's do 1 with an additional warning if there are more than 201 unique values.
This is hopefully fixed now on CRAN
The tidymodels team recently introduced support for finer-grained numeric classes in recipes. A user recently pointed out on our community forums that this introduced issues with model_profile() in some cases. Here's a reprex:
Created on 2022-12-05 with reprex v2.0.2
The issue arises here, where the numeric split_points are dropped into the (possibly) integer variable:
https://github.com/ModelOriented/ingredients/blob/a44ad390cf07c4a9d520ce1213e4c57ae9164586/R/calculate_variable_profile.R#L39
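The type change described here can be reproduced in base R without the package. The sketch below uses a made-up data frame and made-up split points; it only shows that assigning numeric values into an integer column silently turns it into a double column, which is presumably what a stricter downstream type check then rejects.

```r
# Hypothetical training data with an integer column
train <- data.frame(
  Year_Built = c(1990L, 2001L, 2010L),
  Lot_Area   = c(8000, 9500, 7200)
)

# Numeric (double) split points, as a quantile-based grid would produce
split_points <- c(1990, 1995.5, 2001)

# Replicate one observation once per split point, then overwrite the variable
new_observation <- train[1, , drop = FALSE]
profile <- new_observation[rep(1, length(split_points)), , drop = FALSE]
profile[, "Year_Built"] <- split_points

typeof(train$Year_Built)     # integer in the original data
typeof(profile$Year_Built)   # double after the assignment
```

Even whole-number split points such as c(1990, 2001) are stored as doubles, so the column type changes unless the split points themselves are integers.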