Open spsanderson opened 3 years ago
Hi @spsanderson ,
If you are getting this error, it is possibly because a variable in your base recipe has a type that XGBoost does not accept. Because the problem is in the base recipe, it is carried over to every recipe built on top of it. For example, if you have a date field or a factor, you should remove the date and convert the factor to dummy variables.
Here is an example reproducing your problem:
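library(tidymodels)  # rsample, recipes, parsnip, workflowsets
library(modeltime)   # modeltime_fit_workflowset()
library(timetk)      # m4_monthly

splits <- initial_time_split(
    m4_monthly
    , prop = 0.8
    , cumulative = TRUE
)

# Keeps the date and factor columns, which XGBoost cannot handle
recipe_base_bad <- recipe(value ~ ., data = training(splits))

# Drops the date and one-hot encodes the nominal predictors
recipe_base_ok <- recipe(value ~ ., data = training(splits)) %>%
    step_rm(date) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)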
model_spec_boost <- boost_tree(
    mode = "regression",
    mtry = round(sqrt(ncol(training(splits)) - 1), 0),
    trees = round(sqrt(nrow(training(splits)) - 1), 0),
    min_n = round(sqrt(ncol(training(splits)) - 1), 0),
    tree_depth = round(sqrt(ncol(training(splits)) - 1), 0),
    learn_rate = 0.3,
    loss_reduction = 0.01
) %>%
    set_engine("xgboost")
wfsets <- workflow_set(
    preproc = list(
        base = recipe_base_ok
    ),
    models = list(
        model_spec_boost
    ),
    cross = TRUE
)
wf_fits <- wfsets %>%
    modeltime_fit_workflowset(
        data = training(splits)
        , control = control_fit_workflowset(
            allow_par = FALSE
            , verbose = TRUE
        )
    )
Hope it helps
@AlbertoAlmuinha I expect that 4 of the models will fail, as they contain a date feature and some have other non-numeric features. But the recipe that I am using in the second example, recipe_num_only, works on its own in a modeltime workflow but not inside modeltime_fit_workflowset(), which is confusing to me. Working on its own but not in the workflowsets does not make sense. The error itself is also confusing: to me it says it literally cannot find my splits object.
@spsanderson It's difficult to say without a reprex to play a bit with the data. Yeah, the error description is not the best one, but that part is difficult to control...
data_tbl.xlsx juiced_recipe.xlsx
Please see attached data to help
Which recipe did you use to create the attached Excel files? They don't match any of the recipes in the first message.
recipe_num_only
I don't get that result with recipe_num_only. The first message recipe is:
recipe_num_only <- recipe_pca %>%
    step_rm(-value, -all_numeric_predictors())
In the last step you are keeping a "value" column, but there is no value column in the attached data, so I'm missing something here.
@spsanderson @mdancho84
We definitely need to take a look at this because something is going on. It should work (and in fact, if you run it sequentially it works correctly), but I think the parallel functionality is not working correctly for some reason. We need to check this.
For the moment, you can use allow_par = FALSE to make it work.
@AlbertoAlmuinha thanks I will try it right now and let you know
That did it.
@spsanderson ok, I found the problem...Right now you can't define the model based on other variables (splits in this case):
model_spec_boost <- boost_tree(
    mode = "regression",
    mtry = round(sqrt(ncol(training(splits)) - 1), 0),
    trees = round(sqrt(nrow(training(splits)) - 1), 0),
    min_n = round(sqrt(ncol(training(splits)) - 1), 0),
    tree_depth = round(sqrt(ncol(training(splits)) - 1), 0),
    learn_rate = 0.3,
    loss_reduction = 0.01
) %>%
    set_engine("xgboost")
What happens is that these variables are not sent to the nodes where the computation is performed, so the calculation fails there because it cannot find them. If you replace the variables with plain numbers, you will see that everything works correctly:
model_spec_boost <- boost_tree(
    mode = "regression",
    mtry = 1,
    trees = 8,
    min_n = 1,
    tree_depth = 1,
    learn_rate = 0.3,
    loss_reduction = 0.01
) %>%
    set_engine("xgboost")
Maybe we can find a solution to improve this situation,
Regards
Ahhhhhhhh ok, I don't think a solution to this is necessary, I think it would be better if those settings were made outside of the spec process.
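A minimal sketch of making those settings outside the spec, for readers following along. One caveat: parsnip captures its arguments lazily, so passing a variable by name would presumably still leave a reference that the parallel workers cannot resolve. The tidymodels parallel-processing documentation suggests injecting the value with `!!` so the spec stores a plain number. The `train_tbl` stand-in (here `mtcars`) and the names `n_feat` and `n_trees` are assumptions for illustration, and this has not been tested against the modeltime case in this thread.

```r
library(parsnip)

# Stand-in for training(splits); any data frame works for this illustration.
train_tbl <- mtcars

# Compute the settings once, outside the model spec.
n_feat  <- round(sqrt(ncol(train_tbl) - 1), 0)
n_trees <- round(sqrt(nrow(train_tbl) - 1), 0)

# `!!` injects the current values, so the spec holds numbers
# rather than references to n_feat, n_trees or splits.
model_spec_boost <- boost_tree(
    mode           = "regression",
    mtry           = !!n_feat,
    trees          = !!n_trees,
    min_n          = !!n_feat,
    tree_depth     = !!n_feat,
    learn_rate     = 0.3,
    loss_reduction = 0.01
) %>%
    set_engine("xgboost")

print(model_spec_boost)
```

Printing the spec should show numeric values for the main arguments instead of the expressions that produced them.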
What do you think about this @mdancho84? We could add an "export" argument to modeltime_fit_workflowset() that takes an object (or a named list if multiple objects are required) and exports it to the nodes. The implementation would be quite easy.
Or do you prefer to leave things as they are?
It's a unique case but we can add an exports arg inside of the control objects. That might help.
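To illustrate the mechanism discussed above, here is a small base R sketch using the `parallel` package. It is not modeltime's implementation, and the variable `n_trees` is a hypothetical stand-in for an object like `splits`. It shows that a PSOCK worker cannot see a variable defined only in the main session, and that explicitly exporting the variable fixes this, which is roughly what an exports option would do.

```r
library(parallel)

# Defined only in the main R session.
n_trees <- 8

cl <- makePSOCKcluster(2)

# Workers start as fresh R sessions, so n_trees is unknown there
# (when this script is run at top level) and the call errors.
res_fail <- tryCatch(
    parLapply(cl, 1:2, function(i) n_trees * i),
    error = function(e) conditionMessage(e)
)
print(res_fail)

# Explicitly send the variable to every worker, then retry.
clusterExport(cl, "n_trees", envir = environment())
res_ok <- parLapply(cl, 1:2, function(i) n_trees * i)
print(unlist(res_ok))

stopCluster(cl)
```

After the export, the second call returns a result for each worker task, whereas the first attempt reports that the object was not found.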
I am getting an error of
Error: Error in analysis(x): object 'splits' not found
Splits and Features:
Make the model_spec
Gives the error:
Yet when I do the following:
I get a plot