business-science / modeltime

Modeltime unlocks time series forecast models and machine learning in one framework
https://business-science.github.io/modeltime/

Error with modeltime_fit_workflowsets #124

Open spsanderson opened 3 years ago

spsanderson commented 3 years ago

I am getting the error: `Error in analysis(x): object 'splits' not found`

Splits and Features:

splits <- initial_time_split(
  data_final_tbl
  , prop = 0.8
  , cumulative = TRUE
)

# Features ----------------------------------------------------------------

recipe_base <- recipe(value ~ ., data = training(splits))

recipe_date <- recipe_base %>%
  step_timeseries_signature(date_col) %>%
  step_rm(matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)")) %>%
  step_normalize(contains("index.num"), contains("date_col_year"))

recipe_fourier <- recipe_date %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_fourier(date_col, period = 365/12, K = 1) %>%
  step_YeoJohnson(value, limits = c(0,1))

recipe_fourier_final <- recipe_fourier %>%
  step_nzv(all_predictors())

recipe_pca <- recipe_base %>%
  step_timeseries_signature(date_col) %>%
  step_rm(matches("(iso$)|(xts$)|(hour)|(min)|(sec)|(am.pm)")) %>%
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>%
  step_normalize(value) %>%
  step_fourier(date_col, period = 365/52, K = 1) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_nzv(all_predictors()) %>%
  step_pca(all_numeric_predictors(), threshold = .95)

recipe_num_only <- recipe_pca %>%
  step_rm(-value, -all_numeric_predictors())

Make the model_spec

# XGBoost -----------------------------------------------------------------

model_spec_boost <- boost_tree(
  mode  = "regression",
  mtry  = round(sqrt(ncol(training(splits)) - 1), 0),
  trees = round(sqrt(nrow(training(splits)) - 1), 0),
  min_n = round(sqrt(ncol(training(splits)) - 1), 0),
  tree_depth = round(sqrt(ncol(training(splits)) - 1), 0),
  learn_rate = 0.3,
  loss_reduction = 0.01
) %>%
  set_engine("xgboost")

wfsets <- workflow_set(
  preproc = list(
    base          = recipe_base,
    date          = recipe_date,
    fourier       = recipe_fourier,
    fourier_final = recipe_fourier_final,
    pca           = recipe_pca,
    num_only_pca  = recipe_num_only
  ),
  models = list(
    model_spec_boost
  ),
  cross = TRUE
)

parallel_start(n_cores)
wf_fits <- wfsets %>% 
  modeltime_fit_workflowset(
    data = training(splits)
    , control = control_fit_workflowset(
      allow_par = TRUE
      , verbose = TRUE
    )
  )
parallel_stop()

Gives the error:

Using existing parallel backend with 5 clusters (cores)...
 Beginning Parallel Loop | 0.005 seconds
Model 1 Error: Error in analysis(x): object 'splits' not found

Model 2 Error: Error in analysis(x): object 'splits' not found

Model 3 Error: Error in analysis(x): object 'splits' not found

Model 4 Error: Error in analysis(x): object 'splits' not found

Model 5 Error: Error in analysis(x): object 'splits' not found

Model 6 Error: Error in analysis(x): object 'splits' not found

 Finishing parallel backend. Clusters are remaining open. | 2.909 seconds
 Close clusters by running: `parallel_stop()`.
 Total time | 2.909 seconds

-- Model Failure Report ------------------------------------
# A tibble: 6 x 2
  .model_id .model
      <int> <list>
1         1 <NULL>
2         2 <NULL>
3         3 <NULL>
4         4 <NULL>
5         5 <NULL>
6         6 <NULL>

Some models failed during fitting: modeltime_fit_workflowset():
- Model 1: Is NULL.
- Model 2: Is NULL.
- Model 3: Is NULL.
- Model 4: Is NULL.
- Model 5: Is NULL.
- Model 6: Is NULL.

Action: Review any error messages.
-- End Model Failure Report --------------------------------

Yet when I do the following:

model_spec_boost <- boost_tree(
  mode  = "regression",
  mtry  = round(sqrt(ncol(training(splits)) - 1), 0),
  trees = round(sqrt(nrow(training(splits)) - 1), 0),
  min_n = round(sqrt(ncol(training(splits)) - 1), 0),
  tree_depth = round(sqrt(ncol(training(splits)) - 1), 0),
  learn_rate = 0.3,
  loss_reduction = 0.01
) %>%
  set_engine("xgboost")

wflw_fit_xgboost <- workflow() %>%
  add_recipe(recipe_num_only) %>%
  add_model(model_spec_boost) %>%
  fit(training(splits))

mdl_tbl <- modeltime_table(wflw_fit_xgboost)

calibration_tbl <- mdl_tbl %>%
  modeltime_calibrate(new_data = testing(splits))

calibration_tbl %>%
  modeltime_forecast(
    new_data = testing(splits)
    , actual_data = data_final_tbl
  ) %>%
  plot_modeltime_forecast(
    .conf_interval_show = FALSE
  )

I get a plot.

AlbertoAlmuinha commented 3 years ago

Hi @spsanderson ,

If you are getting this error, it is probably because some variable in your base recipe is not acceptable to XGBoost. Because the problem is in the base recipe, it is carried over to all the other recipes built from it. For example, if you have a date field or a factor, you should remove the date and convert the factor to dummy variables.

Here is an example reproducing your problem:

splits <- initial_time_split(
    m4_monthly
    , prop = 0.8
    , cumulative = TRUE
)

recipe_base_bad <- recipe(value ~ ., data = training(splits))

recipe_base_ok <- recipe(value ~ ., data = training(splits)) %>%
                step_rm(date) %>%
                step_dummy(all_nominal_predictors(), one_hot = TRUE)

model_spec_boost <- boost_tree(
    mode  = "regression",
    mtry  = round(sqrt(ncol(training(splits)) - 1), 0),
    trees = round(sqrt(nrow(training(splits)) - 1), 0),
    min_n = round(sqrt(ncol(training(splits)) - 1), 0),
    tree_depth = round(sqrt(ncol(training(splits)) - 1), 0),
    learn_rate = 0.3,
    loss_reduction = 0.01
) %>%
    set_engine("xgboost")

wfsets <- workflow_set(
    preproc = list(
        base          = recipe_base_ok  
    ),
    models = list(
        model_spec_boost
    ),
    cross = TRUE
)

wf_fits <- wfsets %>% 
    modeltime_fit_workflowset(
        data = training(splits)
        , control = control_fit_workflowset(
            allow_par = FALSE
            , verbose = TRUE
        )
    )

Hope it helps

spsanderson commented 3 years ago

@AlbertoAlmuinha I expect that 4 of the models will fail, since they contain a date feature and some have other non-numeric features. But the recipe I use in the second example, recipe_num_only, works on its own in the modeltime workflow, yet not inside modeltime_fit_workflowset, which is confusing to me. Working on its own but not in the workflow set does not make sense, and the error itself is also confusing: it literally says it cannot find my splits object.

AlbertoAlmuinha commented 3 years ago

@spsanderson It's difficult to say without a reprex to play a bit with the data. Yeah, the error description is not the best one, but that part is difficult to control...

spsanderson commented 3 years ago

data_tbl.xlsx juiced_recipe.xlsx

Please see attached data to help

AlbertoAlmuinha commented 3 years ago

Which recipe did you use to create the attached Excel file? It doesn't match any of the recipes in the first message.

spsanderson commented 3 years ago

recipe_num_only

AlbertoAlmuinha commented 3 years ago

I don't get that result with recipe_num_only. The first message recipe is:

recipe_num_only <- recipe_pca %>%
  step_rm(-value, -all_numeric_predictors())

In the last step you are keeping a "value" column, but there is no value column in the attached data, so I'm missing something here.

spsanderson commented 3 years ago

juiced_recipe.xlsx

Sorry, here you go; the column is now there.

The data_tbl is my original data.

AlbertoAlmuinha commented 3 years ago

@spsanderson @mdancho84

We definitely need to take a look at this, because something is going on. It should work (and in fact, if you run it sequentially, it works correctly), but the parallel functionality is not working correctly for some reason. We need to check this.

For the moment, you can use `allow_par = FALSE` to make it work.

spsanderson commented 3 years ago

@AlbertoAlmuinha thanks, I will try it right now and let you know.

That did it.

AlbertoAlmuinha commented 3 years ago

@spsanderson OK, I found the problem. Right now you can't define the model based on other variables (`splits` in this case):

model_spec_boost <- boost_tree(
    mode  = "regression",
    mtry  = round(sqrt(ncol(training(splits)) - 1), 0),
    trees = round(sqrt(nrow(training(splits)) - 1), 0),
    min_n = round(sqrt(ncol(training(splits)) - 1), 0),
    tree_depth = round(sqrt(ncol(training(splits)) - 1), 0),
    learn_rate = 0.3,
    loss_reduction = 0.01
) %>%
    set_engine("xgboost")

What happens is that these variables are not sent to the nodes where the computation is performed, so the fit fails because it cannot find them there. If you replace the variables with literal numbers, you will see that everything works correctly:

model_spec_boost <- boost_tree(
    mode  = "regression",
    mtry  = 1,
    trees = 8,
    min_n = 1,
    tree_depth = 1,
    learn_rate = 0.3,
    loss_reduction = 0.01
) %>%
    set_engine("xgboost")

Maybe we can find a solution to improve this situation.

Regards

spsanderson commented 3 years ago

Ahhh, OK. I don't think a fix for this is necessary; it would be better if those settings were computed outside of the spec process.
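One way to do that (a hedged sketch, not taken from the thread): compute the data-driven values up front and splice them into the spec with rlang's `!!` injection, so the spec stores plain numbers rather than unevaluated expressions that reference `splits`. Here `train_tbl` is a stand-in for `training(splits)`.

```r
library(parsnip)

# Stand-in for training(splits): any data frame with the outcome in one column.
train_tbl <- mtcars

# Compute the data-driven values once, in the main session.
mtry_val  <- round(sqrt(ncol(train_tbl) - 1), 0)
trees_val <- round(sqrt(nrow(train_tbl) - 1), 0)

# `!!` forces evaluation now, so the spec carries literal numbers that
# serialize cleanly to parallel workers (no reference to `splits` remains).
model_spec_boost <- boost_tree(
  mode           = "regression",
  mtry           = !!mtry_val,
  trees          = !!trees_val,
  min_n          = !!mtry_val,
  tree_depth     = !!mtry_val,
  learn_rate     = 0.3,
  loss_reduction = 0.01
) %>%
  set_engine("xgboost")
```

Assuming parsnip's quasiquotation support, this should behave like hard-coding the numbers by hand, while keeping the "derive from the training data" logic in one place.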

AlbertoAlmuinha commented 3 years ago

What do you think about this, @mdancho84? We could include an "export" argument in modeltime_fit_workflowset() that would take an object (or a named list if multiple objects are required) and export it to the nodes. The implementation would be quite easy.

Or do you prefer to leave things as they are?
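For reference, a minimal sketch of the mechanism such an argument could wrap, using base R's `parallel` package (`my_obj` is a made-up stand-in for something like `splits`): `clusterExport()` copies named objects from the main session into each worker's global environment before the parallel loop runs.

```r
library(parallel)

cl <- makeCluster(2)

my_obj <- 42  # stand-in for an object like `splits`

# Without this export, the workers would fail with
# "object 'my_obj' not found", just like the error in this issue.
clusterExport(cl, varlist = "my_obj")

res <- parSapply(cl, 1:2, function(i) my_obj + i)

stopCluster(cl)
res  # c(43, 44)
```

If the internal loop is foreach-based, the analogous knob would be foreach's `.export` argument rather than a manual `clusterExport()` call.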

mdancho84 commented 3 years ago

It's a unique case but we can add an exports arg inside of the control objects. That might help.