mdancho84 opened this issue 3 years ago
If we want to extract other models from the leaderboard, we can do the following:
model_ids <- as.vector(aml_results@leaderboard$model_id)

# Get the second model from the leaderboard
index   <- 2
model_2 <- h2o.getModel(model_ids[index])

preds <- predict(model_2, test_h2o) %>% as_tibble() %>% pull(predict)
I've made a proof of concept and have implemented a very first version of automl_reg (still without the predict functionality). The problem is that I have realized I am not able to reproduce the results given in this issue: the best model I get does not match this best model (which should not happen, because a seed is being used). I have observed that when I execute the following code (and also the original copied in this issue):
model <- automl_reg(mode = 'regression') %>%
    parsnip::set_engine('h2o',
                        max_runtime_secs = 30,
                        max_runtime_secs_per_model = 30,
                        project_name = 'project_01',
                        nfolds = 5,
                        max_models = 1000,
                        exclude_algos = c("DeepLearning"),
                        seed = 786) %>%
    parsnip::fit(Weekly_Sales ~ ., data = train_tbl)
some messages appear that I don't see in Matt's example. For example:

1 - "AutoML: XGBoost is not available; skipping it"
2 - "Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted."

Could you share your sessionInfo() to see what is happening? I also attach the results I get.
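As an aside on message 2: keep_cross_validation_predictions is an h2o.automl() argument, and since the other engine arguments above appear to be forwarded straight to h2o.automl(), a guess (an assumption, not confirmed in this thread) is that it can be passed the same way:

# Sketch (assumption: set_engine() forwards this argument to h2o.automl())
model <- automl_reg(mode = 'regression') %>%
    parsnip::set_engine('h2o',
                        nfolds = 5,
                        keep_cross_validation_predictions = TRUE,  # assumption
                        seed = 786) %>%
    parsnip::fit(Weekly_Sales ~ ., data = train_tbl)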
@AlbertoAlmuinha I'm not super concerned about the reproducibility. We can solve this down the road. A couple of reasons for the differences: each time you run AutoML in a session for a given project, the results get added to the leaderboard. This could be why we are getting different results. Also, we can always ask Erin LeDell (she's available in our Slack channel).
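For example, a minimal sketch of keeping leaderboards separate between runs (the timestamp-based name is just one illustrative way to do it):

# Sketch: a fresh project_name per run keeps each leaderboard isolated
project_name <- paste0("project_", format(Sys.time(), "%Y%m%d_%H%M%S"))

model_spec <- automl_reg(mode = 'regression') %>%
    parsnip::set_engine('h2o',
                        project_name = project_name,  # unique per run
                        nfolds       = 5,
                        seed         = 786)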
@AlbertoAlmuinha please feel free to submit a pull request once you have your code ready for review.
@mdancho84 Sure, count on a PR by the end of this week (possibly sooner, but I don't want to overpromise).
@AlbertoAlmuinha I think you are getting different results because you could not run the XGBoost model. H2O XGBoost does not run on Windows. The web-based H2O cluster (http://localhost:54321/flow) should be running when we use h2o.init().
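For reference, h2o ships an availability check, h2o.xgboost.available(); a sketch of using it to keep runs comparable across platforms:

h2o.init()

# Returns FALSE on platforms without XGBoost support (e.g. Windows)
h2o.xgboost.available()

# Sketch: exclude XGBoost explicitly wherever it is unavailable
exclude <- c("DeepLearning")
if (!h2o.xgboost.available()) exclude <- c(exclude, "XGBoost")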
Here's an example that I've adapted from @AlbertoAlmuinha's work and @Shafi2016's MVP examples. Take a look and let me know what you think.
Notes:

- The save/load functions follow the pattern used in modeltime.gluonts.
- You may want to change the file path to relocate your H2O model & workflow objects.

# LIBRARIES ----
library(tidymodels)
library(modeltime.h2o)
library(tidyverse)
library(timetk)
# DATA ----
data_tbl <- walmart_sales_weekly %>%
    select(id, Date, Weekly_Sales)
# PREP ----
splits <- timetk::time_series_split(data_tbl, assess = "3 month", cumulative = TRUE)
recipe_spec <- recipe(Weekly_Sales ~ ., data = training(splits)) %>%
    step_timeseries_signature(Date)
train_tbl <- rsample::training(splits) %>% bake(prep(recipe_spec), .)
test_tbl <- rsample::testing(splits) %>% bake(prep(recipe_spec), .)
# H2O INIT ----
h2o.init(
    nthreads = -1,
    ip       = 'localhost',
    port     = 54321
)
# MODEL SPEC ----
model_spec <- automl_reg(mode = 'regression') %>%
    parsnip::set_engine(
        engine = 'h2o',
        max_runtime_secs = 30,
        max_runtime_secs_per_model = 30,
        project_name = 'project_01',
        nfolds = 5,
        max_models = 1000,
        exclude_algos = c("DeepLearning"),
        seed = 786
    )
model_spec
# TRAINING ----
model_fitted <- model_spec %>%
    fit(Weekly_Sales ~ ., data = train_tbl)
model_fitted
# PREDICT ----
predict(model_fitted, test_tbl)
# MODELTIME ----
modeltime_table(
    model_fitted
) %>%
    modeltime_calibrate(test_tbl) %>%
    modeltime_forecast(
        new_data    = test_tbl,
        actual_data = data_tbl,
        keep_data   = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(.facet_ncol = 2)
# SAVE / LOAD ----
model_fitted %>% save_h2o_model(path = "../model_fitted", overwrite = TRUE)
# Test shutting down the h2o cluster and restarting it
h2o.shutdown(prompt = FALSE)
h2o.init(
    nthreads = -1,
    ip       = 'localhost',
    port     = 54321
)
load_h2o_model(path = "../model_fitted/") %>% predict(test_tbl)
# REFIT ----
data_prepared_tbl <- bind_rows(train_tbl, test_tbl)
refit_tbl <- modeltime_table(
    model_fitted
) %>%
    modeltime_refit(data_prepared_tbl)
# FUTURE FORECAST -----
future_tbl <- data_prepared_tbl %>%
    group_by(id) %>%
    future_frame(.length_out = "1 year") %>%
    ungroup()
future_prepared_tbl <- bake(prep(recipe_spec), future_tbl)
refit_tbl %>%
    modeltime_forecast(
        new_data    = future_prepared_tbl,
        actual_data = data_prepared_tbl,
        keep_data   = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(.facet_ncol = 2)
We can likely do better than this if we train longer, but it's good for a quick example.
@AlbertoAlmuinha Regarding reproducibility -- H2O AutoML will only be reproducible if the following criteria are met (your code is doing 1 & 2 but not 3; see the sketch after the list):

1. Set a seed.
2. Use exclude_algos = c("DeepLearning"). H2O DNNs use a speed-up technique called HOGWILD!, which is not reproducible unless H2O is used on a single core. Using a single core would not be very efficient, so under "normal" (multi-core) circumstances, H2O DNNs are not reproducible and we need to turn them off.
3. Use max_models and do not use max_runtime_secs or max_runtime_secs_per_model. If you limit by runtime, each run may be able to perform a slightly different amount of training (it depends on what else is running on your machine at the time), and therefore the set of AutoML models may be slightly different. If you use max_models, then we know exactly how many models can be trained and the same ones will be trained each time (assuming you set a seed).
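Putting the three criteria together, a minimal sketch of a reproducible spec (max_models = 10 is an illustrative value only):

model_spec_repro <- automl_reg(mode = 'regression') %>%
    parsnip::set_engine(
        engine        = 'h2o',
        max_models    = 10,                 # fixed model count, no runtime limits
        exclude_algos = c("DeepLearning"),  # HOGWILD! DNNs are not reproducible
        nfolds        = 5,
        seed          = 786                 # same seed on every run
    )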
The objective is to forecast using additional predictors, including date-related predictors.
recipe_spec <- recipe(Weekly_Sales ~ ., data = training(splits)) %>% step_timeseries_signature(Date)
The code above has predictors related to Date. I'm assuming that there is another predictor, say X.
If one needs to forecast 1 year ahead after refitting the model(s) on the full data set (training + testing), then one also needs the future 1-year values of X in addition to the date-related predictors, so the future X values must be collected or prepared as well. Is there any idea, or any implementation in modeltime, to simulate/predict the 1-year X values?
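To make the question concrete, one possible approach (a sketch only; the predictor column X is hypothetical, and a single weekly series is assumed for brevity) is to forecast X itself and attach the result to the future frame:

library(forecast)

# Sketch: model the hypothetical predictor X on the full history,
# then project it 52 weeks (~1 year) ahead
x_ts     <- ts(data_prepared_tbl$X, frequency = 52)
x_future <- forecast(ets(x_ts), h = 52)$mean

# Attach the simulated future X values to the prepared future frame
future_prepared_tbl <- future_prepared_tbl %>%
    dplyr::mutate(X = as.numeric(x_future))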
Here's a minimal example based on Shafi's code. We can convert this into tidymodels format once we agree on the process being shown.

Created on 2021-03-08 by the reprex package (v1.0.0)