business-science / modeltime.h2o

Forecasting with H2O AutoML. Use the H2O Automatic Machine Learning algorithm as a backend for Modeltime Time Series Forecasting.
https://business-science.github.io/modeltime.h2o/

Develop H2O Regression Algorithms #2

Open mdancho84 opened 3 years ago

mdancho84 commented 3 years ago

Here's a minimal example based on Shafi's code. We can convert this into tidymodels format once we agree on the process being shown.

# MVP EXAMPLE ----
# automl_reg() function

# Libraries ----
library(modeltime)
library(tidymodels)
library(h2o)
library(tidyverse)
library(timetk)

# Data ----
# - This is before modeltime

data_tbl <- walmart_sales_weekly %>% 
    select(id, Date, Weekly_Sales)

splits <- timetk::time_series_split(data_tbl, assess = "3 month", cumulative = TRUE)

recipe_spec <- recipe(Weekly_Sales ~ ., data = training(splits)) %>%
    step_timeseries_signature(Date) 

train_tbl <- rsample::training(splits) %>% bake(prep(recipe_spec), .)
test_tbl  <- rsample::testing(splits) %>% bake(prep(recipe_spec), .)

# H2O Initialization ----
# - User will set up H2O 
h2o.init(
    max_mem_size = "1000G", 
    nthreads = -1, 
    ip = "localhost", 
    port = 54321
)
#>  Connection successful!
#> 
#> R is connected to the H2O cluster: 
#>     H2O cluster uptime:         2 days 21 hours 
#>     H2O cluster timezone:       America/New_York 
#>     H2O data parsing timezone:  UTC 
#>     H2O cluster version:        3.32.0.1 
#>     H2O cluster version age:    4 months and 27 days !!! 
#>     H2O cluster name:           H2O_started_from_R_mdancho_gyx565 
#>     H2O cluster total nodes:    1 
#>     H2O cluster total memory:   7.90 GB 
#>     H2O cluster total cores:    12 
#>     H2O cluster allowed cores:  12 
#>     H2O cluster healthy:        TRUE 
#>     H2O Connection ip:          localhost 
#>     H2O Connection port:        54321 
#>     H2O Connection proxy:       NA 
#>     H2O Internal Security:      FALSE 
#>     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4 
#>     R Version:                  R version 4.0.2 (2020-06-22)
#> Warning in h2o.clusterInfo(): 
#> Your H2O cluster version is too old (4 months and 27 days)!
#> Please download and install the latest version from http://h2o.ai/download/

# MODELTIME WORKFLOW ----
# - This is where Modeltime H2O takes over

# Spec - Package API will handle this step

# * automl_reg() %>% set_engine("h2o") ----
# - I doubt this function needs hyperparams
# - Users can use set_engine() to specify any args

# * fit() ----
# - Will handle preparing as H2O Frame, training the automl, storing either a leaderboard or a subset of models

# ** Prep data 
train_tbl <- train_tbl %>%
    # H2O doesn't like ordered factors
    mutate_if(is.ordered, function(x) factor(x, ordered = FALSE))

# ** Convert to H2O Frame
train_h2o <- as.h2o(train_tbl)
#> |======================================================================| 100%

y <- "Weekly_Sales"
x <- setdiff(names(train_h2o), y)

aml_results <- h2o.automl(
    x = x, y = y, 

    # Data Specifications - 
    # - I recommend using only a Training Frame 
    # - This lets CV do the validation
    training_frame = train_h2o, 
    # validation_frame = valid,
    # leaderboard_frame = test, 

    # User Defined Args
    max_runtime_secs = 30, 
    max_runtime_secs_per_model = 30,

    project_name = 'project_01',

    nfolds        = 5,
    max_models    = 1000,
    exclude_algos = c("DeepLearning"),
    seed          =  786
)
#> 06:58:55.409: New models will be added to existing leaderboard project_01@@Weekly_Sales (leaderboard frame=null) with already 23 models.
#> 06:59:06.476: StackedEnsemble_BestOfFamily_AutoML_20210308_065855 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 06:59:07.481: StackedEnsemble_AllModels_AutoML_20210308_065855 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 07:09:49.460: New models will be added to existing leaderboard project_01@@Weekly_Sales (leaderboard frame=null) with already 33 models.
#> 07:10:00.529: StackedEnsemble_BestOfFamily_AutoML_20210308_070949 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 07:10:01.536: StackedEnsemble_AllModels_AutoML_20210308_070949 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 07:19:21.102: New models will be added to existing leaderboard project_01@@Weekly_Sales (leaderboard frame=null) with already 43 models.
#> 07:19:32.172: StackedEnsemble_BestOfFamily_AutoML_20210308_071921 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 07:19:33.178: StackedEnsemble_AllModels_AutoML_20210308_071921 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 07:20:19.518: New models will be added to existing leaderboard project_01@@Weekly_Sales (leaderboard frame=null) with already 53 models.
#> 07:20:46.723: StackedEnsemble_BestOfFamily_AutoML_20210308_072019 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 07:20:47.728: StackedEnsemble_AllModels_AutoML_20210308_072019 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 07:31:42.782: New models will be added to existing leaderboard project_01@@Weekly_Sales (leaderboard frame=null) with already 108 models.
#> 07:32:10.987: StackedEnsemble_BestOfFamily_AutoML_20210308_073142 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 07:32:11.993: StackedEnsemble_AllModels_AutoML_20210308_073142 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 07:33:52.726: New models will be added to existing leaderboard project_01@@Weekly_Sales (leaderboard frame=null) with already 148 models.
#> 07:34:20.984: StackedEnsemble_BestOfFamily_AutoML_20210308_073352 [StackedEnsemble best (built using top model from each algorithm type)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> 07:34:21.989: StackedEnsemble_AllModels_AutoML_20210308_073352 [StackedEnsemble all (built using all AutoML models)] failed: water.exceptions.H2OIllegalArgumentException: Failed to find the xval predictions frame. . .  Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted.
#> |======================================================================| 100%

# Returns many models organized by a Leaderboard
aml_results
#> AutoML Details
#> ==============
#> Project Name: project_01 
#> Leader Model ID: XGBoost_grid__1_AutoML_20210308_073352_model_17 
#> Algorithm: xgboost 
#> 
#> Total Number of Models Trained: 190 
#> Start Time: 2021-03-08 07:33:53 UTC 
#> End Time: 2021-03-08 07:34:22 UTC 
#> Duration: 29 s
#> 
#> Leaderboard
#> ===========
#>                                           model_id mean_residual_deviance
#> 1  XGBoost_grid__1_AutoML_20210308_073352_model_17               32430887
#> 2  XGBoost_grid__1_AutoML_20210308_073142_model_17               32430887
#> 3   XGBoost_grid__1_AutoML_20210308_073352_model_3               34331133
#> 4   XGBoost_grid__1_AutoML_20210308_073142_model_3               34331133
#> 5  XGBoost_grid__1_AutoML_20210308_073352_model_10               35345362
#> 6  XGBoost_grid__1_AutoML_20210308_073142_model_10               35345362
#> 7                     GBM_1_AutoML_20210308_073352               35463702
#> 8                     GBM_1_AutoML_20210308_073142               35463702
#> 9  XGBoost_grid__1_AutoML_20210308_073352_model_20               35975536
#> 10                XGBoost_3_AutoML_20210308_073352               36178176
#>        rmse      mse      mae     rmsle
#> 1  5694.812 32430887 3284.597 0.1272254
#> 2  5694.812 32430887 3284.597 0.1272254
#> 3  5859.278 34331133 3628.706 0.1437346
#> 4  5859.278 34331133 3628.706 0.1437346
#> 5  5945.197 35345362 3672.445 0.1501596
#> 6  5945.197 35345362 3672.445 0.1501596
#> 7  5955.141 35463702 3634.803 0.1482343
#> 8  5955.141 35463702 3634.803 0.1482343
#> 9  5997.961 35975536 3649.308 0.1425722
#> 10 6014.830 36178176 3731.129 0.1543888
#> 
#> [190 rows x 6 columns]

# View the AutoML Leaderboard
lb <- aml_results@leaderboard

# Get the best model ID
model_id_lb_1 <- as_tibble(lb) %>% slice(1) %>% pull(model_id) 

# * predict() ----

# ** Prep data 
test_tbl <- test_tbl %>%
    # H2O doesn't like ordered factors
    mutate_if(is.ordered, function(x) factor(x, ordered = FALSE))

test_h2o   <- as.h2o(test_tbl)
#> |======================================================================| 100%

model_lb_1 <- h2o.getModel(model_id_lb_1)

preds <- predict(model_lb_1, test_h2o) %>% as_tibble() %>% pull(predict)
#> |======================================================================| 100%

# Results ----
test_tbl %>%
    mutate(preds = preds) %>%
    pivot_longer(cols = c(Weekly_Sales, preds)) %>%
    group_by(id) %>%
    plot_time_series(
        Date, value, .color_var = name, 
        .smooth = FALSE,
        .facet_ncol = 2,
        .interactive = FALSE
    )

Created on 2021-03-08 by the reprex package (v1.0.0)

Shafi2016 commented 3 years ago

If we want to extract other models from the leaderboard, we can do it as follows:

model_ids <- as.vector(aml_results@leaderboard$model_id)

# to get second model from leaderboard
index <- 2
model_2 <- h2o.getModel(model_ids[index])

preds <- predict(model_2, test_h2o) %>% as_tibble() %>% pull(predict)
AlbertoAlmuinha commented 3 years ago

I've made a proof of concept and implemented a very first version of automl_reg() (still without the predict functionality). The problem is that I haven't been able to reproduce the results given in this issue: the best model I get does not match the one shown here, which should not happen because a seed is being used. When I execute the following code (and also the original posted in this issue):

model <- automl_reg(mode = 'regression') %>%
    parsnip::set_engine(
        'h2o',
        max_runtime_secs           = 30, 
        max_runtime_secs_per_model = 30,
        project_name               = 'project_01',
        nfolds                     = 5,
        max_models                 = 1000,
        exclude_algos              = c("DeepLearning"),
        seed                       = 786
    ) %>%
    parsnip::fit(Weekly_Sales ~ ., data = train_tbl)

some messages appear that I don't see in Matt's example. For example:

1. "AutoML: XGBoost is not available; skipping it"
2. "Failed to find the xval predictions frame. . . Looks like keep_cross_validation_predictions wasn't set when building the models, or the frame was deleted."

Could you share your sessionInfo() so we can see what is happening? I also attach the results I get (aml_results).
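For what it's worth, the "xval predictions frame" messages suggest the stacked ensembles need the base models' cross-validation predictions. A rough, untested sketch of a workaround, assuming set_engine() simply forwards extra arguments to h2o.automl():

# Sketch (assumes extra engine args are passed straight to h2o.automl):
# keeping the CV predictions should let the StackedEnsemble models build.
model <- automl_reg(mode = 'regression') %>%
    parsnip::set_engine(
        'h2o',
        max_models                        = 20,
        nfolds                            = 5,
        keep_cross_validation_predictions = TRUE,
        exclude_algos                     = c("DeepLearning"),
        seed                              = 786
    ) %>%
    parsnip::fit(Weekly_Sales ~ ., data = train_tbl)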

mdancho84 commented 3 years ago

@AlbertoAlmuinha I'm not super concerned about the reproducibility. We can solve this down the road. A couple of reasons for the differences: each time you run AutoML in a session for a given project, the results get added to the existing leaderboard, which could be why we are getting different results. Also, we can always ask Erin LeDell (she's available in our Slack channel).
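A quick way to rule out leaderboard accumulation is to start each run from a clean slate. A rough sketch (note that h2o.removeAll() wipes all frames and models on the cluster, so use it with care):

# Option 1: clear the cluster between runs (removes ALL frames and models)
h2o.removeAll()

# Option 2: give each run its own project so it starts a fresh leaderboard
project_name <- paste0("project_", format(Sys.time(), "%Y%m%d_%H%M%S"))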

mdancho84 commented 3 years ago

@AlbertoAlmuinha please feel free to submit a pull request once you have your code ready for review.

AlbertoAlmuinha commented 3 years ago

@mdancho84 Sure, count on a PR by the end of this week (possibly sooner, but I don't want to overcommit).

Shafi2016 commented 3 years ago

@AlbertoAlmuinha I think you are getting different results because you could not run the XGBoost models. H2O's XGBoost does not run on Windows. Also, the web-based H2O cluster UI (Flow, at http://localhost:54321/flow) should be running once h2o.init() has been called.
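A quick way to confirm this is to ask the cluster directly and, if XGBoost isn't available, exclude it on all machines so the runs stay comparable. A small sketch:

# Check whether this H2O cluster supports XGBoost (it doesn't on Windows)
if (!h2o.xgboost.available()) {
    # Exclude XGBoost everywhere so results are comparable across machines
    exclude_algos <- c("DeepLearning", "XGBoost")
}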

mdancho84 commented 3 years ago

H2O AutoML Workflow

Here's an example that I've adapted from @AlbertoAlmuinha's work and @Shafi2016's MVP examples. Take a look and let me know what you think.


# LIBRARIES ----

library(tidymodels)
library(modeltime.h2o)
library(tidyverse)
library(timetk)

# DATA ----

data_tbl <- walmart_sales_weekly %>%
    select(id, Date, Weekly_Sales)

# PREP ----

splits <- timetk::time_series_split(data_tbl, assess = "3 month", cumulative = TRUE)

recipe_spec <- recipe(Weekly_Sales ~ ., data = training(splits)) %>%
    step_timeseries_signature(Date)

train_tbl <- rsample::training(splits) %>% bake(prep(recipe_spec), .)
test_tbl  <- rsample::testing(splits) %>% bake(prep(recipe_spec), .)

# H2O INIT ----

h2o.init(
    nthreads = -1,
    ip = 'localhost',
    port = 54321
)

# MODEL SPEC ----

model_spec <- automl_reg(mode = 'regression') %>%
    parsnip::set_engine(
        engine                     = 'h2o',
        max_runtime_secs           = 30, 
        max_runtime_secs_per_model = 30,
        project_name               = 'project_01',
        nfolds                     = 5,
        max_models                 = 1000,
        exclude_algos              = c("DeepLearning"),
        seed                       = 786
    ) 

model_spec

# TRAINING ----

model_fitted <- model_spec %>%
    fit(Weekly_Sales ~ ., data = train_tbl)

model_fitted

# PREDICT ----

predict(model_fitted, test_tbl)

# MODELTIME ----

modeltime_table(
    model_fitted
) %>%
    modeltime_calibrate(test_tbl) %>%
    modeltime_forecast(
        new_data    = test_tbl,
        actual_data = data_tbl,
        keep_data   = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(.facet_ncol = 2)

# SAVE / LOAD ----

model_fitted %>% save_h2o_model(path = "../model_fitted", overwrite = TRUE)

# Test shutting down the h2o cluster and restarting it
h2o.shutdown(prompt = FALSE) 
h2o.init(
    nthreads = -1,
    ip = 'localhost',
    port = 54321
)

load_h2o_model(path = "../model_fitted/") %>% predict(test_tbl)

# REFIT ----

data_prepared_tbl <- bind_rows(train_tbl, test_tbl)

refit_tbl <- modeltime_table(
    model_fitted
) %>%
    modeltime_refit(data_prepared_tbl)

# FUTURE FORECAST -----

future_tbl <- data_prepared_tbl %>%
    group_by(id) %>%
    future_frame(.length_out = "1 year") %>%
    ungroup()

future_prepared_tbl <- bake(prep(recipe_spec), future_tbl)

refit_tbl %>%
    modeltime_forecast(
        new_data    = future_prepared_tbl,
        actual_data = data_prepared_tbl,
        keep_data   = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(.facet_ncol = 2)

We can likely do better than this if we train longer, but it's good enough for a quick example.
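For a quick numeric check on the holdout before deciding whether a longer run is worthwhile, the same calibrated table can report test-set accuracy:

# Out-of-sample accuracy metrics on the test split
modeltime_table(model_fitted) %>%
    modeltime_calibrate(test_tbl) %>%
    modeltime_accuracy()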


ledell commented 3 years ago

@AlbertoAlmuinha Regarding reproducibility -- H2O AutoML will only be reproducible if the following criteria are met (your code is doing 1 & 2 but not 3):

  1. Set exclude_algos = c("DeepLearning"). H2O DNNs use a speed-up technique called HOGWILD!, which is not reproducible unless H2O is used on a single core. Using a single core would not be very efficient, so under "normal" (multi-core) circumstances, H2O DNNs are not reproducible and we need to turn them off.
  2. Set a seed.
  3. Use only max_models and do not use max_runtime_secs or max_runtime_secs_per_model. If you limit by runtime, each run may be able to perform a slightly different amount of training (it depends on what else is running on your machine at the time), and therefore the set of AutoML models may be slightly different. If you use max_models, then we know exactly how many models will be trained and the same ones will be trained each time (assuming you set a seed). See the sketch after this list.
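Concretely, a reproducible spec would look roughly like this (a sketch, assuming the extra set_engine() arguments are passed through to h2o.automl() unchanged):

# Sketch of a reproducible AutoML spec: a fixed seed plus max_models only,
# no runtime limits, and DeepLearning excluded (HOGWILD! is not reproducible
# on multiple cores)
model_spec_repro <- automl_reg(mode = 'regression') %>%
    parsnip::set_engine(
        engine        = 'h2o',
        max_models    = 25,
        nfolds        = 5,
        exclude_algos = c("DeepLearning"),
        seed          = 786
    )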
dmresearch15 commented 1 year ago

The objective is to forecast using additional predictors, including date-related predictors.

recipe_spec <- recipe(Weekly_Sales ~ ., data = training(splits)) %>% step_timeseries_signature(Date)

The code above creates predictors related to Date. I'm assuming that there is another predictor, say X.

FUTURE FORECAST -----

If one needs to forecast 1 year ahead after refitting the model(s) on the full data set (training + testing), then one also needs the X values for that future year, in addition to the Date-related predictors, so those future X values have to be collected or prepared as well. Is there any idea, or any implementation in modeltime, to simulate/predict the 1-year-ahead X values?
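In case a sketch helps frame what I'm asking: the Date-derived features can come from the recipe, but the future values of the hypothetical predictor X would have to be supplied (if known in advance) or forecast separately, then joined onto the future frame before baking. Roughly (future_x_tbl is a hypothetical table of id, Date, and X covering the next year):

future_tbl <- data_prepared_tbl %>%
    group_by(id) %>%
    future_frame(.date_var = Date, .length_out = "1 year") %>%
    ungroup()

# Join the externally supplied / separately forecasted X values, then bake
# the recipe to add the Date-based signature features
future_prepared_tbl <- future_tbl %>%
    left_join(future_x_tbl, by = c("id", "Date")) %>%
    bake(prep(recipe_spec), new_data = .)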