business-science / modeltime

Modeltime unlocks time series forecast models and machine learning in one framework
https://business-science.github.io/modeltime/

recursive() workflow - xgboost failure #75

Closed Teett closed 3 years ago

Teett commented 3 years ago

Hi. I've been experimenting with the new recursive() function quite a lot for panel data, and I think the calibration process might not be working properly when using a recipe specification, though maybe I'm doing it incorrectly. Leaving a reproducible example here:

# Libraries & Setup ----
library(modeltime)
library(tidymodels)
library(tidyverse)
library(lubridate)
library(timetk)
library(slider)

FORECAST_HORIZON <- 24

m4_extended <- m4_monthly %>%
    group_by(id) %>%
    future_frame(
        .length_out = FORECAST_HORIZON,
        .bind_data  = TRUE
    ) %>%
    ungroup()

# TRANSFORM FUNCTION ----
# - NOTE - We create lags by group
lag_roll_transformer_grouped <- function(data){
    data %>%
        group_by(id) %>%
        tk_augment_lags(value, .lags = 1:FORECAST_HORIZON) %>%
        tk_augment_slidify(
          .value   = contains("lag12"),
          .f       = ~mean(.x, na.rm = T),
          .period  = c(12),
          .partial = TRUE
        ) %>%
        ungroup()
}

m4_lags <- m4_extended %>%
    lag_roll_transformer_grouped()

train_data <- m4_lags %>%
    drop_na()

future_data <- m4_lags %>%
    filter(is.na(value))

splits <- train_data %>%
  time_series_split(date, assess = FORECAST_HORIZON, cumulative = TRUE)

xgb_spec <- boost_tree(mode = "regression",
           learn_rate = 0.35) %>% 
  set_engine("xgboost")

recipe_spec <- recipe(value ~ ., data = training(splits)) %>% 
  step_timeseries_signature(date) %>% 
  step_rm(matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(am.pm)")) %>%
  step_rm(date) %>% 
  step_normalize(date_index.num) %>% 
  step_dummy(all_nominal(), one_hot = TRUE)

# Recipe diagnostics
recipe_spec %>% summary()
## After the recipe
prep(recipe_spec) %>% juice() %>% glimpse()

# Modeling Autoregressive Panel Data
set.seed(123)
model_fit_lm_recursive <- workflow() %>% 
  add_model(xgb_spec) %>%
  add_recipe(recipe_spec) %>% 
  fit(
        data = training(splits)
    ) %>%
    recursive(
        id         = "id", # We add an id = "id" to specify the groups
        transform  = lag_roll_transformer_grouped,
        # We use panel_tail() to grab tail by groups
        train_tail = panel_tail(train_data, id, FORECAST_HORIZON)
    )

# Models table
models_tbl <- modeltime_table(
    model_fit_lm_recursive
) 

# Calibrate

calibration_tbl <- models_tbl %>%
    modeltime_calibrate(new_data = testing(splits),
                        quiet = FALSE)

During the calibration process, the following error appears:

Error: 

-- Model Calibration Failure Report ------------------------
# A tibble: 1 x 4
  .model_id .model     .model_desc .nested.col
      <int> <list>     <chr>       <lgl>      
1         1 <rcrsv_pn> XGBOOST     NA         
All models failed Modeltime Calibration:
- Model 1: Failed Calibration.

Potential Solution: Use `modeltime_calibrate(quiet = FALSE)` AND Check the Error/Warning Messages for clues as to why your model(s) failed calibration.
-- End Model Calibration Failure Report --------------------

Error: All models failed Modeltime Calibration.
Run `rlang::last_error()` to see where the error occurred.

I've noticed the .model column in models_tbl changes from <fit[+]> to <rcrsv_pn> when using a workflow with a recipe. Maybe this is a clue?
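A quick way to see this (just a sketch using the objects above):

models_tbl
# The .model column shows <rcrsv_pn> rather than <fit[+]>
models_tbl %>% dplyr::pull(.model) %>% purrr::pluck(1) %>% class()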

mdancho84 commented 3 years ago

From what I can tell we are losing the "id" column somewhere along the way. Will need to investigate further.

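One way to see it (a sketch, not the original screenshot): bake the prepped recipe from the reprex above and check which id-related columns survive preprocessing.

recipe_spec %>%
    prep() %>%
    bake(new_data = testing(splits)) %>%
    dplyr::select(dplyr::starts_with("id")) %>%
    names()
# Only dummy columns (id_M1, id_M1000, ...) remain; the plain `id` column
# is gone from the processed predictors.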

AlbertoAlmuinha commented 3 years ago

Hi @mdancho84 ,

The problem is in the predict_recursive_panel_workflow() function; the issue is related to these lines:

    blueprint <- workflow$pre$mold$blueprint
    forged    <- hardhat::forge(new_data, blueprint)
    new_data  <- forged$predictors

The problem is basically that the variable "id" is eliminated in the recipe: step_dummy(all_nominal(), one_hot = TRUE) converts id into dummy columns such as id_M1, id_M1000, etc.
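A minimal sketch of what those internal lines end up seeing (here wflw_fit stands for the plain fitted workflow from fit(training(splits)), before recursive() is applied; the name is just for illustration):

blueprint <- wflw_fit$pre$mold$blueprint
forged    <- hardhat::forge(testing(splits), blueprint)
names(forged$predictors)
# The forged predictors include id_M1, id_M1000, ... but no plain `id`
# column, so the recursive panel predictions can't be split by group.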

We need to think of a way to deal with this situation.

Regards,

mdancho84 commented 3 years ago

Great observation @AlbertoAlmuinha. The latest commit should fix it. The problem is that xgboost is really picky about how the columns need to be provided.

Here's a working example:

# Libraries & Setup ----
library(modeltime)
library(tidymodels)
library(tidyverse)
library(lubridate)
library(timetk)
library(slider)

FORECAST_HORIZON <- 24

m4_extended <- m4_monthly %>%
    group_by(id) %>%
    future_frame(
        .length_out = FORECAST_HORIZON,
        .bind_data  = TRUE
    ) %>%
    ungroup()
#> .date_var is missing. Using: date

# TRANSFORM FUNCTION ----
# - NOTE - We create lags by group
lag_roll_transformer_grouped <- function(data){
    data %>%
        group_by(id) %>%
        tk_augment_lags(value, .lags = 1:FORECAST_HORIZON) %>%
        tk_augment_slidify(
            .value   = contains("lag12"),
            .f       = ~mean(.x, na.rm = T),
            .period  = c(12, 24, 36),
            .partial = TRUE
        ) %>%
        ungroup()
}

m4_lags <- m4_extended %>%
    lag_roll_transformer_grouped()

train_data <- m4_lags %>%
    drop_na()

future_data <- m4_lags %>%
    filter(is.na(value))

splits <- train_data %>%
    time_series_split(date, assess = FORECAST_HORIZON, cumulative = TRUE)

xgb_spec <- boost_tree(
        mode       = "regression",
        learn_rate = 0.35
    ) %>% 
    set_engine("xgboost")

recipe_spec <- recipe(value ~ ., data = training(splits)) %>% 
    step_timeseries_signature(date) %>% 
    step_rm(matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(am.pm)")) %>%
    step_rm(date) %>% 
    step_normalize(date_index.num) %>% 
    step_dummy(all_nominal(), one_hot = TRUE)

# Modeling Autoregressive Panel Data
set.seed(123)
model_fit_xgb_recursive <- workflow() %>% 
    add_model(xgb_spec) %>%
    add_recipe(recipe_spec) %>% 
    fit(training(splits)) %>%
    recursive(
        id         = "id", # We add an id = "id" to specify the groups
        transform  = lag_roll_transformer_grouped,
        # We use panel_tail() to grab tail by groups
        train_tail = panel_tail(training(splits), id, FORECAST_HORIZON)
    )

# Models table
models_tbl <- modeltime_table(
    model_fit_xgb_recursive
) 

# Calibrate

calibration_tbl <- models_tbl %>%
    modeltime_calibrate(new_data = testing(splits),
                        quiet = FALSE)

calibration_tbl %>%
    modeltime_forecast(
        new_data    = testing(splits), 
        actual_data = m4_lags %>% drop_na(),
        keep_data   = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(.interactive = FALSE, .facet_ncol = 2)

Created on 2021-03-22 by the reprex package (v1.0.0)
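As a follow-up check (not part of the reprex above), the accuracy metrics can be pulled from the calibration table to confirm calibration now succeeds:

calibration_tbl %>%
    modeltime_accuracy() %>%
    table_modeltime_accuracy(.interactive = FALSE)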

mdancho84 commented 3 years ago

I believe this is resolved. I'm closing now.