modeltime_fit_resamples() automatically 'lags' the .row column

business-science / modeltime.resample

Resampling Tools for Time Series Forecasting with Modeltime

Other

19 stars 5 forks source link

The out-of-sample predictions generated by modeltime_fit_resamples() incorrectly show the predicted variable as if it was lagged.

When I run the following code it outputs out-of-sample projections, as expected :

  submodels_resamples_tscv_tbl <- submodels_tbl %>% 
    modeltime_fit_resamples(
      resamples    = cv_resamples
    )

However, a closer inspection of the output provided by modeltime_fit_resamples() shows that the predicted variable is not indexed by the same ".row" (id column) as it was inside "cv_resamples"


# Input received by modeltime_fit_resamples()
input <- cv_resamples %>%
      filter(id == "Slice01") %>% 
      pull(splits) %>%
      first() %>% 
      training() %>% 
      select(.row, value)

input %>% filter(.row >= 1816, .row <= 1825 ) %>% print()

# Output generated by modeltime_fit_resamples()

output <- submodels_resamples_tscv_tbl %>% 
    select(.model_desc, .resample_results) %>% 
    unnest(.resample_results) %>% 
    select(.model_desc, id, .predictions) %>% 
    unnest(.predictions)

output %>% select(.row, value) %>%  filter(.row >= 1816, .row <= 1825 ) %>% print()

If we plot those variables, we can clearly see that the test set was lagged by a few days. The first slice inside 'cv_resamples' includes the test and the training set, both depicted in blue. The output, includes the out-of-sample projections for the test set but we ignore those and plot only the realized value of the target variable, in red. I include the code below just for completeness.

ggplot() +
      geom_line(data = output, aes(x = .row, y = value, color = "input")) +
      geom_line(data = input,  aes(x = .row, y = value, color = "output"))

library(tidymodels) library(modeltime) library(modeltime.resample) library(timetk) library(tidyverse) library(tidyquant) full_data_tbl <- walmart_sales_weekly %>% select(id, Date, Weekly_Sales) %>% # Apply Group-wise Time Series Manipulations group_by(id) %>% future_frame( .date_var = Date, .length_out = "3 months", .bind_data = TRUE ) %>% ungroup() %>% # Consolidate IDs mutate(id = fct_drop(id)) # Training Data data_prepared_tbl <- full_data_tbl %>% filter(!is.na(Weekly_Sales)) # Forecast Data future_tbl <- full_data_tbl %>% filter(is.na(Weekly_Sales)) walmart_tscv <- data_prepared_tbl %>% time_series_cv( date_var = Date, assess = "3 months", skip = "3 months", cumulative = TRUE, slice_limit = 6 ) walmart_tscv %>% tk_time_series_cv_plan() %>% plot_time_series_cv_plan(Date, Weekly_Sales, .facet_ncol = 2, .interactive = F) recipe_spec <- recipe(Weekly_Sales ~ ., data = training(walmart_tscv$splits[[1]])) %>% step_timeseries_signature(Date) %>% step_rm(matches("(.iso$)|(.xts$)|(day)|(hour)|(minute)|(second)|(am.pm)")) %>% step_mutate(Date_week = factor(Date_week, ordered = TRUE)) %>% step_dummy(all_nominal(), one_hot = TRUE) wflw_fit_xgboost <- workflow() %>% add_model( boost_tree() %>% set_engine("xgboost") ) %>% add_recipe(recipe_spec %>% step_rm(Date)) %>% fit(training(walmart_tscv$splits[[1]])) model_tbl <- modeltime_table( wflw_fit_xgboost ) resample_results <- model_tbl %>% modeltime_fit_resamples( resamples = walmart_tscv, control = control_resamples(verbose = FALSE) ) input <- walmart_tscv %>% filter(id == "Slice1") %>% pull(splits) %>% pluck(1) %>% training() input_2<-resample_results$.resample_results[[1]] %>% filter(id == "Slice1") %>% pull(splits) %>% pluck(1) %>% training() identical(input, input_2)

business-science / modeltime.resample

modeltime_fit_resamples() automatically 'lags' the .row column #10