business-science / modeltime.resample

Resampling Tools for Time Series Forecasting with Modeltime
https://business-science.github.io/modeltime.resample/
Other
19 stars 5 forks source link

modeltime_fit_resamples() automatically 'lags' the .row column #10

Open luzbarbosa opened 2 years ago

luzbarbosa commented 2 years ago

The out-of-sample predictions generated by modeltime_fit_resamples() incorrectly show the predicted variable as if it was lagged.

When I run the following code it outputs out-of-sample projections, as expected :

  submodels_resamples_tscv_tbl <- submodels_tbl %>% 
    modeltime_fit_resamples(
      resamples    = cv_resamples
    )

However, a closer inspection of the output provided by modeltime_fit_resamples() shows that the predicted variable is not indexed by the same ".row" (id column) as it was inside "cv_resamples"


# Input received by modeltime_fit_resamples()
input <- cv_resamples %>%
      filter(id == "Slice01") %>% 
      pull(splits) %>%
      first() %>% 
      training() %>% 
      select(.row, value)

input %>% filter(.row >= 1816, .row <= 1825 ) %>% print()

image

# Output generated by modeltime_fit_resamples()

output <- submodels_resamples_tscv_tbl %>% 
    select(.model_desc, .resample_results) %>% 
    unnest(.resample_results) %>% 
    select(.model_desc, id, .predictions) %>% 
    unnest(.predictions)

output %>% select(.row, value) %>%  filter(.row >= 1816, .row <= 1825 ) %>% print()

image

If we plot those variables, we can clearly see that the test set was lagged by a few days. The first slice inside 'cv_resamples' includes the test and the training set, both depicted in blue. The output, includes the out-of-sample projections for the test set but we ignore those and plot only the realized value of the target variable, in red. I include the code below just for completeness.

ggplot() +
      geom_line(data = output, aes(x = .row, y = value, color = "input")) +
      geom_line(data = input,  aes(x = .row, y = value, color = "output")) 

image

AlbertoAlmuinha commented 2 years ago

Hi @luzbarbosa ,

Could you provide a reproducible example?

It is very complicated to be able to help you without knowing how you have generated the resamples or how you have reached those results.

In the tests that I have done, I do not see that the variable is lagging and both show that they are identical. Here is the code I used to test it in case it could be of any use to you:

library(tidymodels)
library(modeltime)
library(modeltime.resample)
library(timetk)
library(tidyverse)
library(tidyquant)

full_data_tbl <- walmart_sales_weekly %>%
  select(id, Date, Weekly_Sales) %>%

  # Apply Group-wise Time Series Manipulations
  group_by(id) %>%
  future_frame(
    .date_var   = Date,
    .length_out = "3 months",
    .bind_data  = TRUE
  ) %>%
  ungroup() %>%

  # Consolidate IDs
  mutate(id = fct_drop(id))

# Training Data
data_prepared_tbl <- full_data_tbl %>%
  filter(!is.na(Weekly_Sales))

# Forecast Data
future_tbl <- full_data_tbl %>%
  filter(is.na(Weekly_Sales))

walmart_tscv <- data_prepared_tbl %>%
  time_series_cv(
    date_var    = Date, 
    assess      = "3 months",
    skip        = "3 months",
    cumulative  = TRUE,
    slice_limit = 6
  )

walmart_tscv %>%
  tk_time_series_cv_plan() %>%
  plot_time_series_cv_plan(Date, Weekly_Sales, 
                           .facet_ncol = 2, .interactive = F)

recipe_spec <- recipe(Weekly_Sales ~ ., 
                      data = training(walmart_tscv$splits[[1]])) %>%
  step_timeseries_signature(Date) %>%
  step_rm(matches("(.iso$)|(.xts$)|(day)|(hour)|(minute)|(second)|(am.pm)")) %>%
  step_mutate(Date_week = factor(Date_week, ordered = TRUE)) %>%
  step_dummy(all_nominal(), one_hot = TRUE)

wflw_fit_xgboost <- workflow() %>%
  add_model(
    boost_tree() %>% set_engine("xgboost") 
  ) %>%
  add_recipe(recipe_spec %>% step_rm(Date)) %>%
  fit(training(walmart_tscv$splits[[1]]))

model_tbl <- modeltime_table(
  wflw_fit_xgboost
)

resample_results <- model_tbl %>%
  modeltime_fit_resamples(
    resamples = walmart_tscv,
    control   = control_resamples(verbose = FALSE)
  )

input <- walmart_tscv %>%
  filter(id == "Slice1") %>% 
  pull(splits) %>%
  pluck(1) %>% 
  training() 

input_2<-resample_results$.resample_results[[1]] %>% 
  filter(id == "Slice1") %>% 
  pull(splits) %>%
  pluck(1) %>% 
  training() 

identical(input, input_2)

In case you need more help, please send a reprex.

Regards,