business-science / modeltime

Modeltime unlocks time series forecast models and machine learning in one framework
https://business-science.github.io/modeltime/
Other
508 stars 80 forks source link

Add Conformal Prediction Intervals #173

Open tackes opened 2 years ago

mdancho84 commented 1 year ago

This is interesting. I'm going to look into.

mdancho84 commented 1 year ago

Doing a bit of research now:

tackes commented 1 year ago

The conformal package doesnt mention anything specific to Time Series, so I do not know if it can be adapted to work with TS or not.
There are python packages (MAPIE & SKTIME) which has implemented Conformal Predictions for TS. MAPIE and also SKTIME

mdancho84 commented 1 year ago

OK, I'm also looking into these references:

mdancho84 commented 11 months ago

Revisiting this. It turns out that the probably R package now has int_conformal_X functions that can be used with tidymodels.

mitokic commented 11 months ago

Yes that would be awesome! +1

marcozanotti commented 11 months ago

Yes! Amazing! +1

mdancho84 commented 11 months ago

Exploration of Conformal Intervals with Probably

Summary of Findings

  1. probably::int_conformal_split() function can be used to calculate conformal intervals using the split method.
  2. probably does not implement a conformal interval for a "refitted model".
  3. modeltime's internal workflow can repurpose the conformal intervals after refitting the models to improve forecast accuracy on future unseen data.

It's also worth mentioning that the global confidence intervals that Modeltime produces by default are similar to the default conformal split method's residuals. I didn't see a big difference in the Walmart data set forecast. If anything Modeltime is more conservative.

Analysis Reprex

# MODELTIME CONFORMAL INTERVAL EVALUATION ----
# Goal: Assess using probably R package for conformal interval

# SETUP ----

library(tidymodels)
library(probably)
#> 
#> Attaching package: 'probably'
#> The following objects are masked from 'package:base':
#> 
#>     as.factor, as.ordered
library(modeltime)
library(timetk)
library(tidyverse)
library(plotly)
#> 
#> Attaching package: 'plotly'
#> The following object is masked from 'package:ggplot2':
#> 
#>     last_plot
#> The following object is masked from 'package:stats':
#> 
#>     filter
#> The following object is masked from 'package:graphics':
#> 
#>     layout

walmart_sales_tbl <- timetk::walmart_sales_weekly %>%
    select(id, Date, Weekly_Sales) %>%
    mutate(id = as_factor(id))

walmart_sales_tbl %>%
    group_by(id) %>%
    plot_time_series(
        Date, Weekly_Sales, 
        .facet_ncol  = 2,
        .interactive = FALSE,
    )


splits <- time_series_split(
    walmart_sales_tbl,
    assess = "1 year",
    cumulative = TRUE
)
#> Using date_var: Date
#> Data is not ordered by the 'date_var'. Resamples will be arranged by `Date`.
#> Overlapping Timestamps Detected. Processing overlapping time series together using sliding windows.

new_data_tbl <- walmart_sales_tbl %>%
    group_by(id) %>%
    future_frame(.length_out = "1 year") %>%
    ungroup()
#> .date_var is missing. Using: Date

recipe_ml <- recipe(Weekly_Sales ~ ., training(splits)) %>%
    step_timeseries_signature(Date) %>%
    step_rm(Date) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

# recipe_ml %>% prep() %>% juice() %>% glimpse()

model_xgb <- boost_tree("regression") %>%
    set_engine("xgboost")

wflw_fit_xgb <- workflow() %>%
    add_model(model_xgb) %>%
    add_recipe(recipe_ml) %>%
    fit(training(splits))

# wflw_fit_xgb %>% predict(testing(splits))

# CONFORMAL SPLIT ----

wflw_fit_xgb_conformal_split <- int_conformal_split(
    wflw_fit_xgb,
    testing(splits)
)

test_predictions <- wflw_fit_xgb_conformal_split %>%
    predict(testing(splits), level = 0.95)

training(splits) %>%
    bind_rows(
        testing(splits) %>%
            bind_cols(test_predictions)
    ) %>%
    group_by(id) %>%

    ggplot(aes(Date, Weekly_Sales, group = id)) +
    geom_line() +
    facet_wrap(~ id, ncol = 2, scales = "free_y") +
    geom_ribbon(aes(ymin = .pred_lower, ymax = .pred_upper),
                alpha = 0.5) +
    geom_line(aes(y = .pred), color = "red") +
    theme_light()
#> Warning: Removed 637 rows containing missing values (`geom_line()`).


# How to refit?

# - Solution is to carry the CI forward

# COMPARE TO MODELTIME'S DEFAULT CONF INTERVALS ----

calib_tbl <- modeltime_table(
    wflw_fit_xgb
) %>%
    modeltime_calibrate(testing(splits))

forecast_tbl <- calib_tbl %>%
    modeltime_forecast(
        new_data      = testing(splits),
        actual_data   = walmart_sales_tbl,
        conf_interval = 0.95,
        keep_data     = TRUE
    )

forecast_tbl %>%
    group_by(id) %>%
    plot_modeltime_forecast(.facet_ncol = 2, .interactive = F)
#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf
#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf


future_forecast_tbl <- calib_tbl %>%
    modeltime_refit(walmart_sales_tbl) %>%
    modeltime_forecast(
        new_data      = new_data_tbl,
        actual_data   = walmart_sales_tbl,
        conf_interval = 0.95,
        keep_data     = TRUE
    )

future_forecast_tbl %>%
    group_by(id) %>%
    plot_modeltime_forecast(.facet_ncol = 2, .interactive = F)
#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf

#> Warning in max(ids, na.rm = TRUE): no non-missing arguments to max; returning
#> -Inf


# CONCLUSIONS ----

# 1. probably::int_conformal_split() function can be used to calculate
#    conformal intervals using the split method
# 2. probably does not implement a conformal interval for a refitted model
# 3. modeltime's internal workflow can be used to repurpose the conformal
#    intervals after refitting (not currently implemented in probably)

Created on 2023-08-03 with reprex v2.0.2

mdancho84 commented 11 months ago

Upon further inspection, the process that modeltime uses is different than what probably uses. Due to differences in the workflow, it seems that it's not as simple of an integration. However, the calculations can be used to replicate the int_conformal_split() function.

PROBLEM:

The processes for conformal predictions could be implemented.

Modeltime

  1. Compute predictions and residuals on test set during modeltime_calibrate()
  2. Use residuals to compute Confidence Interval during modeltime_forecast()

Probably

  1. Create a conformal predictor: int_conformal_split()
  2. Use the conformal predictor to compute CI: predict()

SOLUTION:

I can implement the prediction methodology used within int_conformal_split():

https://github.com/tidymodels/probably/blob/c46326651109fb2ebd1b3762b3cb086cfb96ac88/R/conformal_infer_split.R#L99

predict.int_conformal_split <- function(object, new_data, level = 0.95, ...) {
  check_data(new_data, object$wflow)
  rlang::check_dots_empty()

  new_pred <- predict(object$wflow, new_data)

  alpha <- 1 - level
  q_ind <- ceiling(level * (object$n + 1))
  q_val <- object$resid[q_ind]

  new_pred$.pred_lower <- new_pred$.pred - q_val
  new_pred$.pred_upper <- new_pred$.pred + q_val
  new_pred
}
mdancho84 commented 11 months ago

The Conformal Split Method has been integrated into modeltime_forecast(). You can see it in action here. Not a whole lot of difference between the default quantile method.

# MODELTIME CONFORMAL INTERVAL EVALUATION ----
# Goal: Assess using probably R package for conformal interval

# SETUP ----

library(tidymodels)
library(probably)
library(modeltime)
library(timetk)
library(tidyverse)

walmart_sales_tbl <- timetk::walmart_sales_weekly %>%
    select(id, Date, Weekly_Sales) %>%
    mutate(id = as_factor(id))

walmart_sales_tbl %>%
    group_by(id) %>%
    plot_time_series(
        Date, Weekly_Sales,
        .facet_ncol  = 2,
        .interactive = FALSE,
    )


splits <- time_series_split(
    walmart_sales_tbl,
    assess = "1 year",
    cumulative = TRUE
)
#> Using date_var: Date
#> Data is not ordered by the 'date_var'. Resamples will be arranged by `Date`.
#> Overlapping Timestamps Detected. Processing overlapping time series together using sliding windows.

new_data_tbl <- walmart_sales_tbl %>%
    group_by(id) %>%
    future_frame(.length_out = "1 year") %>%
    ungroup()
#> .date_var is missing. Using: Date

recipe_ml <- recipe(Weekly_Sales ~ ., training(splits)) %>%
    step_timeseries_signature(Date) %>%
    step_rm(Date) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

model_xgb <- boost_tree("regression") %>%
    set_engine("xgboost")

wflw_fit_xgb <- workflow() %>%
    add_model(model_xgb) %>%
    add_recipe(recipe_ml) %>%
    fit(training(splits))

calib_tbl <- modeltime_table(
    wflw_fit_xgb
) %>%
    modeltime_calibrate(testing(splits), id = "id")

# CONFORMAL (DEFAULT) ----

forecast_tbl <- calib_tbl %>%
    modeltime_forecast(
        new_data      = testing(splits),
        actual_data   = walmart_sales_tbl,
        conf_interval = 0.95,
        conf_method   = "conformal_default",
        keep_data     = TRUE
    )

forecast_tbl %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 2, 
        .interactive = F,
        .title       = "Conformal Default"
    )


# NEW CONFORMAL SPLIT ----

forecast_tbl <- calib_tbl %>%
    modeltime_forecast(
        new_data      = testing(splits),
        actual_data   = walmart_sales_tbl,
        conf_interval = 0.95,
        conf_method   = "conformal_split",
        keep_data     = TRUE
    )

forecast_tbl %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 2, 
        .interactive = FALSE,
        .title       = "Conformal Split"
    )


# FUTURE FORECAST ON REFITTED MODEL WITH CONFORMAL SPLIT PREDICTION ----

future_forecast_tbl <- calib_tbl %>%
    modeltime_refit(walmart_sales_tbl) %>%
    modeltime_forecast(
        new_data      = new_data_tbl,
        actual_data   = walmart_sales_tbl,
        conf_interval = 0.95,
        conf_method   = "conformal_split",
        # conf_by_id    = TRUE,
        keep_data     = TRUE
    )

future_forecast_tbl %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 2, 
        .interactive = F,
        .title       = "Conformal Split Method"
    )

Created on 2023-08-03 with reprex v2.0.2

mdancho84 commented 10 months ago

I've added a new vignette on Conformal Prediction Intervals in Modeltime. Please let me know if any changes are needed.

https://business-science.github.io/modeltime/articles/modeltime-conformal-prediction.html

mdancho84 commented 10 months ago

Nested Forecasting: Conformal Prediction Intervals (With Example)

I've updaded modeltime 1.2.8.9000 (development version) to now include Conformal Prediction Intervals and print display throughout the process for the confidence interval and method used.

Here's an example of Nested Forecasting with Conformal Prediction Intervals:

# NESTED FORECASTING: CONFORMAL PREDICTION METHOD AND INTERVAL INTEGRATION ----

library(tidymodels)
library(modeltime)
library(tidyverse)
library(timetk)

data_tbl <- walmart_sales_weekly %>%
    select(id, Date, Weekly_Sales) %>%
    set_names(c("id", "date", "value"))

# 1.0 NESTED STRUCTURE ----
nested_data_tbl <- data_tbl %>%
    extend_timeseries(
        .id_var        = id,
        .date_var      = date,
        .length_future = 52
    ) %>%
    nest_timeseries(
        .id_var        = id,
        .length_future = 52,
        .length_actual = 52*2
    ) %>%
    split_nested_timeseries(
        .length_test = 52
    )

nested_data_tbl
#> # A tibble: 7 x 4
#>   id    .actual_data       .future_data      .splits        
#>   <fct> <list>             <list>            <list>         
#> 1 1_1   <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]>
#> 2 1_3   <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]>
#> 3 1_8   <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]>
#> 4 1_13  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]>
#> 5 1_38  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]>
#> 6 1_93  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]>
#> 7 1_95  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]>

# 2.0 MODELS ----

rec_xgb <- recipe(value ~ ., extract_nested_train_split(nested_data_tbl)) %>%
    step_timeseries_signature(date) %>%
    step_rm(date) %>%
    step_zv(all_predictors()) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

wflw_xgb <- workflow() %>%
    add_model(boost_tree("regression") %>% set_engine("xgboost")) %>%
    add_recipe(rec_xgb)

# 3.0 CONFORMAL PREDICTION ----

nested_modeltime_tbl <-  nested_data_tbl %>%
    modeltime_nested_fit(

        # Add workflows
        wflw_xgb,

        conf_interval = 0.95,
        conf_method   = "conformal_split"
    )
#> Fitting models on training data... ====>-------------------------- 14% | ET...
#> Fitting models on training data... =========>--------------------- 29% | ET...
#> Fitting models on training data... =============>----------------- 43% | ET...
#> Fitting models on training data... =================>------------- 57% | ET...
#> Fitting models on training data... =====================>--------- 71% | ET...
#> Fitting models on training data... ==========================>---- 86% | ET...

# New: Conf Method and Interval NOW Shown in Print Output
nested_modeltime_tbl %>% extract_nested_test_forecast()
#> # Forecast Results
#> 
#> Trained on: .splits | Forecast Errors: [0] | Conf Method: conformal_split |
#> Conf Interval: 0.95
#> # A tibble: 1,092 x 8
#>    id    .model_id .model_desc .key   .index     .value .conf_lo .conf_hi
#>    <fct>     <int> <chr>       <fct>  <date>      <dbl>    <dbl>    <dbl>
#>  1 1_1          NA ACTUAL      actual 2010-11-05 34239.       NA       NA
#>  2 1_1          NA ACTUAL      actual 2010-11-12 19549.       NA       NA
#>  3 1_1          NA ACTUAL      actual 2010-11-19 19553.       NA       NA
#>  4 1_1          NA ACTUAL      actual 2010-11-26 18820.       NA       NA
#>  5 1_1          NA ACTUAL      actual 2010-12-03 22518.       NA       NA
#>  6 1_1          NA ACTUAL      actual 2010-12-10 31498.       NA       NA
#>  7 1_1          NA ACTUAL      actual 2010-12-17 44913.       NA       NA
#>  8 1_1          NA ACTUAL      actual 2010-12-24 55931.       NA       NA
#>  9 1_1          NA ACTUAL      actual 2010-12-31 19125.       NA       NA
#> 10 1_1          NA ACTUAL      actual 2011-01-07 15984.       NA       NA
#> # i 1,082 more rows

# Visualize Test Forecast
nested_modeltime_tbl %>% 
    extract_nested_test_forecast() %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 2,
        .interactive = FALSE
    )


# 4.0 SELECT BEST MODELS----

# Select Best Forecasters
best_nested_modeltime_tbl <- nested_modeltime_tbl %>% 
    modeltime_nested_select_best(metric = "rmse")

# New: Conf Method and Interval NOW Shown in Print Output
best_nested_modeltime_tbl
#> # Nested Modeltime Table
#> 
#> Trained on: .splits | Forecast Errors: [0] | Conf Method: conformal_split |
#> Conf Interval: 0.95
#> # A tibble: 7 x 5
#>   id    .actual_data       .future_data      .splits         .modeltime_tables 
#>   <fct> <list>             <list>            <list>          <list>            
#> 1 1_1   <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 2 1_3   <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 3 1_8   <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 4 1_13  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 5 1_38  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 6 1_93  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 7 1_95  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>

# 5.0 REFIT ----

# Refitting reuses conformal method and interval
nested_modeltime_refit_tbl <- best_nested_modeltime_tbl %>%
    modeltime_nested_refit()
#> Fitting models on actual data... ====>-------------------------- 14% | ETA:...
#> Fitting models on actual data... =========>--------------------- 29% | ETA:...
#> Fitting models on actual data... =============>----------------- 43% | ETA:...
#> Fitting models on actual data... =================>------------- 57% | ETA:...
#> Fitting models on actual data... =====================>--------- 71% | ETA:...
#> Fitting models on actual data... ==========================>---- 86% | ETA:...

# New: Conf Method and Interval NOW Shown in Print Output
nested_modeltime_refit_tbl
#> # Nested Modeltime Table
#> 
#> Trained on: .actual_data | Forecast Errors: [0] | Conf Method: conformal_split
#> | Conf Interval: 0.95
#> # A tibble: 7 x 5
#>   id    .actual_data       .future_data      .splits         .modeltime_tables 
#>   <fct> <list>             <list>            <list>          <list>            
#> 1 1_1   <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 2 1_3   <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 3 1_8   <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 4 1_13  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 5 1_38  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 6 1_93  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>
#> 7 1_95  <tibble [104 x 2]> <tibble [52 x 2]> <split [52|52]> <mdl_tm_t [1 x 5]>

# Nested future forecast reuses conf method and interval
nested_modeltime_refit_tbl %>% extract_nested_future_forecast()
#> # Forecast Results
#> 
#> Trained on: .actual_data | Forecast Errors: [0] | Conf Method: conformal_split
#> | Conf Interval: 0.95
#> # A tibble: 1,092 x 8
#>    id    .model_id .model_desc .key   .index     .value .conf_lo .conf_hi
#>    <fct>     <int> <chr>       <fct>  <date>      <dbl>    <dbl>    <dbl>
#>  1 1_1          NA ACTUAL      actual 2010-11-05 34239.       NA       NA
#>  2 1_1          NA ACTUAL      actual 2010-11-12 19549.       NA       NA
#>  3 1_1          NA ACTUAL      actual 2010-11-19 19553.       NA       NA
#>  4 1_1          NA ACTUAL      actual 2010-11-26 18820.       NA       NA
#>  5 1_1          NA ACTUAL      actual 2010-12-03 22518.       NA       NA
#>  6 1_1          NA ACTUAL      actual 2010-12-10 31498.       NA       NA
#>  7 1_1          NA ACTUAL      actual 2010-12-17 44913.       NA       NA
#>  8 1_1          NA ACTUAL      actual 2010-12-24 55931.       NA       NA
#>  9 1_1          NA ACTUAL      actual 2010-12-31 19125.       NA       NA
#> 10 1_1          NA ACTUAL      actual 2011-01-07 15984.       NA       NA
#> # i 1,082 more rows

# Visualize
nested_modeltime_refit_tbl %>%
    extract_nested_future_forecast() %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .interactive = FALSE,
        .facet_ncol  = 2
    )


# 6.0 FORECAST ADJUSTMENTS POST REFIT ----

# CHANGE CONF INTERVAL FROM 0.95 TO 0.80
# Modeltime Nested Forecast does NOT reuse Conf Interval and Method (user must specify) 
adjusting_conf_interval <- nested_modeltime_refit_tbl %>%
    modeltime_nested_forecast(
        h             = 52*2,             # CHANGE TO 2 YEARS
        conf_interval = 0.80,             # CHANGE TO 80% CI
        conf_method   = "conformal_split"
    ) 
#> Forecast predictions... =========>--------------------- 29% | ETA: 4s
#> Forecast predictions... =============>----------------- 43% | ETA: 3s
#> Forecast predictions... =================>------------- 57% | ETA: 2s
#> Forecast predictions... =====================>--------- 71% | ETA: 2s Forecast
#> predictions... ==========================>---- 86% | ETA: 1s

# New: Conf Method and Interval NOW Shown in Print Output
adjusting_conf_interval
#> # Forecast Results
#> 
#> Trained on: | Forecast Errors: [0] | Conf Method: conformal_split | Conf
#> Interval: 0.8
#> # A tibble: 1,456 x 8
#>    id    .model_id .model_desc .key   .index     .value .conf_lo .conf_hi
#>    <fct>     <int> <chr>       <fct>  <date>      <dbl>    <dbl>    <dbl>
#>  1 1_1          NA ACTUAL      actual 2010-11-05 34239.       NA       NA
#>  2 1_1          NA ACTUAL      actual 2010-11-12 19549.       NA       NA
#>  3 1_1          NA ACTUAL      actual 2010-11-19 19553.       NA       NA
#>  4 1_1          NA ACTUAL      actual 2010-11-26 18820.       NA       NA
#>  5 1_1          NA ACTUAL      actual 2010-12-03 22518.       NA       NA
#>  6 1_1          NA ACTUAL      actual 2010-12-10 31498.       NA       NA
#>  7 1_1          NA ACTUAL      actual 2010-12-17 44913.       NA       NA
#>  8 1_1          NA ACTUAL      actual 2010-12-24 55931.       NA       NA
#>  9 1_1          NA ACTUAL      actual 2010-12-31 19125.       NA       NA
#> 10 1_1          NA ACTUAL      actual 2011-01-07 15984.       NA       NA
#> # i 1,446 more rows

# Visualize
adjusting_conf_interval %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .interactive = FALSE,
        .facet_ncol  = 2
    )

Created on 2023-09-04 with reprex v2.0.2

mdancho84 commented 10 months ago

Testing the Updated Display of Confidence Method & Interval

I've attempted to make it more obvious of what Conformal Prediction method is being used, the confidence interval, and whether or not it's a global confidence or local confidence interval (i.e. takes into account the specific ID of the group).

Example

# CONFORMAL FORECASTING ----
# Goal: New integration of Conformal Prediction

# SETUP ----

library(tidymodels)
library(modeltime)
library(timetk)
library(tidyverse)

walmart_sales_tbl <- timetk::walmart_sales_weekly %>%
    select(id, Date, Weekly_Sales) %>%
    mutate(id = as_factor(id))

walmart_sales_tbl %>%
    group_by(id) %>%
    plot_time_series(
        Date, Weekly_Sales,
        .facet_ncol  = 2,
        .interactive = FALSE,
    )


splits <- time_series_split(
    walmart_sales_tbl,
    assess = "1 year",
    cumulative = TRUE
)
#> Using date_var: Date
#> Data is not ordered by the 'date_var'. Resamples will be arranged by `Date`.
#> Overlapping Timestamps Detected. Processing overlapping time series together using sliding windows.

new_data_tbl <- walmart_sales_tbl %>%
    group_by(id) %>%
    future_frame(.length_out = "1 year") %>%
    ungroup()
#> .date_var is missing. Using: Date

recipe_ml <- recipe(Weekly_Sales ~ ., training(splits)) %>%
    step_timeseries_signature(Date) %>%
    step_rm(Date) %>%
    step_dummy(all_nominal_predictors(), one_hot = TRUE)

model_xgb <- boost_tree("regression") %>%
    set_engine("xgboost")

wflw_fit_xgb <- workflow() %>%
    add_model(model_xgb) %>%
    add_recipe(recipe_ml) %>%
    fit(training(splits))

calib_tbl <- modeltime_table(
    wflw_fit_xgb
) %>%
    modeltime_calibrate(
        testing(splits), 
        id = "id"
)

# CONFORMAL SPLIT (GLOBAL CONFIDENCE) ----

# Global
test_forecast_global_tbl <- calib_tbl %>%
    modeltime_forecast(
        new_data      = testing(splits),
        actual_data   = walmart_sales_tbl,
        conf_interval = 0.95,
        conf_method   = "conformal_split",
        conf_by_id    = FALSE, # GLOBAL CONFIDENCE 
        keep_data     = TRUE
    )

# New: Displays Conf Method, Conf Interval and Conf by ID
test_forecast_global_tbl
#> # Forecast Results
#> 
#> Conf Method: conformal_split | Conf Interval: 0.95 | Conf By ID: FALSE (GLOBAL
#> CONFIDENCE)
#> # A tibble: 1,365 x 10
#>    .model_id .model_desc .key   .index     .value .conf_lo .conf_hi id   
#>        <int> <chr>       <fct>  <date>      <dbl>    <dbl>    <dbl> <fct>
#>  1        NA ACTUAL      actual 2010-02-05 24924.       NA       NA 1_1  
#>  2        NA ACTUAL      actual 2010-02-12 46039.       NA       NA 1_1  
#>  3        NA ACTUAL      actual 2010-02-19 41596.       NA       NA 1_1  
#>  4        NA ACTUAL      actual 2010-02-26 19404.       NA       NA 1_1  
#>  5        NA ACTUAL      actual 2010-03-05 21828.       NA       NA 1_1  
#>  6        NA ACTUAL      actual 2010-03-12 21043.       NA       NA 1_1  
#>  7        NA ACTUAL      actual 2010-03-19 22137.       NA       NA 1_1  
#>  8        NA ACTUAL      actual 2010-03-26 26229.       NA       NA 1_1  
#>  9        NA ACTUAL      actual 2010-04-02 57258.       NA       NA 1_1  
#> 10        NA ACTUAL      actual 2010-04-09 42961.       NA       NA 1_1  
#> # i 1,355 more rows
#> # i 2 more variables: Date <date>, Weekly_Sales <dbl>

# Visualize 
test_forecast_global_tbl %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 2, 
        .interactive = F,
        .title       = "Test Prediction | Conformal Split | Global Confidence"
    )


# LOCAL -----

# Global
test_forecast_local_tbl <- calib_tbl %>%
    modeltime_forecast(
        new_data      = testing(splits),
        actual_data   = walmart_sales_tbl,
        conf_interval = 0.95,
        conf_method   = "conformal_split",
        conf_by_id    = TRUE # LOCAL CONFIDENCE  
    )

# New: Displays Conf Method, Conf Interval and Conf by ID
test_forecast_local_tbl
#> # Forecast Results
#> 
#> Conf Method: conformal_split | Conf Interval: 0.95 | Conf By ID: TRUE (LOCAL
#> CONFIDENCE)
#> # A tibble: 1,365 x 8
#>    .model_id .model_desc .key   .index     .value .conf_lo .conf_hi id   
#>        <int> <chr>       <fct>  <date>      <dbl>    <dbl>    <dbl> <fct>
#>  1        NA ACTUAL      actual 2010-02-05 24924.       NA       NA 1_1  
#>  2        NA ACTUAL      actual 2010-02-12 46039.       NA       NA 1_1  
#>  3        NA ACTUAL      actual 2010-02-19 41596.       NA       NA 1_1  
#>  4        NA ACTUAL      actual 2010-02-26 19404.       NA       NA 1_1  
#>  5        NA ACTUAL      actual 2010-03-05 21828.       NA       NA 1_1  
#>  6        NA ACTUAL      actual 2010-03-12 21043.       NA       NA 1_1  
#>  7        NA ACTUAL      actual 2010-03-19 22137.       NA       NA 1_1  
#>  8        NA ACTUAL      actual 2010-03-26 26229.       NA       NA 1_1  
#>  9        NA ACTUAL      actual 2010-04-02 57258.       NA       NA 1_1  
#> 10        NA ACTUAL      actual 2010-04-09 42961.       NA       NA 1_1  
#> # i 1,355 more rows

# Visualize 
test_forecast_local_tbl %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 2, 
        .interactive = F,
        .title       = "Test Prediction | Conformal Split Method | Local Confidence"
    )


# FUTURE FORECAST ----

refit_tbl <- calib_tbl %>%
    modeltime_refit(walmart_sales_tbl)

# Global
future_forecast_global_tbl <- refit_tbl %>%
    modeltime_forecast(
        new_data      = new_data_tbl,
        actual_data   = walmart_sales_tbl,
        conf_interval = 0.95,
        conf_method   = "conformal_split",
        conf_by_id    = FALSE, # GLOBAL CONFIDENCE
        keep_data     = TRUE
    )

future_forecast_global_tbl
#> # Forecast Results
#> 
#> Conf Method: conformal_split | Conf Interval: 0.95 | Conf By ID: FALSE (GLOBAL
#> CONFIDENCE)
#> # A tibble: 1,365 x 10
#>    .model_id .model_desc .key   .index     .value .conf_lo .conf_hi id   
#>        <int> <chr>       <fct>  <date>      <dbl>    <dbl>    <dbl> <fct>
#>  1        NA ACTUAL      actual 2010-02-05 24924.       NA       NA 1_1  
#>  2        NA ACTUAL      actual 2010-02-12 46039.       NA       NA 1_1  
#>  3        NA ACTUAL      actual 2010-02-19 41596.       NA       NA 1_1  
#>  4        NA ACTUAL      actual 2010-02-26 19404.       NA       NA 1_1  
#>  5        NA ACTUAL      actual 2010-03-05 21828.       NA       NA 1_1  
#>  6        NA ACTUAL      actual 2010-03-12 21043.       NA       NA 1_1  
#>  7        NA ACTUAL      actual 2010-03-19 22137.       NA       NA 1_1  
#>  8        NA ACTUAL      actual 2010-03-26 26229.       NA       NA 1_1  
#>  9        NA ACTUAL      actual 2010-04-02 57258.       NA       NA 1_1  
#> 10        NA ACTUAL      actual 2010-04-09 42961.       NA       NA 1_1  
#> # i 1,355 more rows
#> # i 2 more variables: Date <date>, Weekly_Sales <dbl>

future_forecast_global_tbl %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 2, 
        .interactive = F,
        .title       = "Future Prediction | Conformal Split Method | Global Confidence"
    )


# Local
future_forecast_local_tbl <- refit_tbl %>%
    modeltime_forecast(
        new_data      = new_data_tbl,
        actual_data   = walmart_sales_tbl,
        conf_interval = 0.95,
        conf_method   = "conformal_split",
        conf_by_id    = TRUE # LOCAL CONFIDENCE  
    )

future_forecast_local_tbl
#> # Forecast Results
#> 
#> Conf Method: conformal_split | Conf Interval: 0.95 | Conf By ID: TRUE (LOCAL
#> CONFIDENCE)
#> # A tibble: 1,365 x 8
#>    .model_id .model_desc .key   .index     .value .conf_lo .conf_hi id   
#>        <int> <chr>       <fct>  <date>      <dbl>    <dbl>    <dbl> <fct>
#>  1        NA ACTUAL      actual 2010-02-05 24924.       NA       NA 1_1  
#>  2        NA ACTUAL      actual 2010-02-12 46039.       NA       NA 1_1  
#>  3        NA ACTUAL      actual 2010-02-19 41596.       NA       NA 1_1  
#>  4        NA ACTUAL      actual 2010-02-26 19404.       NA       NA 1_1  
#>  5        NA ACTUAL      actual 2010-03-05 21828.       NA       NA 1_1  
#>  6        NA ACTUAL      actual 2010-03-12 21043.       NA       NA 1_1  
#>  7        NA ACTUAL      actual 2010-03-19 22137.       NA       NA 1_1  
#>  8        NA ACTUAL      actual 2010-03-26 26229.       NA       NA 1_1  
#>  9        NA ACTUAL      actual 2010-04-02 57258.       NA       NA 1_1  
#> 10        NA ACTUAL      actual 2010-04-09 42961.       NA       NA 1_1  
#> # i 1,355 more rows

future_forecast_local_tbl %>%
    group_by(id) %>%
    plot_modeltime_forecast(
        .facet_ncol  = 2, 
        .interactive = F,
        .title       = "Future Prediction | Conformal Split Method | Local Confidence"
    )

Created on 2023-09-04 with reprex v2.0.2