business-science / modeltime

Modeltime unlocks time series forecast models and machine learning in one framework
https://business-science.github.io/modeltime/
Other
515 stars 78 forks source link

Recursive forecasting #49

Closed krzjoa closed 3 years ago

krzjoa commented 3 years ago

Among the popular strategies of time series modelling, we can mention regression models with lagged variables. Such variables are often created by shifting our dependent variable(s). It makes us use some special approches to deal with lags in new data, i. a. recursive forecasting. As far as I know, currently there is no widely used R library, which deliver that feature. Interestingly, @edgBR in #5 names fable and modeltime as packages for recursive forecasting, which indicates he probably used some other sense of this notion.

I've written this issue, because I started working on an add-in for tidymodels/modeltime ecosystem, which facilitates turning regular regression models into recursive ones. Proposed API may look as follows:

library(parsnip)
library(recipes)

dax_stock <- 
  as_tibble(EuStockMarkets) %>% 
  select(DAX) %>% 
  bind_rows(tibble(DAX = rep(NA, 30)))

recipe_dax_stock <-
  recipe(DAX ~ ., data = dax_stock) %>% 
  step_lag(all_outcomes(), lag = 1:5) %>% 
  prep()

model_linear <- 
  linear_reg() %>% 
  set_engine("lm") %>% 
  fit(DAX ~ ., data = dax_stock)

# Here, we add recursion to the model
# We pass recipe to re-generate new data after each step
# We get a model with additional class, say: 'recursive'
recursive_linear <- 
  model_linear %>% 
  recursive(recipe_dax_stock)

# predict.recursive, which internally calls predict.model_fit
recursive_linear %>% 
  predict(new_data)

Obviously, there is a couple of places, where the implementaions should be well thought out. I can elaborate it later if needed.

After this longish introduction, I would like to ask: Would you be interested in including recursive forecasting into modeltime or it lies outside the scope of this great library?

mdancho84 commented 3 years ago

Yes - This is something needed. I'm not sure how best to handle the recursion since most models (e.g. arima) handle the lag process internally. Do you have any code to share?

krzjoa commented 3 years ago

I do. 😎 My PoC looks like that:

library(dplyr)
library(parsnip)
library(recipes)
library(timetk)

dax_stock <- 
  as_tibble(EuStockMarkets) %>% 
  select(DAX) %>% 
  bind_rows(tibble(DAX = rep(NA, 30))) # Adding new data

recipe_dax_stock <-
  recipe(DAX ~ ., data = dax_stock) %>% 
  step_lag(all_outcomes(), lag = 1:10) %>% 
  prep() 

dax_stock_m <- 
  juice(recipe_dax_stock)

train_data <- 
  dax_stock_m %>% 
  filter(!is.na(DAX)) %>% 
  na.omit()

new_data <- 
  dax_stock_m %>% 
  filter(is.na(DAX))

#' Create a recursive model for time series forecast 
#' from a parsnip regression model
recursive <- function(object, recipe, train_tail, ...){
  object$spec[["forecast"]] <- "recursive"
  object$spec[["recipe"]] <- recipe
  object$spec[["train_tail"]] <- train_tail
  .class <- class(object)
  class(object) <- c(.class[1], "recursive", .class[2])
  object
}

predict.recursive <- function(object, new_data, type = NULL, opts = list(), ...){

  .recipe <- object$spec[["recipe"]]

  .derived_features <- 
    .recipe$term_info %>% 
    filter(source == "derived") %>% 
    .$variable

  .preds <- 
    tibble(.pred = numeric(nrow(new_data)))

  .first_slice <- 
    new_data %>% 
    slice_head(1)

  .preds[1,] <- new_data[1, object$preproc$y_var] <- 
    predict.model_fit(
      object, new_data = .first_slice, 
      type = type, opts = opts, ...
    )

  for (i in 2:nrow(.preds)) {

    .temp_new_data <- 
      bind_rows(
        object$spec$train_tail,
        new_data
      ) %>% 
      select(-!!.derived_features)

    .nth_slice <- 
      bake(.recipe, new_data = .temp_new_data) %>% 
      slice_tail(n = nrow(new_data)) %>% 
      .[i, ]

    .preds[i,] <- new_data[i, object$preproc$y_var] <- 
      predict.model_fit(
        object, new_data = .nth_slice, 
        type = type, opts = opts, ...
      )
  }
  .preds
}

model_linear <- 
  linear_reg() %>% 
  set_engine("lm") %>% 
  fit(DAX ~ ., data = train_data)

recursive_linear <- 
  model_linear %>% 
  recursive(recipe_dax_stock,
            train_tail = tail(train_data, 10))

pred <- recursive_linear %>% 
  predict(new_data)

Some remarks:

mdancho84 commented 3 years ago

Awesome! This looks very cool. =)

Just a quick point about my bandwidth - I'm working on another project at the moment, so I won't be able to work on this for the next few weeks.

Any help here will be much appreciated. I'll certainly be able to provide feedback/guidance as necessary to keep you moving forward. Love the work so far. Very exciting.

krzjoa commented 3 years ago

I've added a pull request #50 .

edgBR commented 3 years ago

Hi colleagues!

This looks great!!!

Very similar to what I implemented I think that the only difference is that I was using time_tk::future_frame to generate the new data. It might be worth to take a look of how to speed up the recursive part, maybe some of the ideas of: https://github.com/mailund/tailr is useful here.

If you have the need to have multiple models by different forecasting horizons I also recommend you to look to forecastML. Very nice package for direct forecasting.

BR /Edgar

mdancho84 commented 3 years ago

This is awesome. Great work @krzjoa! Thanks so much for your hard work. Well done.

I'm going to work on upgrading the functionality a bit so we can use for workflows, but other than that, I think it's good to go.

Example.

library(modeltime)
library(tidymodels)
library(tidyverse)
library(lubridate)
library(timetk)

m750
#> # A tibble: 306 x 3
#>    id    date       value
#>    <fct> <date>     <dbl>
#>  1 M750  1990-01-01  6370
#>  2 M750  1990-02-01  6430
#>  3 M750  1990-03-01  6520
#>  4 M750  1990-04-01  6580
#>  5 M750  1990-05-01  6620
#>  6 M750  1990-06-01  6690
#>  7 M750  1990-07-01  6000
#>  8 M750  1990-08-01  5450
#>  9 M750  1990-09-01  6480
#> 10 M750  1990-10-01  6820
#> # … with 296 more rows

FORECAST_HORIZON <- 24

m750_extended <- m750 %>%
    group_by(id) %>%
    future_frame(
        .length_out = FORECAST_HORIZON,
        .bind_data  = TRUE
    )
#> .date_var is missing. Using: date

recipe_lag <- recipe(value ~ date, m750_extended) %>%
    step_lag(value, lag = 1:12) 

m750_lagged <- recipe_lag %>% prep() %>% juice()

m750_lagged
#> # A tibble: 330 x 14
#>    date       value lag_1_value lag_2_value lag_3_value lag_4_value lag_5_value
#>    <date>     <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
#>  1 1990-01-01  6370          NA          NA          NA          NA          NA
#>  2 1990-02-01  6430        6370          NA          NA          NA          NA
#>  3 1990-03-01  6520        6430        6370          NA          NA          NA
#>  4 1990-04-01  6580        6520        6430        6370          NA          NA
#>  5 1990-05-01  6620        6580        6520        6430        6370          NA
#>  6 1990-06-01  6690        6620        6580        6520        6430        6370
#>  7 1990-07-01  6000        6690        6620        6580        6520        6430
#>  8 1990-08-01  5450        6000        6690        6620        6580        6520
#>  9 1990-09-01  6480        5450        6000        6690        6620        6580
#> 10 1990-10-01  6820        6480        5450        6000        6690        6620
#> # … with 320 more rows, and 7 more variables: lag_6_value <dbl>,
#> #   lag_7_value <dbl>, lag_8_value <dbl>, lag_9_value <dbl>,
#> #   lag_10_value <dbl>, lag_11_value <dbl>, lag_12_value <dbl>

train_data <- m750_lagged %>%
    filter(!is.na(value))

future_data <- m750_lagged %>%
    filter(is.na(value))

model_fit_lm <- linear_reg() %>%
    set_engine("lm") %>%
    fit(value ~ date, data = train_data)

model_fit_lm_recursive <- linear_reg() %>%
    set_engine("lm") %>%
    fit(value ~ ., data = train_data) %>%
    recursive(
        transform  = recipe_lag,
        train_tail = tail(train_data, 12)
    )

model_fit_lm_recursive
#> parsnip model object
#> 
#> Fit time:  2ms 
#> 
#> Call:
#> stats::lm(formula = value ~ ., data = data)
#> 
#> Coefficients:
#>  (Intercept)          date   lag_1_value   lag_2_value   lag_3_value  
#>   301.639563     -0.003014      0.442504     -0.078007     -0.035787  
#>  lag_4_value   lag_5_value   lag_6_value   lag_7_value   lag_8_value  
#>     0.024823      0.057506     -0.048502      0.049845     -0.068797  
#>  lag_9_value  lag_10_value  lag_11_value  lag_12_value  
#>    -0.019639      0.016546      0.110980      0.531658

modeltime_table(
    model_fit_lm,
    model_fit_lm_recursive
) %>%
    modeltime_forecast(
        new_data    = future_data,
        actual_data = m750,
        keep_data   = TRUE
    ) %>%
    plot_modeltime_forecast()

Created on 2020-11-19 by the reprex package (v0.3.0)

mdancho84 commented 3 years ago

Updates

I'm happy with how recursive() has turned out. I've been pretty successful with testing it. I have a few updates:

Examples

The examples illustrate several of the key features of recursive. This is the same example that is in the documentation now. The example requires timetk >= 2.6.0 (just submitted to CRAN) along with the dev version of modeltime.

# Libraries & Setup ----
library(modeltime)
library(tidymodels)
library(tidyverse)
library(lubridate)
library(timetk)
library(slider)

m750
#> # A tibble: 306 x 3
#>    id    date       value
#>    <fct> <date>     <dbl>
#>  1 M750  1990-01-01  6370
#>  2 M750  1990-02-01  6430
#>  3 M750  1990-03-01  6520
#>  4 M750  1990-04-01  6580
#>  5 M750  1990-05-01  6620
#>  6 M750  1990-06-01  6690
#>  7 M750  1990-07-01  6000
#>  8 M750  1990-08-01  5450
#>  9 M750  1990-09-01  6480
#> 10 M750  1990-10-01  6820
#> # … with 296 more rows

FORECAST_HORIZON <- 24

m750_extended <- m750 %>%
    group_by(id) %>%
    future_frame(
        .length_out = FORECAST_HORIZON,
        .bind_data  = TRUE
    ) %>%
    ungroup()
#> .date_var is missing. Using: date

# METHOD 1: RECIPE ----
# - Used for recursive transformations via recipe prepeocessing steps

# Lag Recipe
recipe_lag <- recipe(value ~ date, m750_extended) %>%
    step_lag(value, lag = 1:FORECAST_HORIZON)

# Data Preparation
m750_lagged <- recipe_lag %>% prep() %>% juice()

m750_lagged
#> # A tibble: 330 x 26
#>    date       value lag_1_value lag_2_value lag_3_value lag_4_value lag_5_value
#>    <date>     <dbl>       <dbl>       <dbl>       <dbl>       <dbl>       <dbl>
#>  1 1990-01-01  6370          NA          NA          NA          NA          NA
#>  2 1990-02-01  6430        6370          NA          NA          NA          NA
#>  3 1990-03-01  6520        6430        6370          NA          NA          NA
#>  4 1990-04-01  6580        6520        6430        6370          NA          NA
#>  5 1990-05-01  6620        6580        6520        6430        6370          NA
#>  6 1990-06-01  6690        6620        6580        6520        6430        6370
#>  7 1990-07-01  6000        6690        6620        6580        6520        6430
#>  8 1990-08-01  5450        6000        6690        6620        6580        6520
#>  9 1990-09-01  6480        5450        6000        6690        6620        6580
#> 10 1990-10-01  6820        6480        5450        6000        6690        6620
#> # … with 320 more rows, and 19 more variables: lag_6_value <dbl>,
#> #   lag_7_value <dbl>, lag_8_value <dbl>, lag_9_value <dbl>,
#> #   lag_10_value <dbl>, lag_11_value <dbl>, lag_12_value <dbl>,
#> #   lag_13_value <dbl>, lag_14_value <dbl>, lag_15_value <dbl>,
#> #   lag_16_value <dbl>, lag_17_value <dbl>, lag_18_value <dbl>,
#> #   lag_19_value <dbl>, lag_20_value <dbl>, lag_21_value <dbl>,
#> #   lag_22_value <dbl>, lag_23_value <dbl>, lag_24_value <dbl>

train_data <- m750_lagged %>%
    filter(!is.na(value)) %>%
    drop_na()

future_data <- m750_lagged %>%
    filter(is.na(value))

# Modeling
model_fit_lm <- linear_reg() %>%
    set_engine("lm") %>%
    fit(value ~ date, data = train_data)

model_fit_lm_recursive <- linear_reg() %>%
    set_engine("lm") %>%
    fit(value ~ ., data = train_data) %>%
    recursive(
        transform  = recipe_lag,
        train_tail = tail(train_data, FORECAST_HORIZON)
    )

model_fit_lm_recursive
#> Recursive [parsnip model]
#> 
#> parsnip model object
#> 
#> Fit time:  4ms 
#> 
#> Call:
#> stats::lm(formula = value ~ ., data = data)
#> 
#> Coefficients:
#>  (Intercept)          date   lag_1_value   lag_2_value   lag_3_value  
#>    164.14732       0.00677       0.61244       0.18402      -0.07128  
#>  lag_4_value   lag_5_value   lag_6_value   lag_7_value   lag_8_value  
#>      0.12089      -0.01750       0.07095       0.09785      -0.08053  
#>  lag_9_value  lag_10_value  lag_11_value  lag_12_value  lag_13_value  
#>      0.04887       0.03030      -0.01755       0.73318      -0.52958  
#> lag_14_value  lag_15_value  lag_16_value  lag_17_value  lag_18_value  
#>     -0.21410       0.07734      -0.13879       0.04351      -0.08894  
#> lag_19_value  lag_20_value  lag_21_value  lag_22_value  lag_23_value  
#>     -0.08732       0.06641      -0.05737      -0.02331       0.05754  
#> lag_24_value  
#>      0.15960

# Forecasting
modeltime_table(
    model_fit_lm,
    model_fit_lm_recursive
) %>%
    update_model_description(2, "LM - Lag Roll") %>%
    modeltime_forecast(
        new_data    = future_data,
        actual_data = m750,
        keep_data   = TRUE
    ) %>%
    plot_modeltime_forecast(
        .interactive        = FALSE,
        .conf_interval_show = FALSE
    )


# METHOD 2: TRANSFORM FUNCTION ----
# - Used for complex transformations via transformation function

# Function run recursively that updates the forecasted dataset
lag_roll_transformer <- function(data){
    data %>%
        # Lags
        tk_augment_lags(value, .lags = 1:12) %>%
        # Rolling Features
        mutate(rolling_mean_12 = lag(slide_dbl(
            value, .f = mean, .before = 12, .complete = FALSE
        ), 1))
}

# Data Preparation
m750_rolling <- m750_extended %>%
    lag_roll_transformer() %>%
    select(-id)

train_data <- m750_rolling %>%
    filter(!is.na(value)) %>%
    drop_na()

future_data <- m750_rolling %>%
    filter(is.na(value))

# Modeling
model_fit_lm <- linear_reg() %>%
    set_engine("lm") %>%
    fit(value ~ date, data = train_data)

model_fit_lm_recursive <- linear_reg() %>%
    set_engine("lm") %>%
    fit(value ~ ., data = train_data) %>%
    recursive(
        transform  = lag_roll_transformer,
        train_tail = tail(train_data, FORECAST_HORIZON)
    )

model_fit_lm_recursive
#> Recursive [parsnip model]
#> 
#> parsnip model object
#> 
#> Fit time:  2ms 
#> 
#> Call:
#> stats::lm(formula = value ~ ., data = data)
#> 
#> Coefficients:
#>     (Intercept)             date       value_lag1       value_lag2  
#>       147.32008          0.01273          1.59298          0.76666  
#>      value_lag3       value_lag4       value_lag5       value_lag6  
#>         0.73081          0.76950          0.76871          0.74755  
#>      value_lag7       value_lag8       value_lag9      value_lag10  
#>         0.77872          0.72985          0.75257          0.76582  
#>     value_lag11      value_lag12  rolling_mean_12  
#>         0.79979          1.62469         -9.85822

# Forecasting
modeltime_table(
    model_fit_lm,
    model_fit_lm_recursive
) %>%
    update_model_description(2, "LM - Lag Roll") %>%
    modeltime_forecast(
        new_data    = future_data,
        actual_data = m750
    ) %>%
    plot_modeltime_forecast(
        .interactive        = FALSE,
        .conf_interval_show = FALSE
    )

Created on 2020-11-21 by the reprex package (v0.3.0)