Closed: Teett closed this issue 3 years ago

Hi. I've been experimenting with the new recursive() function quite a lot for panel data, and I think the calibration process might not be working properly when using a recipe specification (maybe I'm doing it incorrectly, though). I'm leaving a reproducible example here; an error appears during the calibration step.

I've also noticed that the .model in models_tbl changes from <fit[+]> when using a workflow with a recipe. Maybe this is a clue?
From what I can tell we are losing the "id" column somewhere along the way. Will need to investigate further.
Hi @mdancho84,

The problem is in the predict_recursive_panel_workflow() function; the issue comes from these lines:

blueprint <- workflow$pre$mold$blueprint
forged   <- hardhat::forge(new_data, blueprint)
new_data <- forged$predictors

The root cause is that the recipe eliminates the "id" variable through step_dummy(all_nominal(), one_hot = TRUE), which converts id into id_M1, id_M1000, etc.
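To illustrate, here is a minimal sketch with made-up data (not the package internals) showing that once the blueprint is forged, the predictors contain only the one-hot columns and the original "id" is gone:

library(recipes)
library(hardhat)

toy <- tibble::tibble(
    id    = factor(c("M1", "M1000")),
    value = c(1.5, 2.5)
)

rec <- recipe(value ~ ., data = toy) %>%
    step_dummy(all_nominal(), one_hot = TRUE)

molded <- hardhat::mold(rec, toy)               # prep the recipe, build the blueprint
forged <- hardhat::forge(toy, molded$blueprint) # the same call the predict method makes

names(forged$predictors)
#> [1] "id_M1" "id_M1000"   # no "id" column survives the forge step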
We need to think of a way to deal with this situation.
Regards,
Great observation @AlbertoAlmuinha. The latest commit should fix it. The problem is that xgboost is really picky about how the columns are provided.
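For instance, here is a toy sketch (made-up matrices, not from the issue) of that pickiness; the comments flag the version-dependent behavior:

library(xgboost)

train_mat <- as.matrix(data.frame(a = 1:10, b = 10:1))
bst <- xgboost(data = train_mat, label = 1:10 / 10, nrounds = 2, verbose = 0)

# Same columns in a different order: depending on the xgboost version this
# either errors with a feature-name mismatch or is matched by position,
# silently producing wrong predictions.
new_mat <- as.matrix(data.frame(b = 10:1, a = 1:10))
predict(bst, new_mat)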
Here's a working example:
# Libraries & Setup ----
library(modeltime)
library(tidymodels)
library(tidyverse)
library(lubridate)
library(timetk)
library(slider)
FORECAST_HORIZON <- 24
m4_extended <- m4_monthly %>%
    group_by(id) %>%
    future_frame(
        .length_out = FORECAST_HORIZON,
        .bind_data  = TRUE
    ) %>%
    ungroup()
#> .date_var is missing. Using: date
# TRANSFORM FUNCTION ----
# - NOTE - We create lags by group
lag_roll_transformer_grouped <- function(data) {
    data %>%
        group_by(id) %>%
        tk_augment_lags(value, .lags = 1:FORECAST_HORIZON) %>%
        tk_augment_slidify(
            .value   = contains("lag12"),
            .f       = ~ mean(.x, na.rm = TRUE),
            .period  = c(12, 24, 36),
            .partial = TRUE
        ) %>%
        ungroup()
}
m4_lags <- m4_extended %>%
    lag_roll_transformer_grouped()

train_data <- m4_lags %>%
    drop_na()

future_data <- m4_lags %>%
    filter(is.na(value))

splits <- train_data %>%
    time_series_split(date, assess = FORECAST_HORIZON, cumulative = TRUE)
xgb_spec <- boost_tree(
    mode       = "regression",
    learn_rate = 0.35
) %>%
    set_engine("xgboost")

recipe_spec <- recipe(value ~ ., data = training(splits)) %>%
    step_timeseries_signature(date) %>%
    step_rm(matches("(.xts$)|(.iso$)|(hour)|(minute)|(second)|(am.pm)")) %>%
    step_rm(date) %>%
    step_normalize(date_index.num) %>%
    step_dummy(all_nominal(), one_hot = TRUE)
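# NOTE: step_dummy(one_hot = TRUE) turns the "id" factor into id_M1, id_M1000,
# etc., which is exactly what removed the "id" column during prediction
# before the fix.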
# Modeling Autoregressive Panel Data
set.seed(123)
model_fit_xgb_recursive <- workflow() %>%
    add_model(xgb_spec) %>%
    add_recipe(recipe_spec) %>%
    fit(training(splits)) %>%
    recursive(
        id         = "id", # We add id = "id" to specify the groups
        transform  = lag_roll_transformer_grouped,
        # We use panel_tail() to grab the tail by group
        train_tail = panel_tail(training(splits), id, FORECAST_HORIZON)
    )
# Models table
models_tbl <- modeltime_table(
    model_fit_xgb_recursive
)

# Calibrate
calibration_tbl <- models_tbl %>%
    modeltime_calibrate(new_data = testing(splits), quiet = FALSE)

calibration_tbl %>%
    modeltime_forecast(
        new_data    = testing(splits),
        actual_data = m4_lags %>% drop_na(),
        keep_data   = TRUE
    ) %>%
    group_by(id) %>%
    plot_modeltime_forecast(.interactive = FALSE, .facet_ncol = 2)
Created on 2021-03-22 by the reprex package (v1.0.0)
I believe this is resolved. I'm closing now.