JoaquinAmatRodrigo / skforecast

Time series forecasting with machine learning models
https://skforecast.org
BSD 3-Clause "New" or "Revised" License

Can XGBRegressor predict zero output based on an external variable? #389

Closed · JorgeGomes72 closed this issue 1 year ago

JorgeGomes72 commented 1 year ago

Hello, I'm building a time series model to forecast sales based on history since 2019. My model uses skforecast with XGBRegressor.

I want to forecast 75 days. My target is SALES.

I use external features to help the model, encoded as 0/1.

I would like to understand why my final forecast has non-zero values for the target variable SALES on Sundays, even though I use the external feature OPEN=0 and SALES=0 on every Sunday in the history.

My dataset has this structure: DATA | SALES | YEAR | WEEK | WEEKDAY | OPEN

Example: [screenshot of sample rows]

The variable WEEKDAY=7 means Sunday. In the dataset, every Sunday has SALES=0 and OPEN=0.

The external feature OPEN=0 means the store is closed; OPEN=1 means the store is open.
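(For context, 0/1 dummy columns like these are typically produced with pandas; a hypothetical sketch, not necessarily the exact code used here:)

```python
import pandas as pd

# Hypothetical: one-hot encode the calendar features into 0/1 columns,
# producing names like YEAR_2019, WEEK_1, WEEKDAY_7 as seen later in this thread.
vendas_df2 = pd.get_dummies(vendas_df2, columns=['YEAR', 'WEEK', 'WEEKDAY'])
```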

This is my final dataset (vendas_df2) before running the model:

[screenshot of vendas_df2]

The exog_variables list contains all the external features except the target variable (SALES).
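For reference, this is how the final script at the end of the thread builds that list:

```python
# All one-hot/external columns are used as exogenous features (target excluded)
exog_variables = [column for column in vendas_df2.columns
                  if column.startswith(('YEAR', 'WEEK', 'WEEKDAY', 'HOUR', 'OPEN', 'FERIADO'))]
```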

This is the train, validation and test split: [screenshot]

These are the model parameters:

```python
# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = XGBRegressor(random_state=123),
    lags      = 7  # 24
)

# Grid search of hyperparameters and lags
# ==============================================================================
# Regressor hyperparameters
param_grid = {
    'n_estimators': [100, 500],
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.1]
}

# Lags used as predictors
lags_grid = [7, 30, 48, 72, [1, 2, 3, 7, 23, 24, 25, 71, 72, 73]]

results_grid = grid_search_forecaster(
    forecaster         = forecaster,
    y                  = vendas_df2.loc[:end_validation, 'SALES'],
    exog               = vendas_df2.loc[:end_validation, exog_variables],
    param_grid         = param_grid,
    lags_grid          = lags_grid,
    steps              = 75,
    refit              = False,
    metric             = 'mean_squared_error',
    initial_train_size = int(len(data_train)),
    fixed_train_size   = False,
    return_best        = True,
    verbose            = False
)

# Backtesting test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster         = forecaster,
    y                  = vendas_df2['SALES'],
    exog               = vendas_df2[exog_variables],
    initial_train_size = len(vendas_df2.loc[:end_validation]),
    fixed_train_size   = False,
    steps              = 75,
    refit              = False,
    metric             = 'mean_squared_error',
    verbose            = False
)

print(f"Backtest error: {metric}")
```

This is the final result, with the forecast for May:

[screenshot of the May forecast]

We can see that Sundays have SALES != 0.


My dataset has OPEN=0 on Sundays, so why can't I forecast zero values (prev=0) for Sundays?


Can you help please? Thank you!

Jorge Gomes

Originally posted by @JorgeGomes72 in https://github.com/JoaquinAmatRodrigo/skforecast/discussions/388

JoaquinAmatRodrigo commented 1 year ago

Hi @JorgeGomes72, thanks for using skforecast. Your use case is what we call intermittent demand with a regular pattern. This is a field we are actively exploring, and we will soon publish a user guide that discusses it.

Why does your model predict non-zero values? I see that the predicted values for Sundays are much lower than predictions for other days, yet the model is not able to learn that it should be exactly zero. Since you mention that there are no Sundays in the historical data with sales, the problem is not in the training data. I think the model has learned that sales on one day are highly correlated with sales on previous days. This is true for all days of the week except Sundays. So the model is torn between learning a general pattern and a local one.

How to solve it.

There are some options. If what is really important for your business case is the predictions for all days except Sunday, try to find the model that best predicts the days Monday through Saturday (you can use a custom metric to ignore specific values). Once you have the predictions, replace all Sunday values with zero.
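A minimal sketch of both ideas, assuming the series carries a pandas DatetimeIndex and that the metric callable receives the true and predicted values as pandas Series (the helper name is hypothetical):

```python
from sklearn.metrics import mean_squared_error

# Hypothetical custom metric: score only the days the store can be open.
# In pandas, dayofweek == 6 is Sunday.
def mse_ignoring_sundays(y_true, y_pred):
    mask = y_true.index.dayofweek != 6
    return mean_squared_error(y_true[mask], y_pred[mask])

# After forecasting, force every Sunday prediction to zero.
predictions.loc[predictions.index.dayofweek == 6] = 0
```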

If you really need the output of the model to be always positive, you can apply a log transformation to the data before fitting the model.
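A sketch of that option via the forecaster's transformer_y hook (the forecaster summaries below show a "Transformer for y" field, so the hook is available in the 0.7 release used here); log1p is chosen because SALES contains zeros:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg

# Fit in log space; skforecast applies the inverse transform to the predictions.
# Note: expm1 output is bounded below by -1, so predictions stay near non-negative;
# a plain log/exp pair would be strictly positive but requires y > 0.
log_transformer = FunctionTransformer(
    func         = np.log1p,
    inverse_func = np.expm1,
    validate     = True
)

forecaster = ForecasterAutoreg(
    regressor     = XGBRegressor(random_state=123),
    lags          = 7,
    transformer_y = log_transformer
)
```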

I hope this helps!

JorgeGomes72 commented 1 year ago

Hello, I am doing that: replacing the final Sunday predictions with 0.

Do you think this is a model problem rather than a skforecast problem? I mean, would my result perhaps be different if I didn't use skforecast? What do you think?

Thank you very much, JG

JoaquinAmatRodrigo commented 1 year ago

Hi @JorgeGomes72, I would say that it is related to the learning process of the regressor. However, I encourage you to compare the results with other models or libraries. You may find better performance.
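For example, under the same setup, swapping in another gradient-boosting regressor is a one-line change (a sketch; LGBMRegressor is just one candidate among many):

```python
from lightgbm import LGBMRegressor

# Same ForecasterAutoreg wrapper, different underlying regressor
forecaster_lgbm = ForecasterAutoreg(
    regressor = LGBMRegressor(random_state=123),
    lags      = 7
)
```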

Let us know if you find interesting results!

JorgeGomes72 commented 1 year ago

Hello, I need help to "reorganize" my model after searching for hyperparameters.

My initial model:

```python
# Create forecaster
forecaster = ForecasterAutoreg(
    regressor   = XGBRegressor(random_state=123),
    lags        = 24,
    weight_func = custom_weights
)
```

My grid search:

```python
# Regressor hyperparameters
param_grid = {
    'n_estimators': [100, 500],
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.1]
}

# Lags used as predictors
lags_grid = [24, 30, 48, 72, 168, [1, 2, 3, 7, 23, 24, 25, 71, 72, 73, 168]]

results_grid = grid_search_forecaster(
    forecaster         = forecaster,
    y                  = vendas_df2.loc[:end_validation, 'SALES'],
    exog               = vendas_df2.loc[:end_validation, exog_variables],
    param_grid         = param_grid,
    lags_grid          = lags_grid,
    steps              = 2200,
    refit              = False,
    metric             = 'mean_squared_error',  # custom_metric
    initial_train_size = int(len(data_train)),
    fixed_train_size   = False,
    return_best        = True,
    verbose            = False
)
```

After searching for the best parameters, this is the output:

```
=================
ForecasterAutoreg
=================
Regressor: XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None, interaction_constraints='', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=8, num_parallel_tree=1, predictor='auto', random_state=123, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', validate_parameters=1, verbosity=None)
Lags: [  1   2   3   7  23  24  25  71  72  73 168]
Transformer for y: None
Transformer for exog: None
Window size: 168
Weight function included: True
Exogenous included: True
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'>
Exogenous variables names: ['OPEN', 'FERIADO', 'YEAR_2019', 'YEAR_2020', 'YEAR_2021', 'YEAR_2022', 'YEAR_2023', 'WEEK_1', 'WEEK_2', 'WEEK_3', 'WEEK_4', 'WEEK_5', 'WEEK_6', 'WEEK_7', 'WEEK_8', 'WEEK_9', 'WEEK_10', 'WEEK_11', 'WEEK_12', 'WEEK_13', 'WEEK_14', 'WEEK_15', 'WEEK_16', 'WEEK_17', 'WEEK_18', 'WEEK_19', 'WEEK_20', 'WEEK_21', 'WEEK_22', 'WEEK_23', 'WEEK_24', 'WEEK_25', 'WEEK_26', 'WEEK_27', 'WEEK_28', 'WEEK_29', 'WEEK_30', 'WEEK_31', 'WEEK_32', 'WEEK_33', 'WEEK_34', 'WEEK_35', 'WEEK_36', 'WEEK_37', 'WEEK_38', 'WEEK_39', 'WEEK_40', 'WEEK_41', 'WEEK_42', 'WEEK_43', 'WEEK_44', 'WEEK_45', 'WEEK_46', 'WEEK_47', 'WEEK_48', 'WEEK_49', 'WEEK_50', 'WEEK_51', 'WEEK_52', 'WEEK_53', 'WEEKDAY_1', 'WEEKDAY_2', 'WEEKDAY_3', 'WEEKDAY_4', 'WEEKDAY_5', 'WEEKDAY_6', 'WEEKDAY_7', 'HOUR_0', 'HOUR_1', 'HOUR_2', 'HOUR_3', 'HOUR_4', 'HOUR_5', 'HOUR_6', 'HOUR_7', 'HOUR_8', 'HOUR_9', 'HOUR_10', 'HOUR_11', 'HOUR_12', 'HOUR_13', 'HOUR_14', 'HOUR_15', 'HOUR_16', 'HOUR_17', 'HOUR_18', 'HOUR_19', 'HOUR_20', 'HOUR_21', 'HOUR_22', 'HOUR_23']
Training range: [Timestamp('2019-01-01 00:00:00'), Timestamp('2022-09-30 23:00:00')]
Training index type: DatetimeIndex
Training index frequency: H
Regressor parameters: {'objective': 'reg:squarederror', 'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'enable_categorical': False, 'gamma': 0, 'gpu_id': -1, 'importance_type': None, 'interaction_constraints': '', 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 100, 'n_jobs': 8, 'num_parallel_tree': 1, 'predictor': 'auto', 'random_state': 123, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'subsample': 1, 'tree_method': 'exact', 'validate_parameters': 1, 'verbosity': None}
Creation date: 2023-04-13 21:51:22
Last fit date: 2023-04-14 01:28:46
Skforecast version: 0.7.0
Python version: 3.8.8
Forecaster id: None
```

So, I need to put all these "best parameters" into the model; I must run a lot of identical "stores" and don't want to repeat the search!

I tried this:

```python
# Create forecaster2 with best parameters
# ==============================================================================
forecaster2 = ForecasterAutoreg(
    regressor   = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                               colsample_bynode=1, colsample_bytree=1,
                               enable_categorical=False, gamma=0, gpu_id=-1,
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.1, max_delta_step=0, max_depth=3,
                               min_child_weight=1, monotone_constraints='()',
                               n_estimators=100, n_jobs=8, num_parallel_tree=1,
                               predictor='auto', random_state=123, reg_alpha=0,
                               reg_lambda=1, scale_pos_weight=1, subsample=1,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None, window_size=168, included_exog=True,
                               exog_col_names=exog_variables),
    lags        = [1, 2, 3, 7, 23, 24, 25, 71, 72, 73, 168],
    weight_func = custom_weights
)
```

But the result seems different:

```
=================
ForecasterAutoreg
=================
Regressor: XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, enable_categorical=False, exog_col_names=['OPEN', 'FERIADO', 'YEAR_2019', 'YEAR_2020', 'YEAR_2021', 'YEAR_2022', 'YEAR_2023', 'WEEK_1', 'WEEK_2', 'WEEK_3', 'WEEK_4', 'WEEK_5', 'WEEK_6', 'WEEK_7', 'WEEK_8', 'WEEK_9', 'WEEK_10', 'WEEK_11', 'WEEK_12', 'WEEK_13', 'WEEK_14... gamma=0, gpu_id=-1, importance_type=None, included_exog=True, interaction_constraints='', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=8, num_parallel_tree=1, predictor='auto', random_state=123, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', validate_parameters=1, verbosity=None, ...)
Lags: [  1   2   3   7  23  24  25  71  72  73 168]
Transformer for y: None
Transformer for exog: None
Window size: 168
Weight function included: True
Exogenous included: False
Type of exogenous variable: None
Exogenous variables names: None
Training range: None
Training index type: None
Training index frequency: None
Regressor parameters: {'objective': 'reg:squarederror', 'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'enable_categorical': False, 'gamma': 0, 'gpu_id': -1, 'importance_type': None, 'interaction_constraints': '', 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 100, 'n_jobs': 8, 'num_parallel_tree': 1, 'predictor': 'auto', 'random_state': 123, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'subsample': 1, 'tree_method': 'exact', 'validate_parameters': 1, 'verbosity': None, 'window_size': 168, 'included_exog': True, 'exog_col_names': ['OPEN', 'FERIADO', 'YEAR_2019', 'YEAR_2020', 'YEAR_2021', 'YEAR_2022', 'YEAR_2023', 'WEEK_1', 'WEEK_2', 'WEEK_3', 'WEEK_4', 'WEEK_5', 'WEEK_6', 'WEEK_7', 'WEEK_8', 'WEEK_9', 'WEEK_10', 'WEEK_11', 'WEEK_12', 'WEEK_13', 'WEEK_14', 'WEEK_15', 'WEEK_16', 'WEEK_17', 'WEEK_18', 'WEEK_19', 'WEEK_20', 'WEEK_21', 'WEEK_22', 'WEEK_23', 'WEEK_24', 'WEEK_25', 'WEEK_26', 'WEEK_27', 'WEEK_28', 'WEEK_29', 'WEEK_30', 'WEEK_31', 'WEEK_32', 'WEEK_33', 'WEEK_34', 'WEEK_35', 'WEEK_36', 'WEEK_37', 'WEEK_38', 'WEEK_39', 'WEEK_40', 'WEEK_41', 'WEEK_42', 'WEEK_43', 'WEEK_44', 'WEEK_45', 'WEEK_46', 'WEEK_47', 'WEEK_48', 'WEEK_49', 'WEEK_50', 'WEEK_51', 'WEEK_52', 'WEEK_53', 'WEEKDAY_1', 'WEEKDAY_2', 'WEEKDAY_3', 'WEEKDAY_4', 'WEEKDAY_5', 'WEEKDAY_6', 'WEEKDAY_7', 'HOUR_0', 'HOUR_1', 'HOUR_2', 'HOUR_3', 'HOUR_4', 'HOUR_5', 'HOUR_6', 'HOUR_7', 'HOUR_8', 'HOUR_9', 'HOUR_10', 'HOUR_11', 'HOUR_12', 'HOUR_13', 'HOUR_14', 'HOUR_15', 'HOUR_16', 'HOUR_17', 'HOUR_18', 'HOUR_19', 'HOUR_20', 'HOUR_21', 'HOUR_22', 'HOUR_23']}
Creation date: 2023-04-14 10:30:48
Last fit date: None
Skforecast version: 0.7.0
Python version: 3.8.8
Forecaster id: None
```

Could you help please? Thank you!

JoaquinAmatRodrigo commented 1 year ago

Hi @JorgeGomes72,

What do you mean by "the result seems different"?

Please, try to add the Python code inside a fenced code block (triple backticks). It renders in a much more readable way.

JorgeGomes72 commented 1 year ago

Hello Joaquin, I mean the result of forecaster = ForecasterAutoreg(...).

After searching for the best parameters, we can see in forecaster:

```
Regressor: XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, enable_categorical=False, gamma=0, gpu_id=-1, importance_type=None, interaction_constraints='', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=8, num_parallel_tree=1, predictor='auto', random_state=123, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', validate_parameters=1, verbosity=None)
Lags: [  1   2   3   7  23  24  25  71  72  73 168]
Transformer for y: None
Transformer for exog: None
Window size: 168
Weight function included: True
Exogenous included: True
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'>
Exogenous variables names: ['OPEN', 'FERIADO', 'YEAR_2019', 'YEAR_2020', 'YEAR_2021', 'YEAR_2022', 'YEAR_2023', 'WEEK_1', 'WEEK_2', 'WEEK_3', 'WEEK_4', 'WEEK_5', 'WEEK_6', 'WEEK_7', 'WEEK_8', 'WEEK_9', 'WEEK_10', 'WEEK_11', 'WEEK_12', 'WEEK_13', 'WEEK_14', 'WEEK_15', 'WEEK_16', 'WEEK_17', 'WEEK_18', 'WEEK_19', 'WEEK_20', 'WEEK_21', 'WEEK_22', 'WEEK_23', 'WEEK_24', 'WEEK_25', 'WEEK_26', 'WEEK_27', 'WEEK_28', 'WEEK_29', 'WEEK_30', 'WEEK_31', 'WEEK_32', 'WEEK_33', 'WEEK_34', 'WEEK_35', 'WEEK_36', 'WEEK_37', 'WEEK_38', 'WEEK_39', 'WEEK_40', 'WEEK_41', 'WEEK_42', 'WEEK_43', 'WEEK_44', 'WEEK_45', 'WEEK_46', 'WEEK_47', 'WEEK_48', 'WEEK_49', 'WEEK_50', 'WEEK_51', 'WEEK_52', 'WEEK_53', 'WEEKDAY_1', 'WEEKDAY_2', 'WEEKDAY_3', 'WEEKDAY_4', 'WEEKDAY_5', 'WEEKDAY_6', 'WEEKDAY_7', 'HOUR_0', 'HOUR_1', 'HOUR_2', 'HOUR_3', 'HOUR_4', 'HOUR_5', 'HOUR_6', 'HOUR_7', 'HOUR_8', 'HOUR_9', 'HOUR_10', 'HOUR_11', 'HOUR_12', 'HOUR_13', 'HOUR_14', 'HOUR_15', 'HOUR_16', 'HOUR_17', 'HOUR_18', 'HOUR_19', 'HOUR_20', 'HOUR_21', 'HOUR_22', 'HOUR_23']
Training range: [Timestamp('2019-01-01 00:00:00'), Timestamp('2022-09-30 23:00:00')]
Training index type: DatetimeIndex
Training index frequency: H
Regressor parameters: {'objective': 'reg:squarederror', 'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'enable_categorical': False, 'gamma': 0, 'gpu_id': -1, 'importance_type': None, 'interaction_constraints': '', 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 100, 'n_jobs': 8, 'num_parallel_tree': 1, 'predictor': 'auto', 'random_state': 123, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'subsample': 1, 'tree_method': 'exact', 'validate_parameters': 1, 'verbosity': None}
```

but when I create a new model forecaster2 with the best parameters, the result is:

```
Regressor: XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, enable_categorical=False, exog_col_names=['OPEN', 'FERIADO', 'YEAR_2019', 'YEAR_2020', 'YEAR_2021', 'YEAR_2022', 'YEAR_2023', 'WEEK_1', 'WEEK_2', 'WEEK_3', 'WEEK_4', 'WEEK_5', 'WEEK_6', 'WEEK_7', 'WEEK_8', 'WEEK_9', 'WEEK_10', 'WEEK_11', 'WEEK_12', 'WEEK_13', 'WEEK_14... gamma=0, gpu_id=-1, importance_type=None, included_exog=True, interaction_constraints='', learning_rate=0.1, max_delta_step=0, max_depth=3, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=8, num_parallel_tree=1, predictor='auto', random_state=123, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact', validate_parameters=1, verbosity=None, ...)
Lags: [  1   2   3   7  23  24  25  71  72  73 168]
Transformer for y: None
Transformer for exog: None
Window size: 168
Weight function included: True
Exogenous included: False
Type of exogenous variable: None
Exogenous variables names: None
Training range: None
Training index type: None
Training index frequency: None
Regressor parameters: {'objective': 'reg:squarederror', 'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'enable_categorical': False, 'gamma': 0, 'gpu_id': -1, 'importance_type': None, 'interaction_constraints': '', 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 100, 'n_jobs': 8, 'num_parallel_tree': 1, 'predictor': 'auto', 'random_state': 123, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'subsample': 1, 'tree_method': 'exact', 'validate_parameters': 1, 'verbosity': None, 'window_size': 168, 'included_exog': True, 'exog_col_names': ['OPEN', 'FERIADO', 'YEAR_2019', 'YEAR_2020', 'YEAR_2021', 'YEAR_2022', 'YEAR_2023', 'WEEK_1', 'WEEK_2', 'WEEK_3', 'WEEK_4', 'WEEK_5', 'WEEK_6', 'WEEK_7', 'WEEK_8', 'WEEK_9', 'WEEK_10', 'WEEK_11', 'WEEK_12', 'WEEK_13', 'WEEK_14', 'WEEK_15', 'WEEK_16', 'WEEK_17', 'WEEK_18', 'WEEK_19', 'WEEK_20', 'WEEK_21', 'WEEK_22', 'WEEK_23', 'WEEK_24', 'WEEK_25', 'WEEK_26', 'WEEK_27', 'WEEK_28', 'WEEK_29', 'WEEK_30', 'WEEK_31', 'WEEK_32', 'WEEK_33', 'WEEK_34', 'WEEK_35', 'WEEK_36', 'WEEK_37', 'WEEK_38', 'WEEK_39', 'WEEK_40', 'WEEK_41', 'WEEK_42', 'WEEK_43', 'WEEK_44', 'WEEK_45', 'WEEK_46', 'WEEK_47', 'WEEK_48', 'WEEK_49', 'WEEK_50', 'WEEK_51', 'WEEK_52', 'WEEK_53', 'WEEKDAY_1', 'WEEKDAY_2', 'WEEKDAY_3', 'WEEKDAY_4', 'WEEKDAY_5', 'WEEKDAY_6', 'WEEKDAY_7', 'HOUR_0', 'HOUR_1', 'HOUR_2', 'HOUR_3', 'HOUR_4', 'HOUR_5', 'HOUR_6', 'HOUR_7', 'HOUR_8', 'HOUR_9', 'HOUR_10', 'HOUR_11', 'HOUR_12', 'HOUR_13', 'HOUR_14', 'HOUR_15', 'HOUR_16', 'HOUR_17', 'HOUR_18', 'HOUR_19', 'HOUR_20', 'HOUR_21', 'HOUR_22', 'HOUR_23']}
```

It seems different; for example, I see exog_col_names inside the regressor parameters, not outside them.

What I want to know is: must I create a model with the best parameters like this:

```python
forecaster2 = ForecasterAutoreg(
    regressor   = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                               colsample_bynode=1, colsample_bytree=1,
                               enable_categorical=False, gamma=0, gpu_id=-1,
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.01, max_delta_step=0, max_depth=3,
                               min_child_weight=1, monotone_constraints='()',
                               n_estimators=100, n_jobs=8, num_parallel_tree=1,
                               predictor='auto', random_state=123, reg_alpha=0,
                               reg_lambda=1, scale_pos_weight=1, subsample=1,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None, window_size=168, included_exog=True,
                               exog_col_names=exog_variables),
    lags        = [1, 2, 3, 7, 23, 24, 25, 71, 72, 73, 168],
    weight_func = custom_weights
)
```

or simply like this, without exog_col_names inside the regressor:

```python
forecaster2 = ForecasterAutoreg(
    regressor   = XGBRegressor(random_state=123, learning_rate=0.01,
                               max_depth=3, n_estimators=100),
    lags        = [1, 2, 3, 7, 23, 24, 25, 71, 72, 73, 168],
    weight_func = custom_weights
)

metric, predictions = backtesting_forecaster(
    forecaster         = forecaster2,
    y                  = vendas_df2['SALES'],
    exog               = vendas_df2[exog_variables],
    initial_train_size = len(vendas_df2.loc[:end_validation]),
    fixed_train_size   = False,
    steps              = 2200,
    refit              = False,
    interval           = [0, 95],
    metric             = 'mean_squared_error',  # custom_metric
    verbose            = False
)
```

Imagine that you have return_best=False in grid_search_forecaster and you must write the best model explicitly.
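(For reference, a sketch of pulling the winning combination out of results_grid programmatically, assuming skforecast's usual output, a DataFrame sorted best-first with 'lags' and 'params' columns:)

```python
# The first row of results_grid holds the winning lags and hyperparameters.
best_lags   = results_grid.iloc[0]['lags']
best_params = results_grid.iloc[0]['params']   # e.g. {'learning_rate': 0.1, 'max_depth': 3, ...}

forecaster2 = ForecasterAutoreg(
    regressor   = XGBRegressor(random_state=123, **best_params),
    lags        = best_lags,
    weight_func = custom_weights
)
```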

Thank you! JG

JoaquinAmatRodrigo commented 1 year ago

Once you have created the new forecaster instance, you need to train it using the .fit method. The results you are showing are from an unfitted forecaster:

```
Training range: None
Training index type: None
Training index frequency: None
```
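A minimal sketch of the intended pattern; note that window_size, included_exog and exog_col_names are attributes skforecast itself sets on the forecaster when it is fitted, so they should not be passed to XGBRegressor:

```python
# Only the regressor's own hyperparameters go inside XGBRegressor.
forecaster2 = ForecasterAutoreg(
    regressor   = XGBRegressor(learning_rate=0.1, max_depth=3,
                               n_estimators=100, random_state=123),
    lags        = [1, 2, 3, 7, 23, 24, 25, 71, 72, 73, 168],
    weight_func = custom_weights
)

# Fitting fills in the training range and exogenous metadata shown as None above.
forecaster2.fit(
    y    = vendas_df2.loc[:end_validation, 'SALES'],
    exog = vendas_df2.loc[:end_validation, exog_variables]
)
```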
JorgeGomes72 commented 1 year ago

Hello Joaquin,

Something like this?

```python
forecaster = ForecasterAutoreg(
    regressor   = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                               colsample_bynode=1, colsample_bytree=1,
                               enable_categorical=False, gamma=0, gpu_id=-1,
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.1, max_delta_step=0, max_depth=3,
                               min_child_weight=1, monotone_constraints='()',
                               n_estimators=100, n_jobs=8, num_parallel_tree=1,
                               predictor='auto', random_state=123, reg_alpha=0,
                               reg_lambda=1, scale_pos_weight=1, subsample=1,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None),
    lags        = 168,
    weight_func = custom_weights
)

forecaster.fit(y=vendas_df2.loc[:end_validation, 'SALES'])
predictions = forecaster.predict(steps=2200)
```

Or like this one?

```python
forecaster = ForecasterAutoreg(
    regressor   = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                               colsample_bynode=1, colsample_bytree=1,
                               enable_categorical=False, gamma=0, gpu_id=-1,
                               importance_type=None, interaction_constraints='',
                               learning_rate=0.1, max_delta_step=0, max_depth=3,
                               min_child_weight=1, monotone_constraints='()',
                               n_estimators=100, n_jobs=8, num_parallel_tree=1,
                               predictor='auto', random_state=123, reg_alpha=0,
                               reg_lambda=1, scale_pos_weight=1, subsample=1,
                               tree_method='exact', validate_parameters=1,
                               verbosity=None),
    lags        = 168,
    weight_func = custom_weights
)

metric, predictions = backtesting_forecaster(
    forecaster         = forecaster,
    y                  = vendas_df2['SALES'],
    exog               = vendas_df2[exog_variables],
    initial_train_size = len(vendas_df2.loc[:end_validation]),
    fixed_train_size   = False,
    steps              = 2200,
    refit              = False,
    metric             = 'mean_squared_error',
    verbose            = False
)
```

Thank you! JG

JoaquinAmatRodrigo commented 1 year ago

To have the same results as in the backtesting, you have to train the forecaster with the same data:

```python
forecaster.fit(y=vendas_df2['SALES'], exog=vendas_df2[exog_variables])
```

Furthermore, if you are including exogenous variables in fit, you should also provide them in predict.
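A sketch of the full fit/predict pair; future_exog here is a hypothetical DataFrame holding the exogenous values over the forecast horizon, which must be known ahead of time:

```python
forecaster.fit(
    y    = vendas_df2['SALES'],
    exog = vendas_df2[exog_variables]
)

# Exogenous values must also cover the steps being predicted.
predictions = forecaster.predict(
    steps = 2200,
    exog  = future_exog[exog_variables]   # hypothetical future-calendar DataFrame
)
```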

JorgeGomes72 commented 1 year ago

This is part of the script:

```python
import numpy as np

# Split train-val-test
# ==============================================================================
end_train      = '2021-08-20 23:59:00'
end_validation = '2022-09-30 23:59:00'

data_train = vendas_df2.loc[:end_train, :]
data_val   = vendas_df2.loc[end_train:end_validation, :]
data_test  = vendas_df2.loc[end_validation:, :]

print(f"Dates train      : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Dates validacion : {data_val.index.min()} --- {data_val.index.max()}  (n={len(data_val)})")
print(f"Dates test       : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")

# Custom function to create weights (zero weight for the COVID closure periods)
# ==============================================================================
def custom_weights(index):
    weights = np.where(
        ((index >= '2020-03-10 00:01:00') & (index <= '2020-05-31 23:59:00')) |
        ((index >= '2021-01-15 00:01:00') & (index <= '2021-04-18 23:59:00')),
        0,
        1
    )
    return weights
```

```python
# Create forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor   = XGBRegressor(random_state=123),
    lags        = 168,
    weight_func = custom_weights
)
```

```
=================
ForecasterAutoreg
=================
Regressor: XGBRegressor(base_score=None, booster=None, colsample_bylevel=None,
             colsample_bynode=None, colsample_bytree=None,
             enable_categorical=False, gamma=None, gpu_id=None,
             importance_type=None, interaction_constraints=None,
             learning_rate=0.1, max_delta_step=None, max_depth=3,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=123, reg_alpha=None, reg_lambda=None,
             scale_pos_weight=None, subsample=None, tree_method=None,
             validate_parameters=None, verbosity=None)
Lags: [  1   2   3   7  23  24  25  71  72  73 168]
Transformer for y: None
Transformer for exog: None
Window size: 168
Weight function included: True
Exogenous included: False
Type of exogenous variable: None
Exogenous variables names: None
Training range: None
Training index type: None
Training index frequency: None
Regressor parameters: {'objective': 'reg:squarederror', 'base_score': None, 'booster': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'enable_categorical': False, 'gamma': None, 'gpu_id': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': 0.1, 'max_delta_step': None, 'max_depth': 3, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 100, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': 123, 'reg_alpha': None, 'reg_lambda': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None}
Creation date: 2023-04-14 15:15:46
Last fit date: None
Skforecast version: 0.7.0
Python version: 3.8.8
Forecaster id: None
```

```python
# Grid search of hyperparameters and lags
# ==============================================================================
# Regressor hyperparameters
param_grid = {
    'n_estimators': [100, 500],
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.1]
}

# Lags used as predictors
lags_grid = [72, [1, 2, 3, 23, 24, 25, 71, 72, 73]]
lags_grid = [7, 24, 48, 72, [1, 2, 3, 23, 24, 25, 71, 72, 73]]
lags_grid = [24, 30, 48, 72, 168, [1, 2, 3, 7, 23, 24, 25, 71, 72, 73, 168]]

results_grid = grid_search_forecaster(
    forecaster         = forecaster,
    y                  = vendas_df2.loc[:end_validation, 'SALES'],
    exog               = vendas_df2.loc[:end_validation, exog_variables],
    param_grid         = param_grid,
    lags_grid          = lags_grid,
    steps              = 2200,
    refit              = False,
    metric             = 'mean_squared_error',  # custom_metric
    initial_train_size = int(len(data_train)),
    fixed_train_size   = False,
    return_best        = True,
    verbose            = False
)
```

```python
# Backtesting test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster         = forecaster,
    y                  = vendas_df2['SALES'],
    exog               = vendas_df2[exog_variables],
    initial_train_size = len(vendas_df2.loc[:end_validation]),
    fixed_train_size   = False,
    steps              = 2200,
    refit              = False,
    interval           = [0, 95],
    metric             = 'mean_squared_error',  # custom_metric
    verbose            = False
)
```

```python
predictions.loc['2023-05-02':'2023-05-02']
```

```
                           pred
2023-05-02 00:00:00  328.447021
2023-05-02 01:00:00  349.223907
2023-05-02 02:00:00  350.060333
2023-05-02 03:00:00  350.060333
2023-05-02 04:00:00  356.852539
2023-05-02 05:00:00  356.852539
2023-05-02 06:00:00  356.852539
2023-05-02 07:00:00  356.852539
2023-05-02 08:00:00  356.852539
2023-05-02 09:00:00  371.91122
...
```

"Furthermore, if you are including exogenous variables in fit, you should also provide them in predict." — my final dataset has the same features as the train dataset.

I must sleep! Thank you! JG

JorgeGomes72 commented 1 year ago

Hello, I want to thank you for all the help! I just want to leave here the final script with the best parameters, fit and predict:

```python
from datetime import datetime as dt, timedelta

import pandas as pd
from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg

# Model function (custom_weights is the weight function defined earlier in the thread)
def modelo_XGBRegressor(vendas_df2, tipo):
    # Exogenous variables
    exog_variables = [column for column in vendas_df2.columns
                      if column.startswith(('YEAR', 'WEEK', 'WEEKDAY', 'HOUR', 'OPEN', 'FERIADO'))]

    # Model with the best parameters found by the grid search
    forecaster = ForecasterAutoreg(
        regressor   = XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                                   colsample_bynode=1, colsample_bytree=1,
                                   enable_categorical=False, gamma=0, gpu_id=-1,
                                   importance_type=None, interaction_constraints='',
                                   learning_rate=0.1, max_delta_step=0, max_depth=3,
                                   min_child_weight=1, monotone_constraints='()',
                                   n_estimators=100, n_jobs=8, num_parallel_tree=1,
                                   predictor='auto', random_state=123, reg_alpha=0,
                                   reg_lambda=1, scale_pos_weight=1, subsample=1,
                                   tree_method='exact', validate_parameters=1,
                                   verbosity=None),
        lags        = [1, 2, 3, 7, 23, 24, 25, 71, 72, 73, 168],
        weight_func = custom_weights
    )

    # Last training hour (7 days back, truncated to the hour) and first hour to predict
    today_minus_15_a = pd.to_datetime((dt.today() - timedelta(days=7)).strftime("%Y-%m-%d %H:%M:%S")).replace(minute=0, second=0)
    today_minus_15_p = pd.to_datetime(today_minus_15_a) + timedelta(hours=1)

    # Fit
    forecaster.fit(y    = vendas_df2.loc[:today_minus_15_a, tipo],
                   exog = vendas_df2.loc[:today_minus_15_a, exog_variables])

    # Predictions
    predictions = forecaster.predict(steps = 2200,
                                     exog  = vendas_df2.loc[today_minus_15_p:, exog_variables])

    return predictions

# Apply the model
predict_sales = modelo_XGBRegressor(vendas_df2, tipo='SALES')
# ...
```

Thank you very much; next I will try SARIMA. JG