JoaquinAmatRodrigo / skforecast

Time series forecasting with machine learning models
https://skforecast.org
BSD 3-Clause "New" or "Revised" License
996 stars, 113 forks

Why do I have faster runtime when using more frequent refits #639

Closed. fredn19 closed this issue 4 months ago

fredn19 commented 4 months ago

I'm currently working on a project in which I am trying to predict the hourly energy consumption of a household. I'm trying to predict 24 hours ahead, but with a 12-hour gap, so the prediction horizon is actually 36 hours.

However, since I want to predict an entire year on an hourly basis, I'm playing around with the refit parameter. I have found that using refit=1, meaning that the model refits every 24 hours, is way faster than using refit=7, meaning that the model refits once a week. This does not make sense to me, so I am interested in hearing whether I have understood the mechanism correctly.

The difference in runtime for a year of predictions is roughly 2 min for refit=1 (True) versus 30 min for refit=7.

Here is my backtester:

metric, predictions = backtesting_forecaster(
                          forecaster            = forecaster,
                          y                     = data_short['electricity_cons'],
                          exog                  = data_short[num_cols+remainder_cols],
                          steps                 = 24,
                          metric                = 'mean_absolute_percentage_error',
                          initial_train_size    = len(train_short)+12,
                          fixed_train_size      = True,
                          gap                   = 12,
                          allow_incomplete_fold = True,
                          refit                 = 1,
                          interval              = [5, 95],
                          n_jobs                = 'auto',
                          n_boot                = 500,
                          verbose               = False,
                          show_progress         = True
                      )

And here is the forecaster:

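# Imports needed by this snippet and by the backtester above
# (module paths as in skforecast 0.x releases; they may differ in other versions)
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from skforecast.ForecasterAutoregCustom import ForecasterAutoregCustom
from skforecast.model_selection import backtesting_forecaster

def custom_predictors(y):
    # y holds the last `window_size` observations, most recent value last
    lags = y[[-1, -2, -3, -4, -5, -23, -24, -25, -47, -48, -49]]  # window size needed = 49
    mean_24 = np.mean(y[-24:])  # window size needed = 24
    mean_48 = np.mean(y[-48:])  # window size needed = 48
    predictors = np.hstack([lags, mean_24, mean_48])

    return predictors

forecaster = ForecasterAutoregCustom(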
                 regressor         = SVR(),
                 fun_predictors    = custom_predictors,
                 window_size       = 49,
                 transformer_y     = StandardScaler(),
                 transformer_exog  = Transform_exog,
                 name_predictors = [f'lag {i}' for i in range(1, 6)] + ['lag 23','lag 24','lag 25','lag 47','lag 48','lag 49','moving_avg_24','moving_avg_48'],
             )

JavierEscobarOrtiz commented 4 months ago

Hello @fredn19,

Thanks for opening the topic. You are right, normally you would expect less execution time for a smaller number of refits.

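The reason why the time for refit=True was 2 min and for refit=7 was 30 min is parallelization. With refit=True (equivalent to refit=1), every fold fits its own model, so the folds are independent of each other and can be run in parallel. When refit is an integer other than 1, parallelization is not possible because the model fits must follow a logical order: a fold that does not refit reuses the model fitted in an earlier fold. This configuration is selected automatically thanks to n_jobs='auto'.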
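To make the dependency between folds concrete, here is a toy sketch in plain Python. It is not skforecast's actual implementation: the functions fit_and_predict and backtest are hypothetical, the "model" is just a label, and 365 folds is an assumption standing in for one 24-hour fold per day over a year. It only shows that with refit=1 no fold needs another fold's result, while with refit=7 each fold needs the model carried over from the last refit.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_and_predict(fold):
    # Stand-in for "fit on this fold's training window, then predict the fold".
    # The returned label records in which fold the model was fitted.
    return fold, f"model_fitted_in_fold_{fold}"


def backtest(n_folds, refit, n_workers=4):
    """Return a dict mapping each fold to the model it used (toy version)."""
    if refit == 1:
        # Every fold fits its own model, so folds are independent
        # and can be handed to a pool of workers in any order.
        with ThreadPoolExecutor(max_workers=n_workers) as pool:
            return dict(pool.map(fit_and_predict, range(n_folds)))

    # Intermittent refit: a fold that does not refit reuses the model from
    # the most recent refit, so the folds have to be processed in order.
    used, model = {}, None
    for fold in range(n_folds):
        if fold % refit == 0:
            _, model = fit_and_predict(fold)
        used[fold] = model
    return used


daily = backtest(n_folds=365, refit=1)
weekly = backtest(n_folds=365, refit=7)
print(len(set(daily.values())), "fits with refit=1")
print(len(set(weekly.values())), "fits with refit=7")
```

In this sketch refit=7 performs far fewer fits, but every fold waits for the previous one, so fewer fits do not by themselves guarantee a shorter wall-clock time.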

What I can suggest to reduce time when refit=7:

By the way, you might be interested in reading this article about metrics when building a forecasting model:

https://towardsdatascience.com/forecast-kpi-rmse-mae-mape-bias-cdc5703d242d

Hope it helps!

Javier

fredn19 commented 4 months ago

Thank you so much for the explanation Javier!

It makes way more sense to me now :-)

I also see your point about the metric, thanks for the heads up!