JoaquinAmatRodrigo / skforecast

Time series forecasting with machine learning models
https://skforecast.org
BSD 3-Clause "New" or "Revised" License

Mid/Long term forecasting #794

Open samuelefiorini opened 2 weeks ago

samuelefiorini commented 2 weeks ago

This is more of a general question than an issue. I apologize if this should be addressed elsewhere.

I am interested in performing mid- to long-term forecasting of a single time series and have been exploring the ForecasterAutoreg* class of models. The direct approach may not be suitable for my use case, as it requires training a large number of models and demands extensive historical training data to match the forecast horizon. Please correct me if I am mistaken.

The ForecasterAutoreg* models are based on the assumption that, to predict the time series at t+1, the model needs to know its value at time t [^1]. However, for mid- to long-term forecasting, this often leads to error accumulation over the prediction horizon, and the resulting forecasts can become unreasonable. An effective backtesting procedure is also hard to achieve when the prediction horizon is long.
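For illustration, here is a minimal, library-agnostic sketch of the recursive strategy I mean (the names `recursive_forecast`, `model`, and `last_window` are hypothetical, not skforecast's API). The feedback in the last line of the loop is where the errors compound:

```python
import numpy as np

def recursive_forecast(model, last_window, steps):
    """Recursive multi-step forecast with any fitted regressor that
    maps a fixed-length lag window to a single next value."""
    window = list(last_window)
    preds = []
    for _ in range(steps):
        y_hat = model.predict(np.array(window).reshape(1, -1))[0]
        preds.append(y_hat)
        # The prediction for t+1 becomes a lag for t+2, so any error
        # it carries is propagated to every subsequent step.
        window = window[1:] + [y_hat]
    return np.array(preds)
```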

In the past, I have successfully used Greykite [^2] for similar tasks. From my understanding, Greykite does not strictly rely on autoregression (lagged extra regressors are optional). Instead, it trains a single sklearn regressor to predict all future time points while leveraging cyclical features [^3] to encode the forecast time (e.g., day, month, year, hour, etc.). This approach allows the regressor to be trained on a relatively short history and to predict over arbitrarily long forecast horizons (this is also somewhat related to #764).
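For context, this is roughly what that idea looks like with plain scikit-learn (a minimal sketch, not Greykite's actual pipeline; the toy series `y` and the feature set are placeholders). Since the only inputs are deterministic calendar encodings, the horizon can be extended freely:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def cyclical_features(index: pd.DatetimeIndex) -> pd.DataFrame:
    """Sin/cos encoding of calendar components: features that are
    known for any future date, with no dependence on past values."""
    return pd.DataFrame({
        "month_sin": np.sin(2 * np.pi * index.month / 12),
        "month_cos": np.cos(2 * np.pi * index.month / 12),
        "dow_sin": np.sin(2 * np.pi * index.dayofweek / 7),
        "dow_cos": np.cos(2 * np.pi * index.dayofweek / 7),
    }, index=index)

# Toy daily series, a stand-in for the real data.
rng = pd.date_range("2022-01-01", periods=730, freq="D")
y = pd.Series(np.sin(2 * np.pi * rng.dayofyear / 365) + 0.1 * np.random.randn(730),
              index=rng)

regressor = GradientBoostingRegressor()
regressor.fit(cyclical_features(y.index), y)

# Predict an arbitrarily long horizon: no lags, no feedback loop.
future = pd.date_range(y.index[-1] + y.index.freq, periods=365, freq="D")
predictions = pd.Series(regressor.predict(cyclical_features(future)), index=future)
```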

However, Greykite is a highly opinionated tool, which makes it difficult to customize beyond its intended use cases. For example, the final estimator in the pipeline must be one of the few supported, and altering preprocessing steps is not straightforward. In contrast, I appreciate the flexibility offered by skforecast.

Given this, I have the following questions:

  1. Is it currently possible to achieve a similar approach using skforecast?
  2. If not, are you fundamentally opposed to this kind of approach, or could it be a feature worth considering?

Sources:

JoaquinAmatRodrigo commented 2 weeks ago

Hi @samuelefiorini, thanks for using skforecast.

You are right, medium- to long-term forecasting can be challenging when using only autoregressive (lag) features due to error accumulation. In this scenario, adding features whose values are known over the forecast horizon (calendar features, holidays, commercial events...) can play a key role.

In the skforecast framework, we call them exogenous features, and they can be added to the model following the same idea you described for Greykite (see the links below), although at least one autoregressive (lag) feature must be included.

https://skforecast.org/0.13.0/user_guides/exogenous-variables

https://cienciadedatos.net/documentos/py39-forecasting-time-series-with-skforecast-xgboost-lightgbm-catboost.html

My suggestion would be to train a model with the exogenous variables and at least the most relevant lag. If the lag is not needed, the model should give it a low weight. This is not a perfect solution (since you may be adding some irrelevant information), but it may give you useful results.
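For concreteness, a minimal sketch of this setup with the 0.13 API (the toy series and calendar columns are placeholders for your own data; any exogenous columns must be known in advance over the forecast horizon):

```python
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg

# Toy stand-ins: `data` is the target series; `exog` holds calendar
# columns over the training period, `exog_future` the same columns
# over the forecast horizon.
idx = pd.date_range("2022-01-01", periods=730, freq="D")
data = pd.Series(np.sin(2 * np.pi * idx.dayofyear / 365) + 0.1 * np.random.randn(730),
                 index=idx, name="y")
exog = pd.DataFrame({"month": idx.month, "dayofweek": idx.dayofweek}, index=idx)

horizon = pd.date_range(idx[-1] + idx.freq, periods=90, freq="D")
exog_future = pd.DataFrame({"month": horizon.month, "dayofweek": horizon.dayofweek},
                           index=horizon)

forecaster = ForecasterAutoreg(
    regressor=LGBMRegressor(random_state=123, verbose=-1),
    lags=1,  # keep only the most relevant lag; exog carries the long-term signal
)
forecaster.fit(y=data, exog=exog)
predictions = forecaster.predict(steps=len(exog_future), exog=exog_future)
```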

Regarding direct models, as you mentioned, training an individual model for each step becomes infeasible as the forecast horizon gets longer. However, if you only need the predicted value for the last few steps, then the first models can be pruned. This is a feature we are implementing and hope to include in the next release.

Hope this helps. Do not hesitate to share any other ideas that could solve this challenge; we are always willing to improve the library.

samuelefiorini commented 2 weeks ago

Thank you @JoaquinAmatRodrigo for your thoughtful response. I will definitely try to follow your suggestion.