JoaquinAmatRodrigo / skforecast

Time series forecasting with machine learning models
https://skforecast.org
BSD 3-Clause "New" or "Revised" License
1.11k stars 129 forks

How to forecast time series with growing trend using regression tree models in skforecast #488

Closed bluecrocod closed 11 months ago

bluecrocod commented 1 year ago

First of all, congratulations on the amazing job you are doing with this library.

My question is how to use skforecast with models such as Random Forest or XGBoost when the expected future values are higher than the training data (because the series has a growing trend).

When I try to predict future values, the result is a horizontal straight line.

Is it possible to use skforecast in this scenario? I guess I would have to apply a pre-transformation to remove the trend, run the forecast, and finally add the trend back; however, I couldn't find any examples dealing with this kind of problem, and I don't know how to proceed.

JoaquinAmatRodrigo commented 1 year ago

Hi @bluecrocod Thanks for using skforecast! Dealing with the trend component can indeed be challenging when using certain machine learning models for forecasting. Here are some suggestions:

The solution you suggest may also work, but would require additional coding.

Let us know if any of these methods work!

bluecrocod commented 1 year ago

Thank you very much for your answer Joaquin,

I improved my current forecast by using ForecasterAutoregCustom as you recommended.

However, the problem persists when I use a tree regression model. Please take a look at the following example:

As you can see, the predictions are a constant value (a horizontal straight line). Is there an "easy" way to deal with this issue in skforecast other than removing the trend? For this kind of problem, should I only try models such as Ridge or Lasso?

Dates train : 2021-01-03 00:00:00 --- 2021-03-15 00:00:00 (n=72)
Dates val   : 2021-03-16 00:00:00 --- 2021-04-15 00:00:00 (n=31)
Dates test  : 2021-04-16 00:00:00 --- 2021-05-15 00:00:00 (n=30)

[images: plots of the train/validation/test series showing flat test-set predictions]

JoaquinAmatRodrigo commented 1 year ago

Hi @bluecrocod

You're absolutely right: a major limitation of tree-based models is their inability to predict values outside the range observed during training (they cannot extrapolate). Linear models such as Ridge or Lasso do not face this problem and can provide a reasonable approximation in such cases. Using these models is currently the only alternative that skforecast can offer.
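This limitation is easy to reproduce outside skforecast with a couple of plain scikit-learn models (a minimal sketch, not the skforecast API): a tree fit on a growing trend predicts a flat line beyond the training range, while a linear model keeps the trend going.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# A strictly growing series: y = 2 * t
X_train = np.arange(100).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
linear = Ridge().fit(X_train, y_train)

# Predict beyond the training range
X_future = np.arange(100, 120).reshape(-1, 1)
tree_pred = tree.predict(X_future)      # flat: capped at the training maximum
linear_pred = linear.predict(X_future)  # continues the trend
```

Every future input falls into the tree's rightmost leaf, so `tree_pred` is a constant equal to the training maximum, which is exactly the horizontal line seen in the plots above.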

However, it is very important to address this limitation when dealing with real business scenarios. It would be fantastic to explore new approaches to overcome this challenge. I'm open to discussing ideas and even collaborating on the implementation of a potential solution. So please feel free to share your thoughts and let's dive into this exciting topic.

As you mentioned, one possibility could be to automate the process of isolating the trend, training a linear model for the trend and a tree-based model for the remaining signal, and then combining the two.
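That hybrid idea might be sketched with plain scikit-learn (a hypothetical illustration, not a skforecast feature): a linear model on the time index captures the trend, a tree models the detrended residuals, and the forecast is the sum of the two.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
t = np.arange(200).reshape(-1, 1)
y = 0.5 * t.ravel() + 5 * np.sin(t.ravel() / 10) + rng.normal(0, 0.5, 200)

# Step 1: linear model on the time index captures the trend
trend_model = Ridge().fit(t, y)

# Step 2: tree-based model on the detrended residuals
residuals = y - trend_model.predict(t)
residual_model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(t, residuals)

# Step 3: combine both components for the forecast
t_future = np.arange(200, 230).reshape(-1, 1)
forecast = trend_model.predict(t_future) + residual_model.predict(t_future)
```

The residual component still flattens out beyond the training range, but the linear trend keeps growing, so the combined forecast is no longer capped at the training maximum.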

JoaquinAmatRodrigo commented 1 year ago

This implementation of tree-based models by @cerlymarco may be a great approach! https://github.com/cerlymarco/linear-tree

bluecrocod commented 1 year ago

Thank you for this information. I will keep working on this approach and will let you know if my forecast improves.

edgBR commented 1 year ago

Hi @bluecrocod,

I suggest implementing a difference transformer over your time series. In a classic scikit-learn pipeline manner, it would look more or less like this:


import logging
from typing import Optional, Union

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

logger = logging.getLogger(__name__)


class DifferenceTransformer(BaseEstimator, TransformerMixin):
    """Custom differentiation transformation that follows the Sklearn interface to be used in a Pipeline."""

    def __init__(self, num_observed_points: int):
        """

        Args:
            num_observed_points: The number of observed points from the past that are used to forecast the future.
        """

        self.num_observed_points = num_observed_points
        self.X_train = None

    def fit(self, X: Optional[Union[pd.DataFrame, pd.Series]], y: Optional[Union[pd.DataFrame, pd.Series]] = None) -> "DifferenceTransformer":
        """In this case by fitting we just keep a copy of the train data to reconstruct the time series on the inverse_transform() operation."""

        logger.debug(f"DifferenceTransformer X (fit): {X.shape}")
        if y is not None:
            logger.debug(f"DifferenceTransformer y (fit): {y.shape}")

        self.X_train = X.copy()

        return self

    def transform(self, X: Optional[Union[pd.DataFrame, pd.Series]]) -> pd.DataFrame:
        """Apply the differentiation operation."""

        X_diff = X.diff(periods=1)
        X_diff = X_diff.iloc[1:]

        logger.debug(f"DifferenceTransformer (transform): {X_diff.shape}")

        return X_diff

    def inverse_transform(self, X: Optional[Union[np.ndarray, pd.DataFrame, pd.Series]]) -> pd.DataFrame:
        """Inverse the differentiation operation."""

        if isinstance(X, np.ndarray):
            if len(X.shape) == 1:
                X = pd.Series(X)
            elif len(X.shape) == 2:
                X = pd.DataFrame(X)
            else:
                raise RuntimeError(f"Shape of X not supported: {X.shape}")

        if len(X) == len(self.X_train) - 1:
            first_values = self.X_train.iloc[[0]]
        else:
            first_values = self.X_train.iloc[[-(self.num_observed_points + 1)]]

        X_reversed = pd.concat([first_values, X])
        X_reversed = X_reversed.cumsum(axis=0)

        logger.debug(f"DifferenceTransformer (inverse_transform): {X_reversed.shape}")

        return X_reversed

Basically, this transformer will remove the trend of your time series, and then you can model the remaining (differenced) series with any tree model. Of course, this is not the same as estimating the future trend (differencing assumes the trend stays the same), but it is a good enough start.
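The diff/cumsum round trip the transformer relies on can be checked with plain pandas (a minimal standalone sketch, not the class above): differencing removes the trend, and `cumsum()` with the stored first value prepended reconstructs the original series exactly.

```python
import pandas as pd

s = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])  # growing series

# transform: first-order differences (drop the NaN introduced by diff)
s_diff = s.diff().iloc[1:]                     # 2, 3, 4, 5

# inverse_transform: prepend the stored first value and integrate
restored = pd.concat([s.iloc[[0]], s_diff]).cumsum()
print(restored.tolist())  # [10.0, 12.0, 15.0, 19.0, 24.0]
```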

Probably self.num_observed_points could be taken from the horizon through some sort of class inheritance, but I am still familiarizing myself with the API.

Let me know if you make it work or not!

BR E

JoaquinAmatRodrigo commented 1 year ago

Hi @bluecrocod and @edgBR, We are currently developing a similar solution (skforecast version 0.10.x or higher) that incorporates a new differentiation parameter into the forecaster.

differentiation : int, default `None`
        Order of differencing applied to the time series before training the forecaster.
        If `None`, no differencing is applied. The order of differentiation is the number
        of times the differencing operation is applied to a time series. Differencing
        involves computing the differences between consecutive data points in the series.
        Differentiation is reversed in the output of `predict()` and `predict_interval()`.

This is achieved by making internal use of a new transformer named skforecast.preprocessing.TimeSeriesDifferentiator. It is worth noting that the entire differentiation process has been automated and its effects are seamlessly reversed during the prediction phase. This ensures that the resulting forecasted values are in the original scale of the time series data.
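What an order of differentiation greater than one means can be sketched with plain pandas (an illustration of the concept, not skforecast internals): the differencing operation is applied twice, then reversed twice with the stored initial values to recover the original scale.

```python
import pandas as pd

y = pd.Series([1.0, 4.0, 9.0, 16.0, 25.0])  # quadratic trend

d1 = y.diff().iloc[1:]    # order 1: 3, 5, 7, 9  (still trending)
d2 = d1.diff().iloc[1:]   # order 2: 2, 2, 2     (constant, i.e. stationary)

# Reverse: integrate twice, prepending the stored initial values
r1 = pd.concat([d1.iloc[[0]], d2]).cumsum()
r0 = pd.concat([y.iloc[[0]], r1]).cumsum()
print(r0.tolist())  # [1.0, 4.0, 9.0, 16.0, 25.0]
```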

Please be aware that this is still experimental.

JavierEscobarOrtiz commented 1 year ago

Hello,

Differentiation is now possible with skforecast 0.10.0, check:

https://skforecast.org/latest/faq/time-series-differentiation

and

Modelling time series trend with tree based models

Hope it helps!