JoaquinAmatRodrigo / skforecast

Time series forecasting with machine learning models
https://skforecast.org
BSD 3-Clause "New" or "Revised" License

Custom predictors are inefficient for window features #629

Open KishManani opened 8 months ago

KishManani commented 8 months ago

The way that custom predictors are calculated in ForecasterAutoregCustom is potentially inefficient.

See the relevant code snippet.

The features are computed in a for loop and appended to a list:

        X_train  = []
        y_train  = []

        for i in range(len(y) - self.window_size):

            train_index = np.arange(i, self.window_size + i)
            test_index  = self.window_size + i

            X_train.append(self.fun_predictors(y=y_values[train_index]))
            y_train.append(y_values[test_index])

Whilst this is very generic, it is inefficient for the most common use case of window features (e.g., the rolling mean, standard deviation, etc., at various window sizes). There are optimised ways of computing these rolling statistics that take the whole series as input and output the features for all time steps at once (rather than one time period at a time, as required by the current implementation of ForecasterAutoregCustom).

Pandas and Numpy have efficient implementations of rolling statistics. They can also be sped up further with Numba (see, for example, Nixtla's window-ops package).
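To illustrate the gap, here is a minimal sketch (not skforecast code) comparing the per-window loop with a single vectorized pandas pass; both produce the same rolling-mean feature column:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)
y = pd.Series(rng.normal(size=1_000))
window_size = 20

# Per-window approach (as in ForecasterAutoregCustom): the statistic is
# recomputed from scratch for every training row, O(n * window_size).
loop_means = np.array([
    y.to_numpy()[i:i + window_size].mean()
    for i in range(len(y) - window_size)
])

# Whole-series approach: pandas computes every rolling mean in one pass.
# The mean over y[i : i + window_size] lands at rolling index
# i + window_size - 1, so drop the first window_size - 1 positions and
# the last value (which has no target observation after it).
rolling_means = y.rolling(window_size).mean().to_numpy()[window_size - 1:-1]

assert np.allclose(loop_means, rolling_means)
```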

Some options that I think would be helpful for skforecast:

1. Offer an additional way to derive features from the target variable, so that a user can supply a more efficient function for calculating window features.
2. Integrate window features into the API itself, in a similar way to lag features, so that a more efficient implementation can be used under the hood.

On a side note, in my opinion, it is counterintuitive that ForecasterAutoreg supports a lags argument while ForecasterAutoregCustom requires the user to create the lag features inside the custom function they provide.

Best wishes, Kishan

JoaquinAmatRodrigo commented 8 months ago

Hi @KishManani I agree. We tried to implement the least restrictive approach, so that users could use any function to create predictors from a window of values. However, in most cases users only need rolling statistics, which are much more efficiently computed over the whole series than window by window.

An important point to consider is that the function has to be applied at two levels:

  1. When creating the training matrices:
X_train  = []
y_train  = []

for i in range(len(y) - self.window_size):

    train_index = np.arange(i, self.window_size + i)
    test_index  = self.window_size + i

    X_train.append(self.fun_predictors(y=y_values[train_index]))
    y_train.append(y_values[test_index])
  2. In the prediction process (inside the recursive loop):
for i in range(steps):
  X = self.fun_predictors(y=last_window).reshape(1, -1)
  if np.isnan(X).any():
      raise ValueError(
          "`fun_predictors()` is returning `NaN` values."
      )
  if exog is not None:
      X = np.column_stack((X, exog[i, ].reshape(1, -1)))

  with warnings.catch_warnings():
      # Suppress scikit-learn warning: "X does not have valid feature names,
      # but NoOpTransformer was fitted with feature names".
      warnings.simplefilter("ignore")
      prediction = self.regressor.predict(X)
      predictions[i] = prediction.ravel()[0]

  # Update `last_window` values. The first position is discarded and 
  # the new prediction is added at the end.
  last_window = np.append(last_window[1:], prediction)

I am going to do some benchmarking analysis with a function that has a dual behavior when applied to a data frame (case 1) and a NumPy array (case 2).
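As a rough sketch of what such a dual-behavior function could look like (all names here are illustrative, not part of the skforecast API), one could dispatch on the input type: a pandas Series yields the full training feature matrix in one vectorized pass (case 1), while a NumPy window yields the single feature row needed inside the recursive prediction loop (case 2):

```python
import numpy as np
import pandas as pd

def rolling_mean_predictor(y, window_size=20):
    """
    Hypothetical dual-behavior predictor (illustrative only).

    - pandas Series in  -> full feature matrix for training (case 1),
      computed with a single vectorized rolling pass.
    - 1-D NumPy window in -> one feature row for a recursive
      prediction step (case 2).
    """
    if isinstance(y, pd.Series):
        # Case 1: one rolling pass over the whole series. The mean over
        # y[i : i + window_size] lands at rolling index i + window_size - 1.
        means = y.rolling(window_size).mean().to_numpy()[window_size - 1:-1]
        return means.reshape(-1, 1)
    # Case 2: a single window of the last `window_size` values.
    return np.array([np.mean(y[-window_size:])])

y = pd.Series(np.arange(100, dtype=float))

X_train = rolling_mean_predictor(y)                   # training matrix
x_step  = rolling_mean_predictor(y.to_numpy()[-20:])  # one prediction row

# The last training row matches the per-window computation.
assert np.isclose(X_train[-1, 0], y.to_numpy()[79:99].mean())
```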

slavakx commented 3 months ago

I've just started working with your package and I'm already impressed. I prefer it over others like sktime and mlforecast. However, I suggest redesigning certain parts of the API, as the absence of window features is a significant drawback. You might consider looking at how mlforecast has integrated this functionality into their API.

JavierEscobarOrtiz commented 3 months ago

Hello @slavakx,

Thanks for your comment. Could you please provide a schema or a reproducible example with another library to better understand what you are expecting?

Thanks!

JoaquinAmatRodrigo commented 3 months ago

Hi @slavakx Thanks for your feedback. We are always trying to include useful functionality in the library, so we will add window features to the roadmap.

slavakx commented 3 months ago


Hi @JoaquinAmatRodrigo

The mlforecast code looks like this, and it supports different time features (see my own implementation below):

from mlforecast import MLForecast
from mlforecast.lag_transforms import ExpandingMean, RollingMean
from mlforecast.target_transforms import Differences

fcst = MLForecast(
    models=models,
    freq='D',
    lags=[7, 14],
    lag_transforms={
        1: [ExpandingMean()],
        7: [RollingMean(window_size=28)]
    },
    date_features=['dayofweek'],
    target_transforms=[Differences([1])],
)

Here is my own code for lag and window features, which runs pretty fast:

Lag

LAG_DAYS = list(range(1, 3))
for lag in LAG_DAYS:
    temp_df[f"temp_lag_{lag}"] = temp_df.groupby(lambda x: True )["temp"].transform(lambda x: x.shift(lag))
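A side note on the snippet above: `groupby(lambda x: True)` puts every row into a single group, so for one series a plain `shift()` is equivalent and avoids the groupby overhead (the grouped form only pays off when there are multiple series keyed by an id column). A minimal sketch with made-up data:

```python
import pandas as pd

# Toy single-series frame (made-up values for illustration).
temp_df = pd.DataFrame({"temp": [10.0, 12.0, 11.0, 13.0, 14.0]})

# Equivalent to groupby(lambda x: True)...transform(lambda x: x.shift(lag))
# when there is only one series: just shift directly.
for lag in range(1, 3):
    temp_df[f"temp_lag_{lag}"] = temp_df["temp"].shift(lag)
```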


Moving Average (Rolling Mean)

moving_average_days = [2]
day_lags = [1,2]
for i in moving_average_days:
    for day_lag in day_lags:
        print(f"Moving average period: {i} with shift day: {day_lag}")
        temp_df[f"moving_average_{i}_lag_{day_lag}"] = temp_df.groupby(lambda x: True)["temp"].transform(lambda x: x.shift(day_lag).rolling(i).mean())


Expanding Mean

day_lags = [1,2]
for day_lag in day_lags:
    print(f"Expanding mean with shift day: {day_lag}")
    temp_df[f"expanding_mean_lag_{day_lag}"] = temp_df.groupby(lambda x: True)["temp"].transform(lambda x: x.shift(day_lag).expanding().mean())


Exponentially Weighted Moving Average

alpha = 0.5 

day_lags = [1,2]
for day_lag in day_lags:
    print(f"EWMA with shift day: {day_lag}")
    temp_df[f"ewma_lag_{day_lag}"] = temp_df.groupby(lambda x: True)["temp"].transform(lambda x: x.shift(day_lag).ewm(alpha=alpha, adjust=False).mean())


Seasonal Rolling Mean

import pandas as pd 
date_range = pd.date_range(start='2023-01-01', end='2023-02-01', freq='D') 
df = pd.DataFrame({'value': range(len(date_range))}, index=date_range) 
df["dayofweek"] = df.index.dayofweek
df.head()


seasonal_average_days = [2]
day_lags = [1,2]
for i in seasonal_average_days:
    for day_lag in day_lags:
        print(f"Seasonal moving average period: {i} with shift day: {day_lag}")
        df[f"seasonal_moving_average_{i}_lag_{day_lag}"] = df.groupby("dayofweek")["value"].transform(lambda x: x.shift(day_lag).rolling(i).mean())

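One cell of the seasonal feature above can be checked by hand: with daily data starting on Sunday 2023-01-01, the feature for the fourth Sunday (2023-01-22) should be the mean of the shifted values of the two preceding Sundays, i.e. the values of 2023-01-08 and 2023-01-15:

```python
import pandas as pd

date_range = pd.date_range(start='2023-01-01', end='2023-02-01', freq='D')
df = pd.DataFrame({'value': range(len(date_range))}, index=date_range)
df['dayofweek'] = df.index.dayofweek

# Seasonal rolling mean: shift and roll *within* each day-of-week group,
# so the window for a Sunday only ever contains previous Sundays.
df['seasonal_ma_2_lag_1'] = (
    df.groupby('dayofweek')['value']
      .transform(lambda x: x.shift(1).rolling(2).mean())
)

# Sundays carry values 0, 7, 14, 21, 28; for 2023-01-22 the shifted
# window holds the values of 2023-01-08 (7) and 2023-01-15 (14).
assert df.loc['2023-01-22', 'seasonal_ma_2_lag_1'] == (7 + 14) / 2
```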

JoaquinAmatRodrigo commented 3 months ago

Thanks @slavakx for the examples.

Some of these window features can already be used in skforecast with ForecasterAutoregCustom, for example:


# Custom function to create predictors
# ==============================================================================
def create_predictors(y):
    """
    Create first 10 lags of a time series.
    Calculate moving average with window 20.
    """

    lags = y[-1:-11:-1]     # window size needed = 10
    mean = np.mean(y[-20:]) # window size needed = 20
    predictors = np.hstack([lags, mean])

    return predictors

# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoregCustom(
                 regressor       = RandomForestRegressor(random_state=123),
                 fun_predictors  = create_predictors,
                 name_predictors = [f'lag {i}' for i in range(1, 11)] + ['moving_avg_20'],
                 window_size     = 20 # window_size needed by the mean is the most restrictive one
             )

forecaster.fit(y=data_train)
forecaster
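To see what this custom function produces, here is a quick sanity check on a toy window (the function is re-created here so the snippet is self-contained):

```python
import numpy as np

def create_predictors(y):
    """
    Create first 10 lags of a time series.
    Calculate moving average with window 20.
    """
    lags = y[-1:-11:-1]     # last 10 values, most recent first
    mean = np.mean(y[-20:]) # moving average over the full window
    return np.hstack([lags, mean])

window = np.arange(20.0)  # toy window of length 20: values 0 .. 19
predictors = create_predictors(window)

assert predictors.shape == (11,)  # 10 lags + 1 moving average
assert np.array_equal(predictors[:10], np.arange(19.0, 9.0, -1.0))
assert predictors[10] == 9.5      # mean of 0 .. 19
```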

We are currently working on improving the flexibility and speed.

More detailed information here: https://skforecast.org/0.12.1/user_guides/custom-predictors