Open KishManani opened 8 months ago
Hi @KishManani, I agree. We went with the less restrictive approach so that users could use whatever function they want to create predictors, given a window of values. However, in most cases users only need rolling statistics, which are much more efficiently computed over the whole series than window by window.
An important point to consider is that the function has to be applied at two levels: once over the training series to build the training matrices,

```python
X_train = []
y_train = []
for i in range(len(y) - self.window_size):
    train_index = np.arange(i, self.window_size + i)
    test_index = self.window_size + i
    X_train.append(self.fun_predictors(y=y_values[train_index]))
    y_train.append(y_values[test_index])
```

and once per step during recursive prediction:

```python
for i in range(steps):
    X = self.fun_predictors(y=last_window).reshape(1, -1)
    if np.isnan(X).any():
        raise ValueError(
            "`fun_predictors()` is returning `NaN` values."
        )
    if exog is not None:
        X = np.column_stack((X, exog[i, ].reshape(1, -1)))
    with warnings.catch_warnings():
        # Suppress scikit-learn warning: "X does not have valid feature names,
        # but NoOpTransformer was fitted with feature names".
        warnings.simplefilter("ignore")
        prediction = self.regressor.predict(X)
    predictions[i] = prediction.ravel()[0]
    # Update `last_window` values. The first position is discarded and
    # the new prediction is added at the end.
    last_window = np.append(last_window[1:], prediction)
```
I am going to do some benchmarking with a function that has a dual behavior: applied to a whole data frame (case 1) or to a single-window NumPy array (case 2).
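Such a dual-behavior function could look like the following sketch (illustrative only, not library code; the function name is made up). A pandas Series is processed as a whole series with a vectorized rolling mean, while a NumPy array is treated as one window, as `fun_predictors` receives it:

```python
import numpy as np
import pandas as pd

def rolling_mean_10(y):
    """Dual behavior: pandas Series (case 1) -> vectorized rolling mean
    over the whole series; NumPy array (case 2) -> mean of one window."""
    if isinstance(y, pd.Series):
        return y.rolling(10).mean()  # all time steps at once
    return np.mean(y[-10:])          # one window -> one value

y = pd.Series(np.arange(100, dtype=float))
whole_series = rolling_mean_10(y)                 # case 1: vectorized
one_window = rolling_mean_10(y.to_numpy()[:10])   # case 2: per-window
```

Both paths produce the same value for a given window, so the vectorized path can be used at fit time and the per-window path at prediction time.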
I've just started working with your package and I'm already impressed. I prefer it over others like sktime and mlforecast. However, I suggest redesigning certain parts of the API, as the absence of window features is a significant drawback. You might consider looking at how mlforecast has integrated this functionality into their API.
Hello @slavakx,
Thanks for your comment. Could you please provide a schema or a reproducible example with another library to better understand what you are expecting?
Thanks!
Hi @slavakx, thanks for your feedback. We are always trying to include useful functionality in the library, so we will add window features to the roadmap.
Hi @JoaquinAmatRodrigo
The mlforecast code looks like this, and it supports different time features (see my own implementation below):

```python
fcst = MLForecast(
    models=models,
    freq='D',
    lags=[7, 14],
    lag_transforms={
        1: [ExpandingMean()],
        7: [RollingMean(window_size=28)],
    },
    date_features=['dayofweek'],
    target_transforms=[Differences([1])],
)
```
Here is my own code for lag and window features, which runs pretty fast:

```python
# Lag features (groupby with a constant key puts everything in one group)
LAG_DAYS = list(range(1, 3))
for lag in LAG_DAYS:
    temp_df[f"temp_lag_{lag}"] = temp_df.groupby(lambda x: True)["temp"].transform(lambda x: x.shift(lag))

# Moving average features
moving_average_days = [2]
day_lags = [1, 2]
for i in moving_average_days:
    for day_lag in day_lags:
        print(f"Moving average period: {i} with shift day: {day_lag}")
        temp_df[f"moving_average_{i}_lag_{day_lag}"] = temp_df.groupby(lambda x: True)["temp"].transform(lambda x: x.shift(day_lag).rolling(i).mean())

# Expanding mean features
day_lags = [1, 2]
for day_lag in day_lags:
    print(f"Expanding mean with shift day: {day_lag}")
    temp_df[f"expanding_mean_lag_{day_lag}"] = temp_df.groupby(lambda x: True)["temp"].transform(lambda x: x.shift(day_lag).expanding().mean())

# Exponentially weighted moving average features
alpha = 0.5
day_lags = [1, 2]
for day_lag in day_lags:
    print(f"EWMA with shift day: {day_lag}")
    temp_df[f"ewma_lag_{day_lag}"] = temp_df.groupby(lambda x: True)["temp"].transform(lambda x: x.shift(day_lag).ewm(alpha=alpha, adjust=False).mean())
```
```python
import pandas as pd

date_range = pd.date_range(start='2023-01-01', end='2023-02-01', freq='D')
df = pd.DataFrame({'value': range(len(date_range))}, index=date_range)
df["dayofweek"] = df.index.dayofweek
df.head()

# Seasonal moving average features (grouped by day of week)
seasonal_average_days = [2]
day_lags = [1, 2]
for i in seasonal_average_days:
    for day_lag in day_lags:
        print(f"Seasonal moving average period: {i} with shift day: {day_lag}")
        df[f"seasonal_moving_average_{i}_lag_{day_lag}"] = df.groupby("dayofweek")["value"].transform(lambda x: x.shift(day_lag).rolling(i).mean())
```
Thanks @slavakx for the examples.
Some of these window features can already be built in skforecast with `ForecasterAutoregCustom`, for example:

```python
# Custom function to create predictors
# ==============================================================================
def create_predictors(y):
    """
    Create the first 10 lags of a time series and
    calculate the moving average with window 20.
    """
    lags = y[-1:-11:-1]      # window size needed = 10
    mean = np.mean(y[-20:])  # window size needed = 20
    predictors = np.hstack([lags, mean])
    return predictors

# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoregCustom(
                 regressor       = RandomForestRegressor(random_state=123),
                 fun_predictors  = create_predictors,
                 name_predictors = [f'lag {i}' for i in range(1, 11)] + ['moving_avg_20'],
                 window_size     = 20  # the window needed by the mean is the most restrictive one
             )
forecaster.fit(y=data_train)
forecaster
```
We are currently working on improving the flexibility and speed.
More detailed information here: https://skforecast.org/0.12.1/user_guides/custom-predictors
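As a quick sanity check of what `create_predictors` returns (a self-contained sketch that repeats the function using plain NumPy), the output is an 11-element vector: the 10 most recent values in reverse order, followed by the 20-value moving average:

```python
import numpy as np

def create_predictors(y):
    """First 10 lags plus the moving average with window 20."""
    lags = y[-1:-11:-1]      # last 10 values, most recent first
    mean = np.mean(y[-20:])  # moving average over the last 20 values
    return np.hstack([lags, mean])

y = np.arange(30, dtype=float)
predictors = create_predictors(y)  # shape (11,)
```

The order of the elements is what `name_predictors` above must match.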
The way that custom predictors are calculated in `ForecasterAutoregCustom` is potentially inefficient (see the relevant code snippet). The features are computed in a for loop and appended to a list. Whilst this is very generic, it is inefficient for the most common use case of window features (e.g., using the rolling mean, standard deviation, etc. at various different window sizes). There are optimised ways of computing these rolling statistics which take the whole series as input and output all the features across time, not just for one time period as required by the current implementation of `ForecasterAutoregCustom`. Pandas and NumPy have efficient implementations of rolling statistics, and they can be further sped up with Numba (see the window ops stuff here).
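To illustrate the point (an illustrative sketch, not skforecast code), computing a rolling mean window by window versus once over the whole series yields identical values, but the vectorized call avoids the Python-level loop entirely:

```python
import numpy as np
import pandas as pd

y = pd.Series(np.random.RandomState(0).normal(size=1_000))
window_size = 20

# Per-window computation, as in `fun_predictors`: one call per time step.
loop_means = np.array([
    y.to_numpy()[i:i + window_size].mean()
    for i in range(len(y) - window_size + 1)
])

# Whole-series computation: a single vectorized call gives all the features.
vectorized_means = y.rolling(window_size).mean().dropna().to_numpy()

assert np.allclose(loop_means, vectorized_means)
```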
Some options I think would be helpful for skforecast: 1) offer an additional way to derive features from the target variable, so a user can supply a more efficient way of calculating window features; 2) integrate window features into the API itself, in a similar way to lag features, so that a more efficient way of calculating them can be included under the hood.

On a side note, in my opinion it is counterintuitive that `ForecasterAutoreg` supports a `lags` argument but `ForecasterAutoregCustom` requires the user to create the lag features in a custom function that the user must provide.

Best wishes, Kishan