Nixtla / mlforecast

Scalable machine 🤖 learning for time series forecasting.
https://nixtlaverse.nixtla.io/mlforecast
Apache License 2.0

Exogenous features with global models #432

Open gasperpetelin opened 3 weeks ago

gasperpetelin commented 3 weeks ago

Description

I have two questions regarding how to handle covariates in Nixtla's MLForecast when performing global forecasting. Although there is rather extensive documentation on this, I am still not exactly sure how to properly train the model with covariate lags and then subsequently pass data to it when forecasting.

Let's assume I want to create a direct forecasting model with a forecasting horizon of 3, and that I have a dataframe with a single series. This series has 1 past covariate pc, 1 future covariate fc, and 1 static covariate sc. When training the forecasting model, I would like to use lags 1 and 2 for y, past covariate lags of 1 and 2, and future covariates known up to the forecasting horizon. The training dataframe df looks something like this:

┌───────────┬─────┬─────────────────────┬─────┬─────┬─────┐
│ unique_id ┆ y   ┆ ds                  ┆ pc  ┆ fc  ┆ sc  │
│ ---       ┆ --- ┆ ---                 ┆ --- ┆ --- ┆ --- │
│ i64       ┆ i64 ┆ datetime[μs]        ┆ i64 ┆ i64 ┆ i64 │
╞═══════════╪═════╪═════════════════════╪═════╪═════╪═════╡
│ 0         ┆ 0   ┆ 2020-01-01 00:00:00 ┆ 50  ┆ 100 ┆ 12  │
│ 0         ┆ 1   ┆ 2020-01-02 00:00:00 ┆ 51  ┆ 101 ┆ 12  │
│ 0         ┆ 2   ┆ 2020-01-03 00:00:00 ┆ 52  ┆ 102 ┆ 12  │
│ 0         ┆ 3   ┆ 2020-01-04 00:00:00 ┆ 53  ┆ 103 ┆ 12  │
│ 0         ┆ 4   ┆ 2020-01-05 00:00:00 ┆ 54  ┆ 104 ┆ 12  │
│ 0         ┆ 5   ┆ 2020-01-06 00:00:00 ┆ 55  ┆ 105 ┆ 12  │
│ 0         ┆ 6   ┆ 2020-01-07 00:00:00 ┆ 56  ┆ 106 ┆ 12  │
│ 0         ┆ 7   ┆ 2020-01-08 00:00:00 ┆ 57  ┆ 107 ┆ 12  │
│ 0         ┆ 8   ┆ 2020-01-09 00:00:00 ┆ 58  ┆ 108 ┆ 12  │
│ 0         ┆ 9   ┆ 2020-01-10 00:00:00 ┆ 59  ┆ 109 ┆ 12  │
└───────────┴─────┴─────────────────────┴─────┴─────┴─────┘

In issue #328, someone mentioned that past covariates should be handled in the following way:

If you want to turn a feature into a historic covariate in MLForecast, you can just use a lag that is longer than your forecasting horizon.

But is this really the case? Based on the code below, it seems the past covariate shift only has to be greater than 0, not necessarily larger than the forecasting horizon. Or does that restriction apply only to recursive forecasting?

If I understand correctly, the following code would handle past, future, and static covariates properly.

import polars as pl

def generate_covariates(df):
    return df.with_columns([
        pl.col('pc').shift(1).alias('pc_lag1'),    # past covariate lags
        pl.col('pc').shift(2).alias('pc_lag2'),
        pl.col('fc').shift(-1).alias('fc_lag-1'),  # leads of the future covariate
        pl.col('fc').shift(-2).alias('fc_lag-2'),
        pl.col('fc').shift(0).alias('fc_lag0'),
        pl.col('fc').shift(1).alias('fc_lag1'),
        pl.col('fc').shift(2).alias('fc_lag2'),
    ]).drop(['pc', 'fc'])

from mlforecast import MLForecast
from sklearn.linear_model import LinearRegression

fcst = MLForecast(
    models=LinearRegression(),
    freq='1d',
    lags=[1, 2],
)

df_past_future = generate_covariates(df).drop_nulls()
fcst.fit(df_past_future, static_features=['sc'], max_horizon=3)

Is this the correct way to preprocess the dataframe when training a model? And is there a difference between direct and recursive forecasting in how covariates should be handled?

The second question I have is about the simultaneous use of new_df and X_df. I couldn’t find any examples in the documentation where both parameters are used together. When calling predict, what exactly should be passed to new_df and X_df? For example, let's say I am using the same dataset for making predictions:

past = generate_covariates(df).head(5)
future = generate_covariates(df).tail(5).head(3)
fcst.predict(new_df=past, X_df=future.drop(['sc', 'y']), h=3)

Why, in this case, can't the past covariates (pc_lag1, pc_lag2) be dropped from X_df? From the documentation, I would assume that only future covariates need to be passed into X_df. I'm a bit concerned that passing incorrect arguments for the exogenous series might cause data leakage without me realizing it. Is there any utility function in Nixtla that helps handle this? Without careful handling of the covariates, one could easily introduce leakage.

I believe it would be helpful to add an example in How-to Guides > Exogenous Features on handling all three types of covariates simultaneously with the new_df and X_df interface, along with clear guidance on how each type of covariate should be shifted to prevent data leakage. Using all three types of covariates with global models seems to be one of the most common scenarios in forecasting. Alternatively, if it doesn't already exist, it might be valuable to add a utility function that takes a dataframe and the desired lags for future and past covariates and generates the correctly formatted dataframe. I think transform_exog is the closest thing to this, but one still has to be careful about how past/future covariates are shifted.

Any help in clarifying this would be greatly appreciated. Additionally, I'd be more than willing to help add an example to the documentation if it would be beneficial to others.


jmoralez commented 3 weeks ago

Hey @gasperpetelin, thanks for using mlforecast.

the past covariate shift has to be greater than 0 but not necessarily larger than the forecasting horizon. Or is that true only for recursive forecasting?

For direct forecasting a positive shift is enough; the restriction that the shift be greater than or equal to the horizon applies only to recursive forecasting.
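The reason for that restriction can be sketched in plain Python (hypothetical helper, not part of mlforecast): a lag-k feature at forecast step s needs the covariate value s − k steps past the training end, which is only known when s − k ≤ 0, so requiring k ≥ h keeps every recursive step inside the observed data.

```python
# Hypothetical helper (not part of mlforecast): for a lag-k feature,
# list the covariate offsets (relative to the training end) needed at
# forecast steps 1..h. Positive offsets lie in the unknown future.
def needed_offsets(lag, h):
    return [s - lag for s in range(1, h + 1)]

# lag 2 with h=3: step 3 would need the covariate 1 step past the training end
print(needed_offsets(2, 3))  # [-1, 0, 1]
# lag 3 (>= h): every step uses values already observed during training
print(needed_offsets(3, 3))  # [-2, -1, 0]
```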

Is this the correct way to preprocess dataframe when training a model? Is there a difference between direct and recursive forecasting and how covariates should be handled?

The negative shifts don't look right; the lags should always be in the past. You should instead split your features into train/future and provide the future values to predict through X_df.
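A minimal sketch of that split, using pandas and illustrative column names and values (any dataframe library works the same way): compute the shifted features on the full history, then cut at the training end so the trailing rows become X_df.

```python
import pandas as pd

# Compute shifted features on the full frame, then split: the last h rows
# (with the target dropped) become the X_df passed to predict().
full = pd.DataFrame({
    "unique_id": 0,
    "ds": pd.date_range("2020-01-01", periods=10, freq="D"),
    "y": range(10),
    "fc": range(100, 110),
})
full["fc_lag1"] = full.groupby("unique_id")["fc"].shift(1)

h = 3
train = full.iloc[:-h]                     # rows used in fit()
X_df = full.iloc[-h:].drop(columns=["y"])  # future rows for predict()
print(train.shape, X_df.shape)  # (7, 5) (3, 4)
```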

Both direct and recursive forecasting handle the features in the same way; the difference comes at the forecast step: direct forecasting only needs the features at the timestamp following the training end, while recursive forecasting needs them for 1 to h timestamps past the training end.

When calling predict, what exactly should be passed to new_df and X_df?

new_df is meant for transfer learning only, so that's probably not what you're looking for. In both cases X_df should hold the future values of the exogenous features, so it should have columns id, time, exog1, exog2, ..., where the times start immediately after the last training timestamp for each id. If you're not doing transfer learning you only need to provide X_df.
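For concreteness, here is a sketch of that X_df layout in pandas (the feature name and values are illustrative, not taken from the thread): one row per future step per id, starting right after the last training timestamp.

```python
import pandas as pd

# Hypothetical X_df for a single series whose training data ends on
# 2020-01-07, with h=3 and one known-in-advance feature. The timestamps
# must start immediately after the training end for each id.
last_train_ds = pd.Timestamp("2020-01-07")
h = 3
X_df = pd.DataFrame({
    "unique_id": [0] * h,
    "ds": pd.date_range(last_train_ds + pd.Timedelta(days=1), periods=h, freq="D"),
    "fc_lag0": [108, 109, 110],  # assumed-known future values
})
print(X_df)
```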

Why, in this case, can’t past covariates (pc_lag1, pc_lag2) be dropped in X_df?

There's no distinction in mlforecast between past and future covariates; there are only static and dynamic features. So every feature that is not static is expected to be in X_df.

I’m a bit concerned that using incorrect arguments for the exogenous series might cause data leakage without me realizing it

I don't think leakage would be a problem here; you just have to set up your features the same way you expect them to be available at forecast time. So if, by the time you make the forecast, you know some features at that timestamp but others only 2 timestamps behind, you should set up the lags accordingly.
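One way to think about "set up the lags accordingly" (a plain-Python sketch with hypothetical names, not an mlforecast API): record each feature's reporting delay in steps, then only use shifts at least that large.

```python
# Hypothetical availability map: fc is known at forecast time (delay 0),
# pc is only known up to 2 steps behind (delay 2). Using shifts >= the
# delay (and >= 1) keeps training features consistent with what will
# actually exist at prediction time.
availability = {"fc": 0, "pc": 2}
lags = {
    name: [max(delay, 1), max(delay, 1) + 1]  # two usable shifts per feature
    for name, delay in availability.items()
}
print(lags)  # {'fc': [1, 2], 'pc': [2, 3]}
```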

Feel free to post some follow-up questions, and we'd welcome a contribution to the documentation to make this clearer.