Unable to make forecasts on new data

denwolff commented 4 months ago

What happened + What you expected to happen

I'm trying to make predictions for data the model hasn't seen yet, using 'fcst.predict' with the 'new_df' parameter. This works for me with the M5 example dataset, but not with my own data, where I get the error: KeyError: "['Ain', 'Bin', 'Cin'] not in index" (Ain, Bin and Cin are my three features). The data I trained the model with and my new data have exactly the same structure and dtypes, and neither contains missing values.

Yet 'predict' complains that some features are missing (it's unclear to me whether in the training set or in the new data):

predictions_new_data = fcst.predict(h=FORECAST_HORIZON_TEST, new_df=X_NEW_TIMESERIES)

Versions / Dependencies

Python 3.10.11
mlforecast 0.13.0

Reproduction script

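import lightgbm as lgb
from mlforecast import MLForecast
from mlforecast.lag_transforms import RollingMean
from mlforecast.target_transforms import Differences
from window_ops.expanding import expanding_mean  # assumed source of expanding_mean

lags = [1, 2, 3, 4, 5, 10, 50, 100, 129]

fcst = MLForecast(
    models=lgb.LGBMRegressor(random_state=0, verbosity=-1),
    freq=1,
    lags=lags,
    lag_transforms={
        1: [expanding_mean],
        100: [RollingMean(window_size=100)],
    },
    target_transforms=[Differences([24])],
)
# static_features=[] marks every non-target column as dynamic (exogenous)
fcst.fit(X_TRAIN, static_features=[])

predictions_new_data = fcst.predict(h=FORECAST_HORIZON_TEST, new_df=X_NEW_TIMESERIES)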

Issue Severity

High: It blocks me from completing my task.

jmoralez commented 4 months ago

Hey @denwolff, thanks for using mlforecast. The new_df is like the new "training set"; it will be used to extract the lags, times, etc. If you have exogenous features, you also have to provide X_df with the future values of the exogenous features (for the times after new_df).
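
A small sketch may help make "the times after new_df" concrete. The helper below is hypothetical (it is not part of mlforecast) and assumes integer timestamps with freq=1, as in the reproduction script. It lists the (series id, time) pairs that X_df would need to cover for a horizon of h steps.

```python
def required_future_index(new_rows, h, freq=1):
    """Return the (series id, time) pairs that the future-features frame must cover.

    new_rows: iterable of (series id, integer timestamp) pairs taken from new_df.
    """
    last_seen = {}
    for uid, t in new_rows:
        # keep the latest timestamp observed for each series
        last_seen[uid] = max(t, last_seen.get(uid, t))
    return [
        (uid, last + step * freq)
        for uid, last in sorted(last_seen.items())
        for step in range(1, h + 1)
    ]


# hypothetical history: one series observed at times 0..9
history = [("series_a", t) for t in range(10)]
future_index = required_future_index(history, h=3)
```

For this history, the future features would be needed at times 10, 11 and 12, i.e. the three steps that follow the last row of new_df.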

denwolff commented 4 months ago

Hi, thank you very much for your response. Sorry, but several points still confuse me. First of all, when I try with the M3 or M5 dataset, calling '.predict' with the new_df parameter works without providing an argument for X_df.

I don't understand why providing X_df is then necessary for my data.

Second, concerning the phrase "If you have exogenous features you also have to provide X_df with the future values of the exogenous features (for the times after new_df)": what I would like to do is use the model that has already been trained to predict an entirely new time series, new_df, that the model has not seen before. So no further training on a new dataset should be necessary? Why do I need to train the model again? Or is it not possible to use the pretrained model on an entirely new time series without having trained it on that series' first few samples? (But then I wouldn't understand why it works for the M3/M5 data.)

Then, the future values of the exogenous features after new_df would mean the future values of the future values that I want to predict? I must be misunderstanding something.

(I was expecting the issue to be somehow related to the way my dataframes are structured, though I could find no difference from the M3/M5 data, for which it worked.)

jmoralez commented 4 months ago

What are you setting as static_features in those cases? If you're not setting anything, your features are being interpreted as static and thus you don't need to provide their future values.
It won't retrain the models, but since the model uses lag-based features we need to get them from somewhere.
By future values I mean whatever is in your forecasting horizon. So you need to provide the data that you have for the target, so that we can compute the lag features, and then the values of the exogenous features in the forecast horizon.
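
To illustrate why history is needed even without retraining, here is a minimal pure-Python sketch of how lag features for the first forecast step could be read off past target values. The function name is hypothetical and this is not mlforecast's internal implementation; it also ignores the Differences([24]) target transform from the reproduction script, which would require additional history on top of the largest lag.

```python
def next_step_lag_features(history, lags):
    """Build the lag features for the first forecast step from past target values."""
    needed = max(lags)
    if len(history) < needed:
        raise ValueError(f"need at least {needed} past values, got {len(history)}")
    # lag k is the value observed k steps before the step being predicted
    return {f"lag{k}": history[-k] for k in lags}


lags = [1, 2, 3, 4, 5, 10, 50, 100, 129]  # lags from the reproduction script
history = list(range(1, 131))             # 130 past target values: 1, 2, ..., 130
features = next_step_lag_features(history, lags)
```

With fewer than 129 past values the largest lag cannot be filled, which is why a brand-new series still has to come with some observed history.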

denwolff commented 4 months ago

Thank you, I realized that in the M3/M5 dataset examples I had not set static_features=[], and therefore the exogenous features had actually been treated as static. I understand now: when I have exogenous features, for new_df I need to give some samples from the beginning of the new series (target + exogenous features), and for X_df the exogenous features of the future samples I want to predict.

Thank you very much for your help!
