Nixtla / mlforecast

Scalable machine 🤖 learning for time series forecasting.
https://nixtlaverse.nixtla.io/mlforecast
Apache License 2.0
789 stars 74 forks

Cross validation with prediction_intervals and in-sample predictions enabled lacks folds #327

Closed adriaanvh1 closed 3 months ago

adriaanvh1 commented 3 months ago

What happened + What you expected to happen

Bug
When performing cross-validation with prediction intervals enabled and fitted=True set (to be able to retrieve the in-sample predictions), only the last fold is included in the in-sample data.

Expected behaviour
All folds should be present in the output.

Useful information
This is caused by the nested cross-validation that fits the conformal prediction intervals inside the explicitly called cross_validation. More specifically, the in-sample predictions are reset to an empty list here: https://github.com/Nixtla/mlforecast/blob/5b08a3ef3d2e448916b6aa74bb8e74814090a2cf/mlforecast/forecast.py#L844

The last fold runs cross_validation to fit the conformal prediction intervals, which resets the attribute. The in-sample predictions of the last fold are then appended, so the result contains only the in-sample predictions of the last fold.
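To make the mechanism easier to follow, here is a minimal toy sketch of the pattern described above. It is not mlforecast's code: `ToyForecaster`, `PatchedToyForecaster`, the `cv_fitted_values_` attribute, and the `with_intervals` flag are hypothetical names made up for illustration. The patched variant shows one possible remedy (collecting into a local list), which is not necessarily the fix adopted upstream.

```python
class ToyForecaster:
    """Hypothetical stand-in for the pattern in the report, not mlforecast itself."""

    def __init__(self):
        self.cv_fitted_values_ = []

    def fit(self, fold, with_intervals):
        if with_intervals:
            # Nested cross-validation, standing in for the conformal
            # calibration step. It goes through the same method and
            # therefore resets the shared attribute.
            self.cross_validation(n_windows=2, with_intervals=False, fitted=False)
        return f"fitted-values-fold-{fold}"

    def cross_validation(self, n_windows, with_intervals, fitted):
        # Reset on every call, including the nested ones.
        self.cv_fitted_values_ = []
        for fold in range(n_windows):
            fitted_values = self.fit(fold, with_intervals)
            if fitted:
                self.cv_fitted_values_.append(fitted_values)
        return self.cv_fitted_values_


class PatchedToyForecaster(ToyForecaster):
    """One possible remedy: accumulate in a local list, assign once at the end."""

    def cross_validation(self, n_windows, with_intervals, fitted):
        collected = []  # local to this call, so nested calls cannot clear it
        for fold in range(n_windows):
            fitted_values = self.fit(fold, with_intervals)
            if fitted:
                collected.append(fitted_values)
        self.cv_fitted_values_ = collected
        return collected


buggy = ToyForecaster().cross_validation(n_windows=4, with_intervals=True, fitted=True)
patched = PatchedToyForecaster().cross_validation(n_windows=4, with_intervals=True, fitted=True)
print(len(buggy), len(patched))  # 1 4
```

In the toy, every outer fold triggers a nested call that clears the shared list, so only the value appended after the final reset survives. This matches the symptom of a single fold in the output.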

Versions / Dependencies

mlforecast: 0.12.0
datasetsforecast: 0.0.8
lightgbm: 4.3.0
python: 3.10.14
OS: macOS Monterey v12.6

Reproduction script

from mlforecast import MLForecast
from datasetsforecast.m5 import M5
from mlforecast.utils import PredictionIntervals
from lightgbm import LGBMRegressor

# Get data and take subset
target, _exogenous, _static_vars = M5.load("./data_dir")
unique_ids = target["unique_id"].unique()[::100]
target = target[target["unique_id"].isin(unique_ids)]

# Define model
fcst = MLForecast(
    models=[LGBMRegressor(n_estimators=2)],
    freq="D",
    lags=[1],
)

# Get in-sample cross-validation predictions
h = 14
_cv_results = fcst.cross_validation(
    df=target,
    n_windows=4,
    prediction_intervals=PredictionIntervals(n_windows=2, h=h),
    level=[80],
    h=h,
    fitted=True,
)
cv_results_insample = fcst.cross_validation_fitted_values()

num_folds = cv_results_insample["fold"].nunique()
print(num_folds)  # Should be 4, is 1

Issue Severity

High: It blocks me from completing my task.