Nixtla / mlforecast

Scalable machine 🤖 learning for time series forecasting.
https://nixtlaverse.nixtla.io/mlforecast
Apache License 2.0

how to evaluate every n iters whilst also enabling toggle between recursive and direct forecasting? #390

Closed ardsnijders closed 4 weeks ago

ardsnijders commented 1 month ago

Hello,

I'm using an LGBM-based forecasting model and I want to apply cross-validation, where I also want to evaluate every n iterations in order to plot learning curves for my loss metrics of interest. As far as I can tell, this is possible with the LightGBMCV method, since it offers built-in support for evaluating every n iterations. However, it does not seem possible to do this whilst using a direct forecasting approach, which is an option I want to keep open for the future.

Conversely, the more general MLForecast.cross_validation() method does seem to support a toggle between recursive and direct forecasting, but it does not offer support for evaluating custom metrics every n iterations.
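
For concreteness, here's a rough sketch of the two entry points I'm comparing. The argument names (eval_every, num_iterations, n_windows, h, max_horizon) are from my reading of the docs, so treat them as approximate, and train stands for a long-format dataframe with unique_id/ds/y columns:

import lightgbm as lgb
from mlforecast import MLForecast
from mlforecast.lgb_cv import LightGBMCV

# 1) LightGBMCV: recursive forecasting only, but it evaluates every
#    eval_every iterations, which is what I need for learning curves
cv = LightGBMCV(freq='D', lags=[1, 7])
hist = cv.fit(
    train,
    n_windows=3,
    h=14,
    params={'learning_rate': 0.05},
    num_iterations=1_000,
    eval_every=10,
    metric='rmse',
)

# 2) MLForecast.cross_validation: can toggle between recursive and direct
#    forecasting via max_horizon, but only scores fully trained models
mlf = MLForecast(models=[lgb.LGBMRegressor()], freq='D', lags=[1, 7])
cv_df = mlf.cross_validation(train, n_windows=3, h=14, max_horizon=14)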

Is it possible to have both, and how can I accomplish this? In general, I think it would be nice if the .fit() method offered built-in support for early stopping and evaluating every n iterations.

Thanks!

jmoralez commented 1 month ago

Hey. LightGBMCV is specialized for recursive forecasting because that's hard to do manually: you have to train a few iterations and then predict the full horizon recursively.

If you want to do direct forecasting you can define an sklearn-compatible model that does the CV within fit and stores the results (you'd also need to decide what to do with predict if you want to use that as well). Keep in mind that it'd be one of the most expensive trainings possible, since you'll be performing CV max_horizon times. Having said that, I believe this achieves what you want:

import lightgbm as lgb
import numpy as np
import pandas as pd
from mlforecast import MLForecast
from utilsforecast.data import generate_series
from utilsforecast.feature_engineering import fourier

class MyModel(lgb.LGBMRegressor):
    def fit(self, X, y):
        last_date = X['orig_times'].max()
        n_folds = 3
        valid_size = 10
        all_idxs = np.arange(y.size)
        folds = []
        offset = valid_size * pd.offsets.Day()
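        # expanding-window splits: each fold trains on everything up to train_end
        # and validates on the following valid_size days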
        for i in range(n_folds):
            train_end = last_date - (n_folds - i) * offset
            valid_end = train_end + offset
            train_mask = X['orig_times'].le(train_end)
            valid_mask = X['orig_times'].gt(train_end) & X['orig_times'].le(valid_end)
            folds.append((all_idxs[train_mask], all_idxs[valid_mask]))
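        # orig_times is only needed to compute the splits, so drop it before training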
        ds = lgb.Dataset(X.drop(columns='orig_times'), y)
        params = self.get_params()
        n_iter = params.pop('n_estimators')
        self.eval_result_ = lgb.cv(params, ds, num_boost_round=n_iter, folds=folds)
        return self

    def predict(self, X):
        raise NotImplementedError

series = generate_series(5, equal_ends=True)
train, _ = fourier(series, freq='D', season_length=7, k=2)

# the original timestamps aren't used as a feature, so they aren't passed to the model by default
# adding them through this date feature ensures the model receives them so fit can compute the splits
def orig_times(times):
    return times

mlf = MLForecast(
    models={'my_model': MyModel(verbosity=-1)},
    freq='D',
    date_features=[orig_times],
)
mlf.fit(train, static_features=[], max_horizon=5)
# mlf.models_['my_model'] is a list of 5 trained models, one per horizon step
mlf.models_['my_model'][0].eval_result_

ardsnijders commented 1 month ago

Hi Jose, thanks for your reply. That seems fair; I hadn't considered the extra cost of doing CV for direct forecasting.

Is there an alternative, cheaper heuristic for determining the optimal number of iterations (which is essentially the purpose of doing CV, as far as I'm concerned) that still lets me evaluate and log metrics at fixed intervals whilst keeping an easy toggle between recursive and direct?

For instance, could I train either a direct or recursive model with MLForecast.fit() for, say, 1K iterations, call a manual CV routine, and then continue training for another 1K, and so forth?

jmoralez commented 1 month ago

I think you can do something similar to my example, but computing just a single split and using lgb.train or lgb.LGBMRegressor.fit, providing a validation set and a callback to record the metrics.
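
Something like this, for instance. This is an untested sketch that reuses the toy data from my example above; the 14-day validation window, the metric, and the early-stopping settings are just placeholders:

import lightgbm as lgb
import pandas as pd
from utilsforecast.data import generate_series
from utilsforecast.feature_engineering import fourier

# toy data, same shape as in the example above
series = generate_series(5, equal_ends=True)
train, _ = fourier(series, freq='D', season_length=7, k=2)

# single time-based split: the last 14 days of each series go to validation
cutoff = train['ds'].max() - 14 * pd.offsets.Day()
features = [c for c in train.columns if c not in ('unique_id', 'ds', 'y')]
train_mask = train['ds'].le(cutoff)
X_tr, y_tr = train.loc[train_mask, features], train.loc[train_mask, 'y']
X_va, y_va = train.loc[~train_mask, features], train.loc[~train_mask, 'y']

eval_result = {}
model = lgb.LGBMRegressor(n_estimators=1_000, verbosity=-1)
model.fit(
    X_tr,
    y_tr,
    eval_set=[(X_va, y_va)],
    eval_metric='rmse',
    callbacks=[
        lgb.record_evaluation(eval_result),      # stores the metric for every iteration
        lgb.log_evaluation(period=100),          # prints it every 100 iterations
        lgb.early_stopping(stopping_rounds=50),  # stops once it stops improving
    ],
)
# eval_result['valid_0']['rmse'] holds the learning curve to plot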

ardsnijders commented 1 month ago

Thanks, I will probably refactor to an lgb-based setup in that case. I appreciate the comprehensive and quick replies.

jmoralez commented 4 weeks ago

Closing since the issue seems to have been solved. Feel free to reopen if you have follow-up questions.