Nixtla / mlforecast

Scalable machine 🤖 learning for time series forecasting.
https://nixtlaverse.nixtla.io/mlforecast
Apache License 2.0
858 stars 84 forks

LightGBMCV: Ability to split CV object TS/data generation from LGBM Booster creation #356

Closed · philipmassie closed 3 months ago

philipmassie commented 3 months ago

Description

During hyperparameter tuning with Optuna, LightGBMCV.setup becomes time-consuming as training sets grow and the number of folds increases. I think it would be useful to be able to generate the data folds once and then, during optimization, only replace the internal boosters, updating them with each trial's parameters.
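Roughly the split I'm imagining is sketched below; setup_data and reset_boosters are hypothetical names to illustrate the idea, not existing mlforecast API.

# Hypothetical sketch: setup_data/reset_boosters don't exist in mlforecast,
# this is just the shape of the split I have in mind.
cv = LightGBMCV(freq='MS', lags=[12])

# expensive: build the fold datasets once, outside the tuning loop
cv.setup_data(df=df, n_windows=2, h=12, id_col='unique_id',
              time_col='ds', target_col='y', metric='rmse')

def objective(trial):
    params = {'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1e-1)}
    # cheap: fresh boosters on the cached folds, using this trial's params
    cv.reset_boosters(params=params)
    ...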

Use case

Using cross validation during hyperparameter tuning is a good way to reduce overfitting, but I can't figure out a way to use LightGBMCV with Optuna: the setup method takes longer with larger data sets, and I think most of that time is spent building the fold data, which only needs to happen once.

If this already exists or is totally stupid, I apologise.

I came across the approach below somewhere and it works well (I can't find the original link, but it's not mine). Nesting the setup inside the objective means the fold data gets rebuilt on every trial.

import pandas as pd
from mlforecast.lgb_cv import LightGBMCV
import optuna

df = pd.read_csv('https://datasets-nixtla.s3.amazonaws.com/air-passengers.csv', parse_dates=['ds'])

# data arguments passed to LightGBMCV.setup on every trial
fit_params = {
    'df': df,
    'id_col': 'unique_id',
    'time_col': 'ds',
    'target_col': 'y',
}

# LightGBM parameters that stay constant across trials
fixed_params = {
    'boosting_type': 'gbdt',
    'objective': 'rmse',
    'n_estimators': 10_000,
    'verbosity': -1,
}

def objective(trial):
    params = {
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1e-1, step=1e-3),
    }

    # the expensive part: setup rebuilds the CV folds on every trial,
    # even though the folds don't depend on the trial's parameters
    cv = LightGBMCV(freq='MS', lags=[12])
    cv.setup(
        params={**fixed_params, **params},
        n_windows=2,
        h=12,
        **fit_params,
        metric='rmse',
    )

    hist = []
    eval_every = 10
    early_stopping_evals = 20
    early_stopping_pct = 0.01

    # train in chunks of eval_every rounds so Optuna can prune bad trials early
    for i in range(0, fixed_params['n_estimators'], eval_every):
        val = cv.partial_fit(eval_every)
        trial.report(val, step=i)
        rounds = eval_every + i
        hist.append((rounds, val))
        if trial.should_prune():
            raise optuna.TrialPruned()
        if cv.should_stop(hist, early_stopping_evals, early_stopping_pct):
            print(f'Early stopping at round {rounds:,}')
            best_iter = cv.find_best_iter(hist, early_stopping_evals)
            trial.set_user_attr('n_estimators', best_iter)
            break
    return val

dbname = 'sqlite:///optuna.sqlite3'
study_name = 'testing_cv2'

study = optuna.create_study(
    storage=dbname,
    study_name=study_name,
    direction='minimize',
)

study.optimize(objective, n_trials=5, n_jobs=1)
print('Fin')
philipmassie commented 3 months ago

I have tried making copies of the cv object, but deepcopy doesn't fully copy the internal LightGBM state. cloudpickle similarly doesn't save the object in its entirety, and I'm not smart enough to figure out how to change that. I've also not found a way to 're-init' the cv object within the objective.
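For concreteness, the copy attempts looked roughly like this (a sketch reusing fit_params/fixed_params from the script above); neither round-trips the internal LightGBM state completely:

import copy
import cloudpickle

# expensive setup done once, then an attempt to clone the ready-made object per trial
cv_template = LightGBMCV(freq='MS', lags=[12])
cv_template.setup(params=fixed_params, n_windows=2, h=12, **fit_params, metric='rmse')

cv_copy = copy.deepcopy(cv_template)                         # LightGBM state not fully copied
cv_copy = cloudpickle.loads(cloudpickle.dumps(cv_template))  # same problem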

Reading this back, it's sounding more like a question than a feature request. If you think I should remove it and post elsewhere, please let me know.

philipmassie commented 3 months ago

Moving to discussion, apologies.