Bug Report Checklist
[x] I provided code that demonstrates a minimal reproducible example.
[x] I confirmed bug exists on the latest mainline of AutoGluon via source install.
[x] I confirmed bug exists on the latest stable version of AutoGluon.
Describe the bug
I have a time series dataset spanning one year at one-minute resolution (so 365*24*60 = 525,600 datapoints). If I fit a TimeSeriesPredictor to forecast a window with prediction_length=15, training is very fast. But I want to evaluate it on a full day (keeping prediction_length=15), so I set num_val_windows=24*4 and also refit_every_n_windows=None to keep training runtime comparable (since each model should only be trained once and then evaluated on all validation windows). In this second case, however, training takes vastly longer, which I wouldn't expect. Is this a bug?
Essentially, the refit_every_n_windows option doesn't seem to have any effect for this dataset. Interestingly, I've also tested this on a dataset covering only a single day (again at minute resolution, so 24*60 = 1,440 datapoints); there, refit_every_n_windows=None did reduce training runtime significantly, essentially matching the runtime with num_val_windows=1.
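For reference, the sizes involved (plain arithmetic, just to make the setup concrete):

n_points = 365 * 24 * 60                          # 525,600 rows, matching the logs below
prediction_length = 15
windows_per_day = 24 * 60 // prediction_length    # 96 = 24*4 validation windows to cover one day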
Expected behavior
I expect a fit with refit_every_n_windows=None to take roughly the same time on the same dataset, regardless of the value of num_val_windows.
To Reproduce
import pandas as pd
import numpy as np
from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor
df = pd.DataFrame(
    {
        "timestamp": pd.date_range("2023-01-01", "2024-01-01", freq="min", inclusive="left"),
        "target": np.sin(np.arange(365*24*60)),
        "item_id": ["item_one"]*365*24*60,
    }
)
train_data = TimeSeriesDataFrame.from_data_frame(
    df,
    id_column="item_id",
    timestamp_column="timestamp",
)
# this will train fast since num_val_windows=1
predictor1 = TimeSeriesPredictor(
    prediction_length=15,
    path="refit_every_n_windows_test",
    target="target",
    eval_metric="MAE",
)
predictor1.fit(
    train_data,
    presets="fast_training",
    num_val_windows=1,
)
# this is obviously much slower since num_val_windows=24*4
predictor2 = TimeSeriesPredictor(
    prediction_length=15,
    path="refit_every_n_windows_test",
    target="target",
    eval_metric="MAE",
)
predictor2.fit(
    train_data,
    presets="fast_training",
    num_val_windows=24*4,
)
# I'd expect this to be about as fast as predictor1 (at least in training time), but it behaves more like predictor2 (much slower)
predictor3 = TimeSeriesPredictor(
    prediction_length=15,
    path="refit_every_n_windows_test",
    target="target",
    eval_metric="MAE",
)
predictor3.fit(
    train_data,
    presets="fast_training",
    num_val_windows=24*4,
    refit_every_n_windows=None,
)
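For anyone re-running the snippet, the runtime comparison can be made explicit with a small standard-library timing wrapper around the fit calls (illustrative sketch only, not part of the original runs):

import time

def timed_fit(predictor, **fit_kwargs):
    # wall-clock timer around a single fit call
    start = time.perf_counter()
    predictor.fit(train_data, presets="fast_training", **fit_kwargs)
    print(f"fit with {fit_kwargs} took {time.perf_counter() - start:.1f} s")

# e.g. timed_fit(predictor1, num_val_windows=1)
#      timed_fit(predictor3, num_val_windows=24*4, refit_every_n_windows=None)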
Screenshots / Logs
OUTPUT FOR predictor1:
Beginning AutoGluon training...
AutoGluon will save models to 'refit_every_n_windows_test'
=================== System Info ===================
AutoGluon Version: 1.1.1
Python Version: 3.10.13
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.22631
CPU Count: 12
GPU Count: 0
Memory Avail: 1.08 GB / 15.69 GB (6.9%)
Disk Space Avail: 101.43 GB / 235.67 GB (43.0%)
===================================================
Setting presets to: fast_training
Fitting with arguments:
{'enable_ensemble': True,
'eval_metric': MAE,
'hyperparameters': 'very_light',
'known_covariates_names': [],
'num_val_windows': 1,
'prediction_length': 15,
'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
'random_seed': 123,
'refit_every_n_windows': 1,
'refit_full': False,
'skip_model_selection': False,
'target': 'target',
'verbosity': 2}
Inferred time series frequency: 'min'
Provided train_data has 525600 rows, 1 time series. Median time series length is 525600 (min=525600, max=525600).
Provided data contains following columns:
target: 'target'
AutoGluon will gauge predictive performance using evaluation metric: 'MAE'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
===================================================
Starting training. Start time is 2024-09-06 11:39:32
Models that will be trained: ['Naive', 'SeasonalNaive', 'RecursiveTabular', 'DirectTabular', 'ETS', 'Theta']
Training timeseries model Naive.
-1.0222 = Validation score (-MAE)
0.36 s = Training runtime
0.09 s = Validation (prediction) runtime
Training timeseries model SeasonalNaive.
-0.7292 = Validation score (-MAE)
0.34 s = Training runtime
0.09 s = Validation (prediction) runtime
Training timeseries model RecursiveTabular.
-0.0006 = Validation score (-MAE)
36.15 s = Training runtime
2.41 s = Validation (prediction) runtime
Training timeseries model DirectTabular.
-0.0008 = Validation score (-MAE)
28.96 s = Training runtime
0.53 s = Validation (prediction) runtime
Training timeseries model ETS.
-0.6342 = Validation score (-MAE)
0.31 s = Training runtime
0.11 s = Validation (prediction) runtime
Training timeseries model Theta.
-0.9614 = Validation score (-MAE)
0.31 s = Training runtime
0.20 s = Validation (prediction) runtime
Fitting simple weighted ensemble.
Ensemble weights: {'DirectTabular': 0.62, 'RecursiveTabular': 0.38}
-0.0005 = Validation score (-MAE)
0.39 s = Training runtime
2.94 s = Validation (prediction) runtime
Training complete. Models trained: ['Naive', 'SeasonalNaive', 'RecursiveTabular', 'DirectTabular', 'ETS', 'Theta', 'WeightedEnsemble']
Total runtime: 70.57 s
Best model: WeightedEnsemble
Best model score: -0.0005
OUTPUT FOR predictor2 (I never managed to let it finish; it seemed to get stuck on RecursiveTabular):
Beginning AutoGluon training...
AutoGluon will save models to 'refit_every_n_windows_test'
=================== System Info ===================
AutoGluon Version: 1.1.1
Python Version: 3.10.13
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.22631
CPU Count: 12
GPU Count: 0
Memory Avail: 2.95 GB / 15.69 GB (18.8%)
Disk Space Avail: 100.57 GB / 235.67 GB (42.7%)
===================================================
Setting presets to: fast_training
Fitting with arguments:
{'enable_ensemble': True,
'eval_metric': MAE,
'hyperparameters': 'very_light',
'known_covariates_names': [],
'num_val_windows': 96,
'prediction_length': 15,
'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
'random_seed': 123,
'refit_every_n_windows': 1,
'refit_full': False,
'skip_model_selection': False,
'target': 'target',
'verbosity': 2}
Inferred time series frequency: 'min'
Provided train_data has 525600 rows, 1 time series. Median time series length is 525600 (min=525600, max=525600).
Provided data contains following columns:
target: 'target'
AutoGluon will gauge predictive performance using evaluation metric: 'MAE'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
===================================================
Starting training. Start time is 2024-09-06 14:35:41
Models that will be trained: ['Naive', 'SeasonalNaive', 'RecursiveTabular', 'DirectTabular', 'ETS', 'Theta']
Training timeseries model Naive.
-0.8330 = Validation score (-MAE)
52.87 s = Training runtime
0.08 s = Validation (prediction) runtime
Training timeseries model SeasonalNaive.
-0.6931 = Validation score (-MAE)
38.15 s = Training runtime
0.08 s = Validation (prediction) runtime
Training timeseries model RecursiveTabular.
OUTPUT FOR predictor3:
Beginning AutoGluon training...
AutoGluon will save models to 'refit_every_n_windows_test'
=================== System Info ===================
AutoGluon Version: 1.1.1
Python Version: 3.10.13
Operating System: Windows
Platform Machine: AMD64
Platform Version: 10.0.22631
CPU Count: 12
GPU Count: 0
Memory Avail: 2.34 GB / 15.69 GB (14.9%)
Disk Space Avail: 100.90 GB / 235.67 GB (42.8%)
===================================================
Setting presets to: fast_training
Fitting with arguments:
{'enable_ensemble': True,
'eval_metric': MAE,
'hyperparameters': 'very_light',
'known_covariates_names': [],
'num_val_windows': 96,
'prediction_length': 15,
'quantile_levels': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
'random_seed': 123,
'refit_full': False,
'skip_model_selection': False,
'target': 'target',
'verbosity': 2}
Inferred time series frequency: 'min'
Provided train_data has 525600 rows, 1 time series. Median time series length is 525600 (min=525600, max=525600).
Provided data contains following columns:
target: 'target'
AutoGluon will gauge predictive performance using evaluation metric: 'MAE'
This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
===================================================
Starting training. Start time is 2024-09-06 12:14:39
Models that will be trained: ['Naive', 'SeasonalNaive', 'RecursiveTabular', 'DirectTabular', 'ETS', 'Theta']
Training timeseries model Naive.
-0.8330 = Validation score (-MAE)
38.34 s = Training runtime
0.19 s = Validation (prediction) runtime
Training timeseries model SeasonalNaive.
-0.6931 = Validation score (-MAE)
32.74 s = Training runtime
0.10 s = Validation (prediction) runtime
Training timeseries model RecursiveTabular.
-0.0003 = Validation score (-MAE)
292.69 s = Training runtime
2.82 s = Validation (prediction) runtime
Training timeseries model DirectTabular.
-0.0007 = Validation score (-MAE)
75.64 s = Training runtime
0.43 s = Validation (prediction) runtime
Training timeseries model ETS.
-0.6369 = Validation score (-MAE)
139.11 s = Training runtime
0.12 s = Validation (prediction) runtime
Training timeseries model Theta.
-0.8383 = Validation score (-MAE)
87.63 s = Training runtime
0.21 s = Validation (prediction) runtime
Fitting simple weighted ensemble.
Ensemble weights: {'DirectTabular': 0.08, 'RecursiveTabular': 0.92}
-0.0003 = Validation score (-MAE)
47.86 s = Training runtime
3.25 s = Validation (prediction) runtime
Training complete. Models trained: ['Naive', 'SeasonalNaive', 'RecursiveTabular', 'DirectTabular', 'ETS', 'Theta', 'WeightedEnsemble']
Total runtime: 738.78 s
Best model: WeightedEnsemble
Best model score: -0.0003
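The per-model fit times can also be read off programmatically instead of from the raw log; something along these lines should work (column names quoted from memory, so treat as a sketch):

lb = predictor3.leaderboard()   # returns a pandas DataFrame with per-model validation scores and times
# adjust the column names below if they differ in this AutoGluon version
print(lb[["model", "score_val", "fit_time_marginal", "pred_time_val"]])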