awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Training on M4 Daily fails for multiple models #1513

Closed · borchero closed this issue 3 years ago

borchero commented 3 years ago

Description

Training on the M4 Daily dataset fails for multiple models provided by GluonTS; the reproduction below uses DeepAR.

Curiously, training always fails after 70 epochs when using a batch size of 32 and 2472 batches per epoch. The issue seems to be related to data transformations.

Note that training works using MQRNN/MQCNN.

To Reproduce

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer
from gluonts.mx.trainer.learning_rate_scheduler import LearningRateReduction

dataset = get_dataset("m4_daily")
callbacks = [
    LearningRateReduction(
        objective="min", patience=9, base_lr=1e-3, decay_factor=0.5
    )
]
estimator = DeepAREstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    trainer=Trainer(epochs=81, num_batches_per_epoch=2472, callbacks=callbacks),
    context_length=4 * dataset.metadata.prediction_length,
)
estimator.train(dataset.train)

And wait until epoch 70...

Error message or code output

Stacktrace for training using any of the models above:

File "/usr/local/lib/python3.8/site-packages/gluonts/mx/model/estimator.py", line 201, in train
    return self.train_model(
  File "/usr/local/lib/python3.8/site-packages/gluonts/mx/model/estimator.py", line 176, in train_model
    self.trainer(
  File "/usr/local/lib/python3.8/site-packages/gluonts/mx/trainer/_base.py", line 436, in __call__
    epoch_loss = loop(
  File "/usr/local/lib/python3.8/site-packages/gluonts/mx/trainer/_base.py", line 325, in loop
    for batch_no, batch in enumerate(it, start=1):
  File "/usr/local/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.8/site-packages/gluonts/itertools.py", line 51, in get_batch
    return list(itertools.islice(it, batch_size))
  File "/usr/local/lib/python3.8/site-packages/gluonts/transform/dataset.py", line 53, in __iter__
    yield from self.transformation(
  File "/usr/local/lib/python3.8/site-packages/gluonts/transform/_base.py", line 81, in __call__
    for data_entry in data_it:
  File "/usr/local/lib/python3.8/site-packages/gluonts/transform/_base.py", line 141, in __call__
    raise Exception(
Exception: Reached maximum number of idle transformation calls.
This means the transformation looped over GLUONTS_MAX_IDLE_TRANSFORMS=100 inputs without returning any output.
This occurred in the following transformation:
gluonts.transform.split.InstanceSplitter(dummy_value=0.0, forecast_start_field="forecast_start", future_length=14, instance_sampler=gluonts.transform.sampler.ExpectedNumInstanceSampler(axis=-1, min_past=0, min_future=14, num_instances=1.0, total_length=12991544461, n=5541354), is_pad_field="is_pad", lead_time=0, output_NTC=True, past_length=1121, start_field="start", target_field="target", time_series_fields=["time_feat", "observed_values"])
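
For context, the limit in this message comes from the GLUONTS_MAX_IDLE_TRANSFORMS environment variable (default 100). A minimal sketch of raising it as a temporary mitigation, assuming the variable is read from the process environment as the message suggests (this only makes the loader more tolerant of idle passes; it does not address the underlying sampling behavior discussed further down):

import os

# Assumption: GLUONTS_MAX_IDLE_TRANSFORMS is picked up from the process
# environment; set it before importing/running gluonts.
os.environ["GLUONTS_MAX_IDLE_TRANSFORMS"] = "1000"  # default is 100

# ... then run the training snippet from above.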

Environment

Full list of dependencies:

PyYAML = "^5.4.1"
click = "^7.1.2"
fastparquet = "^0.6.1"
fbprophet = "^0.7.1"
gluonts = {git = "https://github.com/awslabs/gluon-ts.git", rev = "f6948bacb7a038df3374e768ad4939455c74b49d"}
holidays = "^0.11.1"
mxnet = "^1.8.0"
numpy = "^1.20.3"
pandas = "^1.2.4"
pyarrow = "^4.0.0"
pydantic = "^1.8.2"
pystan = "^2.0.0"
python = ">=3.8,<3.10"
sagemaker = "^2.40.0"
sagemaker-training = "^3.9.2"
scikit-learn = "^0.24.2"
scipy = "^1.6.3"
toolz = "^0.11.1"
tqdm = "^4.60.0"
ujson = "^4.0.2"
xgboost = "^1.4.1"

borchero commented 3 years ago

Also see #1512

lostella commented 3 years ago

Is the snippet missing batch_size=50? (see description above)

borchero commented 3 years ago

Ah sorry, I meant the default, which (I just checked) is 32.

lostella commented 3 years ago

A faster way to reproduce this (not sure how many batches one has to wait, however):

from itertools import islice

import numpy as np
from tqdm import tqdm

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.itertools import Cached
from gluonts.model.deepar import DeepAREstimator

np.random.seed(0)

dataset = get_dataset("m4_daily")

estimator = DeepAREstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    context_length=4 * dataset.metadata.prediction_length,
)

transformed_dataset = Cached(estimator.create_transformation().apply(dataset.train))

num_batches = 300_000

for batch in tqdm(islice(estimator.create_training_data_loader(transformed_dataset), num_batches), total=num_batches):
    pass

I got the error after ~15k batches:

  5%|███▊                                                                       | 15055/300000 [01:16<24:09, 196.52it/s]
Traceback (most recent call last):
  File "issues/run_1513.py", line 24, in <module>
    for batch in tqdm(islice(estimator.create_training_data_loader(transformed_dataset), num_batches), total=num_batches):
  File "/Users/stellalo/.virtualenvs/gluonts/lib/python3.7/site-packages/tqdm/std.py", line 1166, in __iter__
    for obj in iterable:
  File "/Users/stellalo/gluon-ts/src/gluonts/itertools.py", line 51, in get_batch
    return list(itertools.islice(it, batch_size))
  File "/Users/stellalo/gluon-ts/src/gluonts/transform/_base.py", line 104, in __iter__
    self.base_dataset, is_train=self.is_train
  File "/Users/stellalo/gluon-ts/src/gluonts/transform/_base.py", line 123, in __call__
    for data_entry in data_it:
  File "/Users/stellalo/gluon-ts/src/gluonts/transform/_base.py", line 184, in __call__
    f"Reached maximum number of idle transformation calls.\n"
Exception: Reached maximum number of idle transformation calls.
This means the transformation looped over GLUONTS_MAX_IDLE_TRANSFORMS=100 inputs without returning any output.
This occurred in the following transformation:
gluonts.transform.split.InstanceSplitter(dummy_value=0.0, forecast_start_field="forecast_start", future_length=14, instance_sampler=gluonts.transform.sampler.ExpectedNumInstanceSampler(axis=-1, min_past=0, min_future=14, num_instances=1.0, total_length=1129623519, n=481630), is_pad_field="is_pad", lead_time=0, output_NTC=True, past_length=1149, start_field="start", target_field="target", time_series_fields=["time_feat", "observed_values"])

lostella commented 3 years ago

The problem is that some time series in the dataset are extremely short compared to the average length. The sampling strategy tries to sample every time point across the dataset uniformly, so points in shorter time series must be sampled with correspondingly lower probability; as a result, the instance splitter frequently samples no training instance at all for a short series, and after a certain number of consecutive idle transformations an exception is raised.

I believe the exception mechanism was put there to guard against 100% idle iteration, i.e. data being altogether shorter than the method requires. But maybe this check can be done differently, so that no exception is raised when only some of the series are short.
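
To make this concrete, here is a rough back-of-the-envelope sketch (an assumption about the sampler's behavior, not the actual GluonTS code): if each valid time point is kept with probability roughly num_instances / average_length, a series much shorter than average yields no instance at all on most passes. The average length below is estimated from total_length / n in the sampler state printed in the traceback above; the shorter lengths are illustrative.

# Rough illustration only, not the GluonTS implementation.
# Assumption: each valid time point is kept with probability
# p ≈ num_instances / avg_length, so
# P(no instance from one pass over a series) ≈ (1 - p) ** series_length.
num_instances = 1.0
avg_length = 1129623519 / 481630  # ≈ 2345, from total_length / n in the traceback above
p_keep = num_instances / avg_length

for length in (107, 500, int(avg_length)):  # 107: an illustrative very short series
    p_empty = (1.0 - p_keep) ** length
    print(f"series length {length:4d}: P(no instance sampled) ≈ {p_empty:.2f}")

So a very short series produces nothing on the vast majority of passes, and once enough consecutive inputs happen to yield nothing, the idle-transform limit is hit.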

borchero commented 3 years ago

Fixed by #1546