borchero closed this issue 3 years ago
Also see #1512
Is the snippet missing batch_size=50? (see description above)
Ah sorry, I meant the default, which (I just checked) is 32.
A faster way to reproduce this (not sure how many batches one has to wait, however):
from itertools import islice
import numpy as np
from tqdm import tqdm
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.itertools import Cached
from gluonts.model.deepar import DeepAREstimator
np.random.seed(0)
dataset = get_dataset("m4_daily")
estimator = DeepAREstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    context_length=4 * dataset.metadata.prediction_length,
)
transformed_dataset = Cached(estimator.create_transformation().apply(dataset.train))
num_batches = 300_000
for batch in tqdm(islice(estimator.create_training_data_loader(transformed_dataset), num_batches), total=num_batches):
pass
I got the error after ~15k batches:
5%|███▊ | 15055/300000 [01:16<24:09, 196.52it/s]
Traceback (most recent call last):
File "issues/run_1513.py", line 24, in <module>
for batch in tqdm(islice(estimator.create_training_data_loader(transformed_dataset), num_batches), total=num_batches):
File "/Users/stellalo/.virtualenvs/gluonts/lib/python3.7/site-packages/tqdm/std.py", line 1166, in __iter__
for obj in iterable:
File "/Users/stellalo/gluon-ts/src/gluonts/itertools.py", line 51, in get_batch
return list(itertools.islice(it, batch_size))
File "/Users/stellalo/gluon-ts/src/gluonts/transform/_base.py", line 104, in __iter__
self.base_dataset, is_train=self.is_train
File "/Users/stellalo/gluon-ts/src/gluonts/transform/_base.py", line 123, in __call__
for data_entry in data_it:
File "/Users/stellalo/gluon-ts/src/gluonts/transform/_base.py", line 184, in __call__
f"Reached maximum number of idle transformation calls.\n"
Exception: Reached maximum number of idle transformation calls.
This means the transformation looped over GLUONTS_MAX_IDLE_TRANSFORMS=100 inputs without returning any output.
This occurred in the following transformation:
gluonts.transform.split.InstanceSplitter(dummy_value=0.0, forecast_start_field="forecast_start", future_length=14, instance_sampler=gluonts.transform.sampler.ExpectedNumInstanceSampler(axis=-1, min_past=0, min_future=14, num_instances=1.0, total_length=1129623519, n=481630), is_pad_field="is_pad", lead_time=0, output_NTC=True, past_length=1149, start_field="start", target_field="target", time_series_fields=["time_feat", "observed_values"])
The problem is that some time series in the dataset are extremely short compared to the average length. As a result, the instance splitter samples no training instances from them, which eventually breaks the logic: after a certain number of idle transformations, an exception is raised. The root cause is that the sampling strategy tries to sample all time points across the dataset uniformly, so each point of a short series is kept with the same (low) probability as any other point, and a short series frequently yields no instance at all on a given pass.
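As a rough back-of-the-envelope illustration (not the actual GluonTS sampler, just the uniform-over-time-points idea, using a hypothetical short series length): with num_instances=1.0, every valid time point is kept with probability of roughly one over the average series length, so a series much shorter than the average produces zero instances on most passes.
# Illustration only; the totals are taken from the InstanceSplitter repr above,
# the short series length is a made-up example.
total_length, n = 1_129_623_519, 481_630
avg_length = total_length / n                  # running average series length, ~2345
p = 1.0 / avg_length                           # per-point keep probability for num_instances=1.0

short_series_length = 107                      # hypothetical short series
min_future = 14                                # last prediction_length points cannot start a window
valid_points = short_series_length - min_future

print(valid_points * p)                        # expected instances per pass: ~0.04
print((1 - p) ** valid_points)                 # probability of an idle pass: ~0.96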
I believe that the exception mechanism was put in place to prevent iterations that are 100% idle, i.e. when all the data is shorter than the method requires. But maybe this can be handled differently, so that no exception is raised when only some of the series are short.
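For reference, based on the error message above, the guard behaves roughly like the sketch below (a simplification, not the actual code in gluonts/transform/_base.py): a counter of consecutive inputs that produced no output, reset whenever something is emitted, raising once it exceeds GLUONTS_MAX_IDLE_TRANSFORMS.
import os
from typing import Callable, Iterable, Iterator

MAX_IDLE_TRANSFORMS = int(os.environ.get("GLUONTS_MAX_IDLE_TRANSFORMS", 100))

def flatmap_with_idle_guard(
    transform: Callable[[dict], Iterable[dict]],
    data_it: Iterable[dict],
) -> Iterator[dict]:
    # Yield transformed entries; fail if too many inputs in a row yield nothing.
    idle = 0
    for entry in data_it:
        produced = False
        for out in transform(entry):
            produced = True
            yield out
        if produced:
            idle = 0          # any series that yields an instance resets the counter
        else:
            idle += 1         # e.g. a short series for which nothing was sampled
            if idle > MAX_IDLE_TRANSFORMS:
                raise Exception(
                    "Reached maximum number of idle transformation calls."
                )
Under this scheme the exception only fires when enough idle entries happen to occur back to back, which is consistent with training running fine for many batches before failing.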
Fixed by #1546
Description
Training on the M4 Daily dataset fails for multiple models provided by GluonTS, namely:
Funnily enough, training always fails after 70 epochs when using a batch size of 32 and 2472 batches per epoch. The issue seems to be related to data transformations.
Note that training works using MQRNN/MQCNN.
To Reproduce
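The original snippet is not preserved in this excerpt; the following is a guessed reconstruction (model choice, context length, and trainer settings are assumptions), training DeepAR on M4 Daily with the default batch size of 32:
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer

dataset = get_dataset("m4_daily")

estimator = DeepAREstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    context_length=4 * dataset.metadata.prediction_length,
    trainer=Trainer(epochs=100),   # default batch size (32) is used
)

predictor = estimator.train(dataset.train)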
And wait until epoch 70...
Error message or code output
Stacktrace for training using any of the models above:
Environment
Python version: 3.8.9
MXNet version: 1.8.0.post0
Full list of dependencies: