awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Why can't DeepAR predict the next value of a random walk given exact knowledge? #488

debackerl closed 4 years ago

debackerl commented 4 years ago

Hello,

To practice with GluonTS, I've built a synthetic dataset to train a simple DeepAR model. Basically, I generate a random walk where each step is drawn from a uniform distribution between -1 and +1.

The walk is 10,000 steps long, and I train a DeepAR model with a single layer and 5 cells. Each step is provided to the model as a dynamic feature, so there is no uncertainty left. Yet the model performs poorly.

import pandas as pd
import numpy as np
import mxnet as mx
from gluonts.dataset.common import ListDataset
from gluonts.dataset.field_names import FieldName
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer

t0 = pd.Timestamp(year=2000, month=1, day=1, freq='B')
terms = np.random.rand(10000) * 2.0 - 1.0
walk = np.cumsum(terms)
ctx = mx.gpu()

context_length, prediction_length = 1, 1

# At time t, model knows previous value at t-1, and new term/increment at time t, giving full information
train_ds = ListDataset([{FieldName.START: t0, FieldName.TARGET: walk, FieldName.FEAT_DYNAMIC_REAL: [terms]}], freq=t0.freq)
trainer = Trainer(ctx=ctx, epochs=100, batch_size=128, num_batches_per_epoch=50)
estimator = DeepAREstimator(freq='B', num_layers=1, num_cells=5, trainer=trainer, context_length=context_length, prediction_length=prediction_length, use_feat_dynamic_real=True)
predictor = estimator.train(training_data=train_ds)

# test is a subset of train set, so I don't even test generalization, I simply test learning
test_ds = ListDataset([{FieldName.START: t0 + t*t0.freq, FieldName.TARGET: walk[t-estimator.history_length:t+prediction_length], FieldName.FEAT_DYNAMIC_REAL: [terms[t-estimator.history_length:t+prediction_length]]} for t in range(1000, 2000)], freq=t0.freq)
forecast_it, ts_it = make_evaluation_predictions(dataset=test_ds, predictor=predictor, num_samples=100)
evaluator = Evaluator(quantiles=[0.1, 0.5, 0.9])
agg_metrics, series_metrics = evaluator(ts_it, forecast_it, num_series=len(test_ds))
print(agg_metrics)

This gives me

{'MSE': 0.2924841361204017, 'abs_error': 433.68280267715454, 'abs_target_sum': 23014.57519197464, 'abs_target_mean': 23.01457519197464, 'seasonal_error': 1.0600714333852133, 'MASE': 0.631015149464367, 'sMAPE': 0.02015682482771147, 'MSIS': 21.265574864578436, 'QuantileLoss[0.1]': 493.3332794189453, 'Coverage[0.1]': 0.653, 'QuantileLoss[0.5]': 433.68280267715454, 'Coverage[0.5]': 0.693, 'QuantileLoss[0.9]': 260.0980569839478, 'Coverage[0.9]': 0.731, 'RMSE': 0.5408180249588596, 'NRMSE': 0.023498935802536428, 'ND': 0.01884383261735733, 'wQuantileLoss[0.1]': 0.021435689136290226, 'wQuantileLoss[0.5]': 0.01884383261735733, 'wQuantileLoss[0.9]': 0.011301449399537299, 'mean_wQuantileLoss': 0.017193657051061618, 'MAE_Coverage': 0.305}

The MASE and RMSE seem pretty high to me. They should be as close to zero as possible given that the model has full knowledge.
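
As a rough sanity check on what "close to zero" means here, the sketch below compares two reference predictors on the same kind of random walk: an oracle that adds the known increment to the previous value, and a naive forecast that simply repeats the previous value. The helper name `baseline_rmse` and the seed are illustrative choices, not part of GluonTS.

```python
import numpy as np

def baseline_rmse(n_steps=10000, seed=0):
    """Return (oracle_rmse, naive_rmse) for a uniform(-1, 1) random walk."""
    rng = np.random.default_rng(seed)
    terms = rng.uniform(-1.0, 1.0, size=n_steps)
    walk = np.cumsum(terms)

    target = walk[1:]
    # Oracle: previous value plus the known increment.
    oracle = walk[:-1] + terms[1:]
    # Naive: repeat the previous value and ignore the increment.
    naive = walk[:-1]

    def rmse(pred):
        return float(np.sqrt(np.mean((target - pred) ** 2)))

    return rmse(oracle), rmse(naive)

oracle_rmse, naive_rmse = baseline_rmse()
print(oracle_rmse, naive_rmse)
```

The naive RMSE lands near 1/sqrt(3) ≈ 0.577, the standard deviation of a single uniform(-1, 1) step. An RMSE of about 0.54 would therefore be consistent with a model that is barely using the increment feature.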

Did I forget anything? I also tried doubling the number of layers: the final training loss was smaller, but the MASE rose to 3.17 and the RMSE to 2.50.

Thank you! :-) Laurent

davidlkl commented 4 years ago

Try setting lags_seq = [1] and test again?

By default it uses a sequence of lagged target values, which is unnecessary for your dataset.
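
To make the idea of lagged target values concrete, here is a minimal NumPy sketch of how a lag matrix could be built. The helper `lagged_features` and the lag list `[1, 5, 21]` are illustrative assumptions; they are not GluonTS internals and not necessarily its default lags.

```python
import numpy as np

def lagged_features(series, lags):
    """Stack series[t - lag] for each lag, for every t with full history."""
    series = np.asarray(series, dtype=float)
    max_lag = max(lags)
    # Row i corresponds to time t = max_lag + i.
    return np.stack(
        [series[max_lag - lag : len(series) - lag] for lag in lags], axis=1
    )

walk = np.cumsum(np.random.default_rng(0).uniform(-1.0, 1.0, size=100))

only_previous = lagged_features(walk, [1])      # one column: walk[t - 1]
many_lags = lagged_features(walk, [1, 5, 21])   # three columns, fewer usable rows
print(only_previous.shape, many_lags.shape)
```

For a random walk with a known increment, only the lag-1 column says anything about the next value; the extra columns are inputs the network has to learn to ignore.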

mbohlkeschneider commented 4 years ago

Hi @debackerl,

thank you for the interesting experiment. @davidlkl is correct: the lagged target values will probably hurt in this case.

In fact, I think there are a couple of model choices in DeepAR that do not go well with this kind of problem.

  1. The lags are chosen to model seasonality. The random walk will not have a predictable seasonality pattern to learn.
  2. We also use additional features (like day of the week) that are unnecessary in this setup and might hinder learning.
  3. The probabilistic output might also not go well with such a deterministic learning problem.
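
As a hedged illustration of point 3, the NumPy sketch below assumes (hypothetically) that the predicted distribution is a Laplace centred exactly on the true value, with a small but nonzero scale. A point forecast taken as the mean of a finite number of samples then still carries some error. The helper `sample_mean_rmse` and the scale of 0.1 are made up for this example.

```python
import numpy as np

def sample_mean_rmse(scale, num_samples, num_forecasts=2000, seed=0):
    """RMSE of a sample-mean point forecast when the true target is exactly 0."""
    rng = np.random.default_rng(seed)
    # Each row is one forecast: a cloud of samples centred on the correct value.
    samples = rng.laplace(loc=0.0, scale=scale, size=(num_forecasts, num_samples))
    point_forecasts = samples.mean(axis=1)
    return float(np.sqrt(np.mean(point_forecasts ** 2)))

rmse_100 = sample_mean_rmse(scale=0.1, num_samples=100)
rmse_1000 = sample_mean_rmse(scale=0.1, num_samples=1000)
print(rmse_100, rmse_1000)
```

The error shrinks roughly with the square root of the number of samples, but it only vanishes if the predicted scale itself collapses to zero.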

I did the following:

I removed the default time features in the DeepAR estimator by commenting out the corresponding transformations in its source:

                #AddTimeFeatures(
                #    start_field=FieldName.START,
                #    target_field=FieldName.TARGET,
                #    output_field=FieldName.FEAT_TIME,
                #    time_features=self.time_features,
                #    pred_length=self.prediction_length,
                #),
                #AddAgeFeature(
                #    target_field=FieldName.TARGET,
                #    output_field=FieldName.FEAT_AGE,
                #    pred_length=self.prediction_length,
                #    log_scale=True,
                #    dtype=self.dtype,
                #),
                #VstackFeatures(
                #    output_field=FieldName.FEAT_TIME,
                #    input_fields=[FieldName.FEAT_TIME, FieldName.FEAT_AGE]
                #    + (
                #        [FieldName.FEAT_DYNAMIC_REAL]
                #        if self.use_feat_dynamic_real
                #        else []
                #    ),
                #),
                VstackFeatures(
                   output_field=FieldName.FEAT_TIME,
                   input_fields=[FieldName.FEAT_DYNAMIC_REAL]),

And I used this snippet:

import pandas as pd
import numpy as np
import mxnet as mx
from gluonts.dataset.common import ListDataset
from gluonts.dataset.field_names import FieldName
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer
from gluonts.distribution.laplace import LaplaceOutput

t0 = pd.Timestamp(year=2000, month=1, day=1, freq='B')
terms = np.random.rand(10000) * 2.0 - 1.0
walk = np.cumsum(terms)
ctx = mx.cpu()
#terms = np.roll(terms, 1)

context_length, prediction_length = 1, 1

# At time t, model knows previous value at t-1, and new term/increment at time t, giving full information
train_ds = ListDataset([{FieldName.START: t0, FieldName.TARGET: walk, FieldName.FEAT_DYNAMIC_REAL: [terms]}], freq=t0.freq)
trainer = Trainer(ctx=ctx, epochs=200, batch_size=16, num_batches_per_epoch=50)
estimator = DeepAREstimator(freq='B', num_layers=1, num_cells=5, trainer=trainer, context_length=context_length, prediction_length=prediction_length, use_feat_dynamic_real=True, lags_seq=[1])
predictor = estimator.train(training_data=train_ds)

# test is a subset of train set, so I don't even test generalization, I simply test learning
test_ds = ListDataset([{FieldName.START: t0 + t*t0.freq, FieldName.TARGET: walk[t-estimator.history_length:t+prediction_length], FieldName.FEAT_DYNAMIC_REAL: [terms[t-estimator.history_length:t+prediction_length]]} for t in range(1000, 2000)], freq=t0.freq)
forecast_it, ts_it = make_evaluation_predictions(dataset=test_ds, predictor=predictor, num_samples=1000)
evaluator = Evaluator(quantiles=[0.1, 0.5, 0.9])
agg_metrics, series_metrics = evaluator(ts_it, forecast_it, num_series=len(test_ds))
print(agg_metrics)

which gives me this result: {'MSE': 0.0026946234173055926, 'abs_error': 41.87844657897949, 'abs_target_sum': 22620.477743148804, 'abs_target_mean': 22.620477743148804, 'seasonal_error': 0.48385462474823, 'MASE': 0.24065510051029348, 'sMAPE': 0.0019129342584074318, 'MSIS': 1.343544394708052, 'QuantileLoss[0.1]': 16.50373649597168, 'Coverage[0.1]': 0.016, 'QuantileLoss[0.5]': 41.87844657897949, 'Coverage[0.5]': 0.596, 'QuantileLoss[0.9]': 19.33512268066406, 'Coverage[0.9]': 0.956, 'RMSE': 0.051909762254373625, 'NRMSE': 0.0022948128171207983, 'ND': 0.0018513511100208067, 'wQuantileLoss[0.1]': 0.0007295927470395829, 'wQuantileLoss[0.5]': 0.0018513511100208067, 'wQuantileLoss[0.9]': 0.0008547619064553224, 'mean_wQuantileLoss': 0.0011452352545052375, 'MAE_Coverage': 0.07866666666666665}

Note that the MASE in our implementation is the seasonal MASE and therefore not very meaningful for this experiment. MSE and RMSE look low enough to me.
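
For readers unfamiliar with the seasonal variant, here is a small sketch of the usual seasonal MASE definition: the forecast's mean absolute error divided by the in-sample error of a seasonal-naive forecast. It is not the GluonTS `Evaluator` code; the helper `seasonal_mase` and the season lengths 1 and 5 are assumptions chosen for illustration.

```python
import numpy as np

def seasonal_mase(past, target, forecast, season_length):
    """MAE of the forecast scaled by the in-sample seasonal-naive MAE."""
    past = np.asarray(past, dtype=float)
    target = np.asarray(target, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    # Error of predicting past[t] with past[t - season_length].
    seasonal_error = np.mean(np.abs(past[season_length:] - past[:-season_length]))
    return float(np.mean(np.abs(target - forecast)) / seasonal_error)

rng = np.random.default_rng(0)
walk = np.cumsum(rng.uniform(-1.0, 1.0, size=1001))
past, target = walk[:-1], walk[-1:]
naive = past[-1:]  # repeat the last observed value

mase_m1 = seasonal_mase(past, target, naive, season_length=1)
mase_m5 = seasonal_mase(past, target, naive, season_length=5)
print(mase_m1, mase_m5)
```

On a random walk, a longer season length inflates the denominator, so the same forecast receives a smaller MASE. That dependence on the assumed seasonality is why the ratio is hard to interpret in this experiment.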

I think this is a classic No Free Lunch example: the model choices in DeepAR are not great for this synthetic data, and a model that works well on this synthetic data would likely not do well on real data :-).

Hope that helps.

debackerl commented 4 years ago

Awesome, thank you both! Working with a synthetic dataset first lets me check whether I understand the API and the model.

I had assumed that DeepAR would also be a strong model for processes that are highly correlated with many dynamic features, even when seasonality is weak. I will continue my experiments :-)