awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

DeepAR sequence padding with zeros leads to `nan` loss in Gamma and Beta distributions. #519

Closed AaronSpieler closed 4 years ago

AaronSpieler commented 4 years ago

Description

When training a DeepAREstimator, some of the input is randomly set to 0. This is especially apparent for distributions whose support does not include 0, such as the Beta and Gamma distributions, whose supports are (0, 1) and (0, inf), respectively.

Even though my training and test data lie in the interval [0.4, 0.6], the log probability gets calculated on data like this (printed for the Student-t distribution):

[[0.         0.         0.         ... 0.4895292  0.4855527  0.4815994 ]
 [0.5996926  0.5999265  0.59999985 ... 0.41260046 0.41461775 0.41677222]
 [0.5999126  0.59966487 0.5992571  ... 0.41906032 0.42147845 0.4240227 ]
 ...
 [0.         0.         0.         ... 0.52530396 0.5214071  0.5174759 ]
 [0.5995836  0.5991383  0.5985338  ... 0.42206398 0.42463723 0.4273315 ]
 [0.5694312  0.5664917  0.5634455  ... 0.48193955 0.4858951  0.48987332]]

As we can see, most data points can be found in the generated sinusoid (if one checks), except for the zeros. For distributions like Gamma or Beta this leads to numerical issues when calculating the log-prob; for other models, however, it could still significantly impact performance.
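
To see why the padded zeros are a problem for Gamma, here is a minimal sketch of the Gamma log-density evaluated at zero (plain NumPy/SciPy rather than the GluonTS Gamma class; the parameter values are arbitrary):

import numpy as np
from scipy.special import gammaln

# Gamma log-density: log p(x) = a*log(b) - lgamma(a) + (a - 1)*log(x) - b*x
def gamma_log_prob(x, alpha, beta):
    return alpha * np.log(beta) - gammaln(alpha) + (alpha - 1) * np.log(x) - beta * x

x = np.array([0.0, 0.4895292, 0.5996926])  # first entry mimics a padded zero
print(gamma_log_prob(x, alpha=2.0, beta=2.0))
# the first entry is -inf because of the log(0) term; negating it for the loss gives inf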

To Reproduce

One can see this by setting hybridize=False in the Trainer and then printing x with print(x.asnumpy()) in the log_prob method of the corresponding distribution class.

I tested this, and the error occurs for at least the Student-t, Gamma, and Beta distributions.

First, we create a dataset without any values even close to 0:

import numpy as np

from gluonts.dataset.common import ListDataset

freq = "1H"
pred_length = 48

x = (np.sin(np.linspace(0, 20, 500)) / 10) + 0.5

# train on everything except the final prediction window,
# evaluate on the full series
train_ds = ListDataset(
    [
        {'target': x[:-pred_length], 'start': 0}
    ],
    freq=freq
)

test_ds = ListDataset(
    [
        {'target': x, 'start': 0}
    ],
    freq=freq
)

Then we try to train our DeepAREstimator:

from gluonts.distribution.gamma import GammaOutput
from gluonts.evaluation.backtest import backtest_metrics
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer

estimator = DeepAREstimator(
    prediction_length=pred_length,
    freq=freq,
    distr_output=GammaOutput(),
    trainer=Trainer(learning_rate=0.001, hybridize=False),
    scaling=False,
)

estimator_name = type(estimator).__name__

print(f"evaluating {estimator_name} on Custom")

agg_metrics, item_metrics = backtest_metrics(
    train_dataset=train_ds,
    test_dataset=test_ds,
    forecaster=estimator,
)

Now we modify the log_prob() function to print the matrices:

    def log_prob(self, x: Tensor) -> Tensor:
        # some code ...
        print(x.asnumpy())
        # some code ...
        return ll

Error Message

GluonTSDataError                          Traceback (most recent call last)
<ipython-input-21-2d726dd30ff6> in <module>
     11     train_dataset=train_ds,
     12     test_dataset=test_ds,
---> 13     forecaster=estimator
     14 )

~/WorkingDirectory/gluon-ts/src/gluonts/evaluation/backtest.py in backtest_metrics(train_dataset, test_dataset, forecaster, evaluator, num_samples, logging_file, use_symbol_block_predictor)
    174         serialize_message(logger, estimator_key, forecaster)
    175         assert train_dataset is not None
--> 176         predictor = forecaster.train(train_dataset)
    177 
    178         if isinstance(forecaster, GluonEstimator) and isinstance(

~/WorkingDirectory/gluon-ts/src/gluonts/model/estimator.py in train(self, training_data, validation_data)
    221         self, training_data: Dataset, validation_data: Optional[Dataset] = None
    222     ) -> Predictor:
--> 223         return self.train_model(training_data, validation_data).predictor

~/WorkingDirectory/gluon-ts/src/gluonts/model/estimator.py in train_model(self, training_data, validation_data)
    206             input_names=get_hybrid_forward_input_names(trained_net),
    207             train_iter=training_data_loader,
--> 208             validation_iter=validation_data_loader,
    209         )
    210 

~/WorkingDirectory/gluon-ts/src/gluonts/trainer/_base.py in __call__(self, net, input_names, train_iter, validation_iter)
    295                     )
    296 
--> 297                     epoch_loss = loop(epoch_no, train_iter)
    298                     if is_validation_available:
    299                         epoch_loss = loop(

~/WorkingDirectory/gluon-ts/src/gluonts/trainer/_base.py in loop(epoch_no, batch_iter, is_training)
    274 
    275                     # check and log epoch loss
--> 276                     check_loss_finite(loss_value(epoch_loss))
    277                     logging.info(
    278                         "Epoch[%d] Evaluation metric '%s'=%f",

~/WorkingDirectory/gluon-ts/src/gluonts/trainer/_base.py in check_loss_finite(val)
     48     if not np.isfinite(val):
     49         raise GluonTSDataError(
---> 50             "Encountered invalid loss value! Try reducing the learning rate "
     51             "or try a different likelihood."
     52         )

GluonTSDataError: Encountered invalid loss value! Try reducing the learning rate or try a different likelihood.

jgasthaus commented 4 years ago

The zeros you are seeing in the target and the error you are getting might be unrelated.

By default, DeepAR also samples training instances from the beginning of the sequence, so the initial part of the sampled window may start before any observations are available. This is controlled by the pick_incomplete parameter of the InstanceSplitter, which is True by default:

https://github.com/awslabs/gluon-ts/blob/02131317b676467993df9f8dbea79215eb3eb314/src/gluonts/transform/split.py#L106-L111

In this case, the initial portion (of the target and all time-varying features) is padded with zeros, which seems to be what you are observing here -- this is the intended behavior. However, while the loss function is evaluated on these padded zero values, they are removed from the final loss calculation:

https://github.com/awslabs/gluon-ts/blob/02131317b676467993df9f8dbea79215eb3eb314/src/gluonts/model/deepar/_network.py#L411-L423

The observed_values feature is also zero-padded by the InstanceSplitter, marking the padded target values as unobserved.
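
For illustration, a small sketch (plain NumPy, not the actual InstanceSplitter code) of what such a padded window and its observed_values mask look like:

import numpy as np

# A sampled window of length 8 that starts 3 steps before the series begins:
# the missing prefix of the target is filled with zeros, and observed_values
# is zero there as well, marking those positions as unobserved.
series = np.array([0.52, 0.55, 0.58, 0.60, 0.59])
window_length = 8

pad = window_length - len(series)
past_target = np.concatenate([np.zeros(pad), series])
past_observed = np.concatenate([np.zeros(pad), np.ones(len(series))])

print(past_target)    # [0.   0.   0.   0.52 0.55 0.58 0.6  0.59]
print(past_observed)  # [0.   0.   0.   1.   1.   1.   1.   1.  ]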

You could try modifying DeepAR to use pick_incomplete=False to see if this fixes the problem (in which case there is a bug in the masking code somewhere).

AaronSpieler commented 4 years ago

Setting pick_incomplete=False fixes the problem. As we can see from how the weighted_loss is computed, the calculated loss is weighted with 0 for unobserved values. If the loss is NaN, however (because of log(0), for example), then 0 * NaN = NaN and the weighted loss becomes NaN.

We should fix this by making the padding value depend on the distribution, e.g. padding with 0.5 instead of 0 for Beta and Gamma.
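
A minimal sketch of that failure mode (plain NumPy, not the actual DeepAR loss code):

import numpy as np

loss = np.array([0.9, np.inf, 1.1])    # middle entry: -log_prob at a padded zero (could also be nan)
observed = np.array([1.0, 0.0, 1.0])   # 0 marks the padded, unobserved position

weighted = loss * observed             # -> [0.9, nan, 1.1], because 0 * inf = nan
print(weighted, weighted.mean())       # the epoch loss becomes nan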

lostella commented 4 years ago

It might be better to try improving the masking logic first: one way could be to use where instead of multiplication, see https://mxnet.apache.org/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.where
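
A minimal sketch of that idea, reusing the toy loss and observed arrays from above as MXNet NDArrays: where replaces the unobserved entries with a constant instead of multiplying, so the non-finite value never enters the sum.

import mxnet as mx

loss = mx.nd.array([0.9, float("inf"), 1.1])
observed = mx.nd.array([1.0, 0.0, 1.0])

masked = mx.nd.where(observed, loss, mx.nd.zeros_like(loss))
print(masked.asnumpy())  # [0.9 0.  1.1] -- the non-finite entry is replaced rather than multiplied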

AaronSpieler commented 4 years ago

It might be better to try improving the masking logic first: one way could be to use where instead of multiplication, see https://mxnet.apache.org/api/python/docs/api/ndarray/ndarray.html#mxnet.ndarray.where

Yeah, I absolutely agree, I just thought of that too on my way home.

AaronSpieler commented 4 years ago

This is fixed by https://github.com/awslabs/gluon-ts/pull/534.

StatMixedML commented 4 years ago

@AaronSpieler I was just going through this issue since I am faced with the following problem: I have a lot of time series with different lengths in the train set, so I was wondering if I need to pad them with 0 for them to be of the same length. Is that something I need to do manually using FieldName.IS_PAD and/or FieldName.OBSERVED_VALUES, or is that taken care of automatically? Also, I don't know how to use pick_incomplete.

Would greatly appreciate your help.

lostella commented 4 years ago

@StatMixedML no need to pad your time series: if pick_incomplete = True, the model will be trained by occasionally sampling training windows that partially fall outside (to the left) of your time series, and the initial missing data will be automatically padded; if pick_incomplete = False, then only training windows with genuine data will be sampled to form training batches, and no padding happens.

Hope that clarifies!

StatMixedML commented 4 years ago

@lostella Thanks for clarifying. I am still not sure I entirely understand it. Assume that the prediction length = 18 months. However, most of the time series in the training set have length around 6-8 months, some longer, but all are irregularly spaced. For the DeepAR model I set prediction_length = context_length = 24. I padded the time series that have less than 24 months of observations to meet the prediction_length and context_length. All of them also have different starting times.

So you'd suggest NOT TO PAD the training data and to set context_length = min(train_length)? Or how would you tackle the problem of short and irregularly spaced time series? Sorry for the confusion. Also, how can I set pick_incomplete in the code?

Many thanks!

fernandocamargoai commented 4 years ago

@AaronSpieler, your pull request was merged, but not released yet, right? I'm having possibly the same problem with DeepAR and Negative binomial distribution.

AaronSpieler commented 4 years ago

@AaronSpieler, your pull request was merged, but not released yet, right? I'm having possibly the same problem with DeepAR and Negative binomial distribution.

That's correct. It will be in the v0.5 release.

AaronSpieler commented 4 years ago

@lostella Thanks for clarifying. I am still not sure I entirely understand it. Assume that the prediction length = 18 months. However, most of the time series in the training set have length around 6-8 months, some longer, but all are irregularly spaced. For the DeepAR model I set prediction_length = context_length = 24. I padded the time series that have less than 24 months of observations to meet the prediction_length and context_length. All of them also have different starting times.

So you'd suggest NOT TO PAD the training data and to set context_length = min(train_length)? Or how would you tackle the problem of short and irregularly spaced time series? Sorry for the confusion. Also, how can I set pick_incomplete in the code?

Many thanks!

The irregular spacing of the time series could be a problem. You could try pre-processing steps like filling in missing values with the average of the neighboring ones, or other values that make sense, so that you end up with regularly spaced time series. Aggregation is another alternative.
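
For example, a minimal sketch of that kind of pre-processing with pandas (the dates and the monthly frequency are made up for illustration):

import pandas as pd

# Irregularly spaced observations ...
ts = pd.Series(
    [0.52, 0.55, 0.60],
    index=pd.to_datetime(["2019-01-03", "2019-02-20", "2019-05-11"]),
)

# ... resampled onto a regular monthly grid, with the gaps filled by interpolation.
regular = ts.resample("M").mean().interpolate()
print(regular)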

An even bigger problem is if you want prediction_length = context_length = 24 and your time series is not even 24 steps long, because then you don't even have a proper target. Or do you mean that after removing the last 24 target values you no longer have 24 left? In that case you don't need to pad manually, just set pick_incomplete = True as suggested by @lostella.

I checked, and in the case of DeepAR pick_incomplete = True is the default, so you don't need to do anything in that regard.