awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

NaN losses lead to overfitting #1796

Open sevstafiev opened 2 years ago

sevstafiev commented 2 years ago

Description

I am working with a retail store dataset of roughly 3000 time series. Their peculiarity is that there are "special days" on which the number of sales spikes sharply (on average 1-2 sales per day, but around 80 on a sale day). The problem is that as the number of epochs increases, the model eventually starts producing NaN losses, overfits, and too often produces a forecast as if every day were a "special day".

Having studied the existing issues, I decided that adding a validation dataset would help with the overfitting, since, as I understood it, validation also comes with a built-in stopping mechanism. For validation I take, for each time series, the last 60 days of the training range (matching the context_length=60 parameter) plus the 28 days after it (matching the prediction_length=28 parameter); the 28 days after the validation range are used as the test set (see the sketch after this paragraph). This really did help me track when the quality of the model improves or degrades, but over time the model still starts producing NaN losses and the validation quality drops significantly, so the problem is not solved. Moreover, the model does not stop training: it runs through all remaining batches and epochs emitting NaN losses, and after training it produces excessively high predictions.
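A minimal sketch of one way to build such a split and pass it to the estimator (the MXNet estimators' train() accepts an optional validation_data argument); series_list here is a hypothetical list of pandas Series, one per item, and deepar_estimator is the estimator defined below:

from gluonts.dataset.common import ListDataset

# train: everything except the last 56 days of each (hypothetical) series
train_ds = ListDataset(
    [{"start": s.index[0], "target": s.values[:-56]} for s in series_list],
    freq="D",
)
# validation: 28 more days than train, so the last context_length + prediction_length
# points of each series serve as the validation window
val_ds = ListDataset(
    [{"start": s.index[0], "target": s.values[:-28]} for s in series_list],
    freq="D",
)
# the final 28 days of each series stay held out as the test range

predictor = deepar_estimator.train(training_data=train_ds, validation_data=val_ds)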

In issue https://github.com/awslabs/gluon-ts/issues/833 the author was able to track down the problem by putting a conditional breakpoint in log_prob() that stops whenever a NaN value is generated. But I did not understand how this can be done, or how to get at log_prob() at all. If you can tell me how to do this, that would also be a good solution to the problem.
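For reference, one way to get such a breakpoint is to wrap the distribution's log_prob at runtime. This is only a debugging sketch, not an official GluonTS hook; it assumes the MXNet NegativeBinomial class and hybridize=False, so that outputs are eager NDArrays that can be inspected:

import pdb

import numpy as np
from gluonts.mx.distribution.neg_binomial import NegativeBinomial

_original_log_prob = NegativeBinomial.log_prob

def log_prob_with_nan_check(self, x):
    out = _original_log_prob(self, x)
    # with hybridize=False the result is an NDArray that can be checked eagerly
    if np.isnan(out.asnumpy()).any():
        pdb.set_trace()  # inspect x and the distribution parameters here
    return out

NegativeBinomial.log_prob = log_prob_with_nan_check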

Since the model is refit each time a new chunk of data is added, it is impossible to guess the optimal number of epochs in advance. With only a few epochs (1-3) the model underfits and the quality is poor, although the fewer the epochs, the lower the chance that a NaN loss occurs. With a large number of epochs (20), NaN losses are guaranteed to occur and the predictions turn out badly.

To Reproduce

trainer = Trainer(
    ctx=device,
    epochs=20,
    learning_rate_decay_factor=0.5,
    patience=3,
    minimum_learning_rate=0.001,
    clip_gradient=1.0,
    weight_decay=1e-08,
    learning_rate=0.01,
    hybridize=False,  # True changed nothing but training speed
    batch_size=32,
)

deepar_estimator = DeepAREstimator(
    freq="D", 
    prediction_length=h,
    trainer=trainer,
    context_length=60, 
    num_layers=2,
    num_cells=100,
    cell_type="lstm",
    dropout_rate=0.1,
    use_feat_dynamic_real=True,
    use_feat_static_cat=True,
    cardinality=cardinality,
    distr_output=NegativeBinomialOutput(),
)

Error message or code output

(validation_avg_epoch_loss=0.363)

 0%|          | 0/50 [00:00<?, ?it/s]
 40%|████      | 20/50 [00:10<00:15,  1.99it/s, epoch=3/20, avg_epoch_loss=0.485]
100%|██████████| 50/50 [00:23<00:00,  2.09it/s, epoch=3/20, avg_epoch_loss=0.442]

0it [00:00, ?it/s]
49it [00:10,  4.88it/s, epoch=3/20, validation_avg_epoch_loss=0.327]
123it [00:24,  4.93it/s, epoch=3/20, validation_avg_epoch_loss=0.363]

after some time

  0%|          | 0/50 [00:00<?, ?it/s]
 50%|█████     | 25/50 [00:10<00:10,  2.48it/s, epoch=4/20, avg_epoch_loss=0.522]
WARNING:gluonts.trainer:Batch [46] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [47] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [49] of Epoch[3] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [50] of Epoch[3] gave NaN loss and it will be ignored

2 epochs later (avg_epoch_loss=0.766)

 92%|█████████▏| 46/50 [00:10<00:00,  4.57it/s, epoch=6/20, avg_epoch_loss=0.766]
WARNING:gluonts.trainer:Batch [47] of Epoch[5] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [48] of Epoch[5] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [49] of Epoch[5] gave NaN loss and it will be ignored
WARNING:gluonts.trainer:Batch [50] of Epoch[5] gave NaN loss and it will be ignored
100%|██████████| 50/50 [00:10<00:00,  4.63it/s, epoch=6/20, avg_epoch_loss=0.766]

Environment

mbohlkeschneider commented 2 years ago

Hi @sevstafiev,

Thank you for raising this issue. One thing I noticed from your snippet is that the learning rate is rather high. I would suggest reducing it to 0.001 or even 0.0001; this often helps with the NaN loss problem.
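For illustration, that suggestion applied to the Trainer from the report might look as follows; note that the original minimum_learning_rate=0.001 would otherwise equal the new learning rate, so it is lowered here as well (the exact values are just an example):

from gluonts.mx.trainer import Trainer

trainer = Trainer(
    ctx=device,
    epochs=20,
    learning_rate=0.001,          # was 0.01
    minimum_learning_rate=1e-5,   # was 0.001
    learning_rate_decay_factor=0.5,
    patience=3,
    clip_gradient=1.0,
    weight_decay=1e-08,
    hybridize=False,
    batch_size=32,
)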

Alit10 commented 2 years ago

Hello, in issue #833 the problem was solved by using the PyTorch implementation, so you could try that. I had the same issue, and switching to a different output distribution, such as StudentTOutput, also solved the NaN problem for me.
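As a rough sketch of that workaround, keeping the rest of the estimator from the report unchanged (h, trainer and cardinality as defined there), the swap would be:

from gluonts.mx.distribution import StudentTOutput
from gluonts.model.deepar import DeepAREstimator

deepar_estimator = DeepAREstimator(
    freq="D",
    prediction_length=h,
    context_length=60,
    trainer=trainer,
    use_feat_dynamic_real=True,
    use_feat_static_cat=True,
    cardinality=cardinality,
    distr_output=StudentTOutput(),  # instead of NegativeBinomialOutput()
)

Note that the Student's t distribution is continuous and not restricted to non-negative counts, so this is a workaround rather than a like-for-like replacement for count data.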

lostella commented 2 years ago

The following snippet appears to reproduce the issue quite consistently:

import pandas as pd
import numpy as np

def first_sunday_of_month_sale():
    idx = pd.date_range(start="2021-01-01", periods=365, freq="D")
    data = [np.random.randint(65, 95) if ts.weekday() == 6 and ts.day <= 7 else 0 for ts in idx]
    return pd.Series(data, index=idx)

series = first_sunday_of_month_sale()

from gluonts.dataset.common import ListDataset
from gluonts.mx.distribution import NegativeBinomialOutput
from gluonts.model.deepar import DeepAREstimator

dataset = ListDataset(
    data_iter=[{"start": series.index[0], "target": series.values}],
    freq="D",
)

deepar_estimator = DeepAREstimator(
    freq="D", 
    prediction_length=15,
    context_length=60, 
    distr_output=NegativeBinomialOutput(),
)

predictor = deepar_estimator.train(dataset)

Example output:

100%|██████████| 50/50 [00:04<00:00, 11.26it/s, epoch=1/100, avg_epoch_loss=1.25]
100%|██████████| 50/50 [00:04<00:00, 11.67it/s, epoch=2/100, avg_epoch_loss=0.35]
100%|██████████| 50/50 [00:03<00:00, 12.53it/s, epoch=3/100, avg_epoch_loss=0.298]
100%|██████████| 50/50 [00:03<00:00, 12.68it/s, epoch=4/100, avg_epoch_loss=0.283]
100%|██████████| 50/50 [00:03<00:00, 12.59it/s, epoch=5/100, avg_epoch_loss=0.274]
100%|██████████| 50/50 [00:04<00:00, 12.46it/s, epoch=6/100, avg_epoch_loss=0.273]
100%|██████████| 50/50 [00:03<00:00, 12.98it/s, epoch=7/100, avg_epoch_loss=0.265]
100%|██████████| 50/50 [00:03<00:00, 12.80it/s, epoch=8/100, avg_epoch_loss=0.262]
100%|██████████| 50/50 [00:03<00:00, 12.83it/s, epoch=9/100, avg_epoch_loss=0.253]
100%|██████████| 50/50 [00:03<00:00, 13.00it/s, epoch=10/100, avg_epoch_loss=0.25]
100%|██████████| 50/50 [00:04<00:00, 11.89it/s, epoch=11/100, avg_epoch_loss=0.241]
100%|██████████| 50/50 [00:03<00:00, 13.03it/s, epoch=12/100, avg_epoch_loss=0.236]
100%|██████████| 50/50 [00:03<00:00, 12.72it/s, epoch=13/100, avg_epoch_loss=0.226]
  0%|          | 0/50 [00:00<?, ?it/s]Batch [2] of Epoch[13] gave NaN loss and it will be ignored
Batch [6] of Epoch[13] gave NaN loss and it will be ignored
Batch [9] of Epoch[13] gave NaN loss and it will be ignored
Batch [11] of Epoch[13] gave NaN loss and it will be ignored
[...]

Edit: The PyTorch implementation indeed gives much more meaningful results:

from gluonts.torch.model.deepar import DeepAREstimator as TorchDeepAREstimator
from gluonts.torch.modules.distribution_output import NegativeBinomialOutput as TorchNegativeBinomialOutput

torch_deepar_estimator = TorchDeepAREstimator(
    freq="D", 
    prediction_length=15,
    context_length=60, 
    distr_output=TorchNegativeBinomialOutput(),
    trainer_kwargs=dict(max_epochs=10)
)

torch_predictor = torch_deepar_estimator.train(dataset)

Example result:

forecasts = list(torch_predictor.predict(dataset))
forecasts[0].plot()

[forecast plot: the predicted spike falls on the second Sunday of the month]

The prediction is not perfect (the spike is on the second Sunday of the month, while training data displayed spikes on the first Sunday of the month) but definitely makes sense.

lostella commented 2 years ago

I think this suggests that the MXNet-based NegativeBinomial implementation has some problems, possibly with the way it is parametrized. Two things that could solve this:

  1. Fix the NegativeBinomialOutput class to output parameters in the right range (I'm not sure alpha should be allowed to be any positive value, and why should mu not be allowed to be zero?).
  2. Update NegativeBinomial to use a different parametrization based on the failure count and logits, like the one from PyTorch (see the sketch below).
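For context, a sketch of the parametrization in point 2, assuming the MXNet output produces a mean mu and a shape alpha: with a failure count total_count = 1/alpha and logits = log(mu * alpha), the PyTorch distribution keeps the same mean mu while sidestepping the degenerate corner cases:

import torch
from torch.distributions import NegativeBinomial

def neg_bin_from_mu_alpha(mu: torch.Tensor, alpha: torch.Tensor) -> NegativeBinomial:
    # Map the (mean, shape) parametrization to PyTorch's (total_count, logits):
    # with r = total_count = 1 / alpha and logits = log(mu / r) = log(mu * alpha),
    # the resulting distribution has mean r * p / (1 - p) = mu.
    total_count = 1.0 / alpha
    logits = torch.log(mu * alpha)
    return NegativeBinomial(total_count=total_count, logits=logits)
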
lostella commented 2 years ago

@sevstafiev apologies for the late intervention here, the problem may have been fixed with #1893: feel free to try again on your data using the code from the master branch, I'd be curious to see whether that was the issue here.