awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Missing divisor in NegBin alpha parameter scaling #872

Closed: jsirol closed this issue 4 years ago

jsirol commented 4 years ago

Description

The scaling of the Negative Binomial parameter alpha was altered in #719. It was discussed in that thread that the scaling should be alpha' = alpha + (scale - 1) / (scale * mu). I think this is correct, but in the code it reads alpha = F.broadcast_add(alpha, F.broadcast_div(scale - 1, mu)), so the scale factor appears to be missing from the denominator. I believe this is just a mistake that slipped in?

Also likely related: @MaximilianProll and I noticed that, after bumping the GluonTS version from 0.4.2 to 0.5, our results with DeepAR + NegBin are consistently and significantly worse than with 0.4.2 on the same data, which led us to go through the recent changes. In particular, the scale of the forecasts in 0.5 seems to be off (the forecasts are too large), so the issue described above looked like a possible culprit.

To Reproduce

Unfortunately, I am unable to provide code and data for verification at this time; however, the relevant part of the code is easy to inspect.

Environment

Run on SageMaker using the MXNet estimator.

lostella commented 4 years ago

@jsirol mu is multiplied by scale in the previous line, so I think the implementation agrees with the math, do you agree?
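
To spell out the algebra, here is a minimal plain-Python sketch (not the GluonTS source; the names mu_raw and alpha_raw are illustrative) of why the two consecutive lines reproduce the formula from #719:

# Illustrative values for the unscaled parameters and the data scale
mu_raw, alpha_raw, scale = 3.0, 0.5, 10.0

# Step 1 (previous line in the GluonTS code): mu is re-assigned to the scaled value
mu = mu_raw * scale

# Step 2: the division uses the re-assigned mu, so scale ends up in the denominator
alpha = alpha_raw + (scale - 1) / mu

# This is exactly alpha' = alpha + (scale - 1) / (scale * mu_raw), as discussed in #719
assert abs(alpha - (alpha_raw + (scale - 1) / (scale * mu_raw))) < 1e-12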

lostella commented 4 years ago

@jsirol Regarding the performance degradation: due to the significant change in the scaling of NegativeBinomial, it might be that hyperparameter settings that used to give good results now don’t perform as well (I’m thinking mainly of training settings such as the learning rate or the number of epochs).

Did you run with the same exact settings in 0.4.2 and 0.5.0?

jsirol commented 4 years ago

@lostella Thanks for the swift response! Yes, you are correct! I was missing the re-assignment to mu on the preceding line, so the implementation indeed matches the math 👌

As for the performance degradation and the hyperparameters: we kept the learning rate and the number of epochs the same between 0.4.2 and 0.5.0 (1e-3 and 100, respectively). We have tried multiple model architectures (varying cell type, number of cells, and number of layers) for both versions, and trained and evaluated the models over several time periods. The loss and the other metrics we measure are fairly consistent across models and time periods within each GluonTS version, but on a different scale between versions, e.g. MAE is 2-4x worse with 0.5 than what we achieved with 0.4.2. We are going to dig a bit deeper to see whether anything can be done regarding model convergence and the related hyperparameters.

lostella commented 4 years ago

@jsirol are you running on GPU and with hybridize=True?

jsirol commented 4 years ago

@lostella Running on CPU with hybridize=True. Below is an example of the estimator call we use. As mentioned in the previous comment, we have mostly experimented with the Estimator parameters, such as the number of layers, and not touched the Trainer part (except for the number of epochs), as these values seemed to work fine.

from gluonts.model.deepar import DeepAREstimator
from gluonts.distribution import NegativeBinomialOutput
from gluonts.trainer import Trainer

estimator = DeepAREstimator(
    prediction_length=336,
    context_length=336,
    num_layers=1,
    num_cells=32,
    cell_type="gru",
    distr_output=NegativeBinomialOutput(),
    freq="1H",
    trainer=Trainer(
        ctx="cpu",
        epochs=100,
        learning_rate=1e-3,
        num_batches_per_epoch=100,
        batch_size=32,
    ),
)
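
For completeness, here is a minimal sketch of how such an estimator would typically be trained and used for prediction (the ListDataset below is illustrative dummy data, not our actual dataset):

from gluonts.dataset.common import ListDataset

# Illustrative hourly count series, long enough for context_length + prediction_length
train_ds = ListDataset(
    [{"start": "2020-01-01 00:00:00", "target": [0, 1, 2, 3] * 200}],
    freq="1H",
)

# Train the estimator defined above and obtain a predictor
predictor = estimator.train(training_data=train_ds)

# Generate probabilistic forecasts for the next prediction_length steps
forecasts = list(predictor.predict(train_ds))
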
jsirol commented 4 years ago

Closing this issue, since the original question about the implementation was answered and there appears to be no bug.

Regarding the performance degradation, we verified again that the same Estimator/Trainer settings that were used and optimised for 0.4.2 give clearly worse results on 0.5 with our data. We will try to find a combination of hyperparameters for 0.5 that yields results comparable to 0.4.2.