awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Not able to achieve Reproducible training and results #1040

Closed Hari-pyt closed 3 years ago

Hari-pyt commented 4 years ago

Hi,

import numpy as np
import mxnet as mx
np.random.seed(7)
mx.random.seed(7)

I trained a model after setting these seeds and recorded the evaluation metrics. When I restart the Jupyter notebook and train the model again with the same parameters and data, I get a different result from the first run. Am I missing something needed to get reproducible results?

Thanks in advance
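
For reference, reproducible runs generally require seeding every RNG the pipeline draws from. A minimal sketch, assuming only Python's random, numpy, and mxnet are involved:

import random
import numpy as np
import mxnet as mx

# seed all three RNG sources before building or training the model
random.seed(7)
np.random.seed(7)
mx.random.seed(7)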

GabrielDeza commented 3 years ago

I am getting the exact same issue. Once I train my model, I run it through two test datasets: the second is a slightly modified version of the first. I noticed that when the modification factor is zero (i.e. both test datasets are exactly the same), I still get different predictions. Here is a plot of the (identical) inputs and the resulting different predictions:

[screenshot: identical inputs, two differing forecasts]

I was looking through make_evaluation_predictions but I can't find any source of randomness that isn't covered by the numpy and mxnet random seeds.
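
One way to test whether the gap comes from sampling noise at prediction time is to re-seed immediately before each forecast and compare the raw sample paths. A sketch, assuming predictor and test_ds stand in for the objects from the setup above (both names are placeholders):

import numpy as np
import mxnet as mx

def forecast_with_seed(predictor, dataset, seed=7):
    # re-seeding right before predicting makes both calls draw
    # the same sample paths
    mx.random.seed(seed)
    np.random.seed(seed)
    return list(predictor.predict(dataset))

f1 = forecast_with_seed(predictor, test_ds)
f2 = forecast_with_seed(predictor, test_ds)
print(np.allclose(f1[0].samples, f2[0].samples))  # True if sampling noise was the cause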

lostella commented 3 years ago

@Hari7696 @GabrielDeza are you using multiprocessing or GPU?

Without multiprocessing, and running on CPU, I always get the same results (i.e. the same backtesting accuracy metrics) with the following code:

import numpy as np
import mxnet as mx
np.random.seed(7)
mx.random.seed(7)

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.evaluation import Evaluator

dataset = get_dataset("m4_hourly")

estimator = DeepAREstimator(
    freq="H", prediction_length=6, trainer=Trainer(epochs=3, learning_rate=1e-3)
)

predictor = estimator.train(dataset.train, num_workers=None)

forecast_it, ts_it = make_evaluation_predictions(
    dataset=dataset.test,
    predictor=predictor,
    num_samples=100,
)

forecasts = list(forecast_it)
tss = list(ts_it)

evaluator = Evaluator(quantiles=[0.1, 0.5, 0.9])

agg_metrics, _ = evaluator(iter(tss), iter(forecasts), num_series=len(dataset.test))

import pprint
pprint.pprint(agg_metrics)

As soon as I set num_workers=2, results start changing from one run to the next. This is expected, since the OS scheduler is non-deterministic and this will affect the composition of training batches during training. I think GPU computations can cause similar issues.
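
As an aside, MXNet keeps a separate RNG state per device, so on GPU the device seed matters as well. A sketch, assuming a recent MXNet version where seed accepts a ctx argument:

import mxnet as mx

# seed the RNGs on all devices (CPU and any GPUs), not just the default context
mx.random.seed(7, ctx="all")

Note that even with seeded device RNGs, some GPU kernels are not deterministic, so bitwise-identical results on GPU are not guaranteed.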

Hari-pyt commented 3 years ago

@lostella Thank you for the response. I didn't use multiprocessing or GPUs. I trained once, but forecasted twice with the same input data, and the results are not the same.

from gluonts.dataset.common import ListDataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer
import pandas as pd
import numpy as np
import mxnet as mx
mx.random.seed(7)
np.random.seed(7)

# data generation
data = np.random.randint(1, 100, 10000)
start_date = pd.to_datetime('2020-08-30')
train_data = data[:9900]
test_data = data[-100:]

# train dataset
train_data = [{'start': start_date, 'target': train_data}]
train_ds = ListDataset(train_data, freq='1 min')

estimator = DeepAREstimator(freq='1 min', context_length=100, prediction_length=100,
                            trainer=Trainer(epochs=3, batch_size=4, learning_rate=0.01))
predictor = estimator.train(train_ds, num_workers=None)

# first forecast
preds_gen = predictor.predict(train_ds)
predictions = list(preds_gen)[0].mean
print("absolute error mean", abs(predictions - test_data).mean())
# absolute error mean 26.631500854492188

# second forecast with same data
preds_gen = predictor.predict(train_ds)
predictions = list(preds_gen)[0].mean
print("absolute error mean", abs(predictions - test_data).mean())
# absolute error mean 26.794510917663573

lostella commented 3 years ago

@Hari7696 I see — in this case some difference is expected, since the model draws a certain number of sample paths at prediction time. You can check that this difference disappears if you set the seeds right before predicting (both before the first and the second predictions). That is, doing the following twice should give the same result:

mx.random.seed(7)
np.random.seed(7)
preds_gen = predictor.predict(train_ds)
predictions = list(preds_gen)[0].mean
print("absolute error mean", abs(predictions - test_data).mean())
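
Alternatively, if exact repeatability is not required, the run-to-run spread of the sample mean can be reduced by drawing more sample paths per series. A sketch, assuming the predictor accepts num_samples the way make_evaluation_predictions does:

preds_gen = predictor.predict(train_ds, num_samples=1000)  # default is 100
predictions = list(preds_gen)[0].mean
print("absolute error mean", abs(predictions - test_data).mean())
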
Hari-pyt commented 3 years ago

@lostella This resolved the problem I was facing. Thank you.