awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Predict probability distribution on unseen data #763

Open robsannaa opened 4 years ago

robsannaa commented 4 years ago

I am trying to implement a probabilistic model at my company using gluon-ts.

At the moment I am experimenting with the Python API; in the future, I plan to move my codebase to SageMaker.

Following the example from the readme file, it is not clear to me how to predict unseen data: the example creates a test set, predicts values, and compares them with the real values.

However, how can I predict unseen data? For example, using the data from the readme file, how can I use the predictor object to predict values that are, of course, not in the training dataset?

I have tried to train a DeepAREstimator using a custom dataset containing daily data up to 2019-10-14, with a prediction_length of 90 days.

Then I created a test_data dataset using the following snippet:

test_data = ListDataset(
    [{"start": df.index[0], "target": df.Total[:pd.to_datetime("2019-10-14")]}],
    freq="1D",
)

My goal is to predict 90 days into the future, which I try to achieve with this snippet:

from itertools import islice

import matplotlib.pyplot as plt

from gluonts.evaluation.backtest import make_evaluation_predictions

def plot_forecasts(tss, forecasts, past_length, num_plots):

    for target, forecast in islice(zip(tss, forecasts), num_plots):

        ax = target[-past_length:].plot(figsize=(15, 4), linewidth=2)
        forecast.plot(color='g')

        plt.grid(which='both')
        plt.legend(["observations", "median prediction", "90% confidence interval", "50% confidence interval"])
        plt.show()

forecast_it, ts_it = make_evaluation_predictions(test_data, predictor=predictor, num_samples=100)

forecasts = list(forecast_it)
tss = list(ts_it)

plot_forecasts(tss, forecasts, past_length=200, num_plots=5)

However, the predictions seem to be rolled back by 90 days instead of extending 90 days into the future:

[plot: the forecast band overlaps the final 90 days of observations rather than extending past 2019-10-14]

It seems there is something I am missing about predicting on unseen data or using a predictor to predict the probability distribution of new data.

What am I missing?

sayonsom commented 4 years ago

I second @robertosannazzaro's question. The documentation does not clearly specify how to make predictions on unseen data. I also started learning Gluon-TS just two days ago. I am experimenting with a heuristic for predicting future dates; it looks like this so far:

prediction_intervals = (50.0, 90.0)
legend = ["observations", "median prediction"] + [
    f"{k}% prediction interval" for k in prediction_intervals
][::-1]

fig, ax = plt.subplots(1, 1, figsize=(10, 7))
tss[0][-samples_per_day * num_of_days:].plot(ax=ax)  # plot the time series
predictions[0].plot(prediction_intervals=prediction_intervals, color='g')
plt.grid(which="both")
plt.legend(legend, loc="upper left")
plt.show()


I am also curious to learn whether this is the right approach, or what I should be doing instead.

AaronSpieler commented 4 years ago

> I am trying to implement at my company a probabilistic model using gluon-ts. [...] However, the predictions seem to be rolled back by 90 days instead of predicting 90 days in the future. [...] What am I missing?

What you did is not incorrect; however, it is also not exactly what you wanted. You successfully trained a network and confirmed that the model predictions actually match the observations (as seen in your plot over the latest 90 days), which is the intended purpose of make_evaluation_predictions: https://gluon-ts.mxnet.io/api/gluonts/gluonts.evaluation.backtest.html?highlight=make_evaluation_predictions#gluonts.evaluation.backtest.make_evaluation_predictions.

If you want to predict into the future, use the predictor you got and call its predict method on your test data. See section 5 of the extended tutorial: https://gluon-ts.mxnet.io/examples/extended_forecasting_tutorial/extended_tutorial.html : "A Predictor defines the predictor.predict method of a given predictor. This method takes the test dataset, it passes it through the prediction network to take the predictions, and yields the predictions. You can think of the Predictor object as a wrapper of the prediction network that defines its predict method."
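For a concrete sense of what "predicting from the last given value" means here: if the dataset passed to predict ends at 2019-10-14 with daily frequency and prediction_length=90 (the numbers from the question above), the forecast covers the 90 days immediately after that date. A minimal stdlib sketch of that date arithmetic, with no GluonTS call involved:

```python
from datetime import date, timedelta

# Last observed date in the dataset passed to predict (from the question above)
target_end = date(2019, 10, 14)
prediction_length = 90  # matches the DeepAREstimator configuration

# The forecast window starts the day after the last observed value
forecast_dates = [target_end + timedelta(days=i) for i in range(1, prediction_length + 1)]
print(forecast_dates[0], forecast_dates[-1])  # 2019-10-15 2020-01-12
```

So a correctly issued forecast for this dataset should start at 2019-10-15, not overlap the last 90 observed days.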

sayonsom commented 4 years ago

@AaronSpieler Thanks for your reply. I am a little confused by the documentation here; I just need a tiny clarifying example.

I have a trained prediction network of type gluonts.model.predictor.RepresentableBlockPredictor. Using this, I am trying to forecast values for a future date. Now, the predict() method of this class takes a dataset parameter, which the documentation describes as:

The dataset containing the time series to predict.

What I am trying to understand is the ListDataset that I pass to the predict method: should it contain only the timestamps, or also the corresponding values? For example, suppose I have data until 07/10/2019, which I have used to build, train, and test the prediction network. Now I want to predict from 07/11/2019 on; can my ListDataset look like this:

| Timestamp  | measured_value |
|------------|----------------|
| 07/11/2019 | NaN            |
| 07/12/2019 | NaN            |
| 07/13/2019 | NaN            |
| ...        | ...            |

Also, the documentation does not give details about the difference between predict() and predict_item(). Your clarification would be greatly appreciated.

robsannaa commented 4 years ago

I support @sayonsom's question; I too believe the documentation falls a bit short in explaining how to predict unseen data, while it focuses extensively on model training and testing.

A small example showing how to forecast unseen data would be great!

jaheba commented 4 years ago

I think the problem is that our documentation is aimed more at people who want to develop new models than at people who want to use models for predictions.

Generally speaking, make_evaluation_predictions is not the right tool for making predictions. It is there to evaluate the accuracy of a model given a test dataset.


Our general idea is this:

predictor = Estimator(...).train(train_dataset)
predictions = predictor.predict(dataset)

A dataset is a collection of items, and an item is just a dictionary containing certain values. The format of these is also explained here.

One important thing is that the start field always marks the date-time of the first value in target (and all time-dependent fields).

When doing predictions, we do so from the last given value on.

For example, assuming monthly frequency:

# prediction_length = 3
forecasts = predictor.predict(
    ListDataset([{"start": "2020-01", "target": [1, 2, 3, 4, 5]}], freq="M")
)

The prediction would contain 3 values, for the months 2020-06, 2020-07, and 2020-08.

Generally, you want to pass in as much data as possible, since the algorithms take what they need from it.
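The start-plus-target arithmetic above can be sketched without GluonTS at all. Assuming monthly frequency as in the example, the forecast months are determined purely by the start field, the length of target, and prediction_length (forecast_months is a hypothetical helper written for illustration, not part of the GluonTS API):

```python
def forecast_months(start_year, start_month, target_len, prediction_length):
    """Return the (year, month) pairs a forecast would cover, given a monthly
    series starting at (start_year, start_month) with target_len values."""
    # Index of the first forecast month, counted in months since year 0;
    # prediction begins right after the last target value.
    first = start_year * 12 + (start_month - 1) + target_len
    return [((first + i) // 12, (first + i) % 12 + 1) for i in range(prediction_length)]

# "start": "2020-01", target of length 5, prediction_length 3
print(forecast_months(2020, 1, 5, 3))  # [(2020, 6), (2020, 7), (2020, 8)]
```

This matches the example: the last observed month is 2020-05, so the three forecast values land on 2020-06 through 2020-08.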


> Also, the documentation does not give details about the difference between predict and predict_item(). Your clarification will be greatly appreciated.

Often, networks can predict multiple items at the same time, so we decided that predict takes a whole dataset, for optimisation reasons. On the other hand, some algorithms can only predict one time series at a time. Thus, we also have predict_item in some places, which forecasts a single item, and for those Estimators predict is implemented to simply call predict_item internally.
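A hypothetical sketch of that relationship (the class and method bodies below are illustrative only, not GluonTS's actual implementation): a predictor that only knows how to forecast one item can still expose predict by iterating over the dataset and delegating to predict_item.

```python
from typing import Iterable, Iterator, List

class SingleItemPredictor:
    """Illustrative only: a predictor built around per-item forecasting."""

    def __init__(self, prediction_length: int) -> None:
        self.prediction_length = prediction_length

    def predict_item(self, item: dict) -> List[float]:
        # Naive placeholder forecast: repeat the last observed value.
        return [item["target"][-1]] * self.prediction_length

    def predict(self, dataset: Iterable[dict]) -> Iterator[List[float]]:
        # For this kind of predictor, predict is just a loop over predict_item.
        for item in dataset:
            yield self.predict_item(item)

predictor = SingleItemPredictor(prediction_length=3)
print(list(predictor.predict([{"target": [1.0, 2.0, 5.0]}])))  # [[5.0, 5.0, 5.0]]
```

Callers stick to predict either way, which is why that is the official contract.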

For now, the official contract is to use predict.