awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Improving the structure of test datasets #1447

Open lostella opened 3 years ago

lostella commented 3 years ago

Currently, datasets used for experiments are provided with a fixed training/test split and some metadata. The implicit assumption is that training data will be used to train the model, and that the model will be evaluated on the last prediction_length observations in the test series, where prediction_length is contained in the metadata.

We have seen this implicit assumption create uncertainty in users ("how should I use the test data? what does make_evaluation_predictions do under the hood? should I slice off the final part? should I feed the training data to the model and compare the output with test?"), and it could be removed.
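For reference, the workflow today looks roughly like this (a sketch: import paths differ across versions, and predictor stands for any trained model):

from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions

forecast_it, ts_it = make_evaluation_predictions(
    dataset=dataset.test,   # full test series, forecast window included
    predictor=predictor,
    num_samples=100,
)
agg_metrics, item_metrics = Evaluator()(ts_it, forecast_it)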

Instead of providing "experiment" data as
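dataset.train       # training series
dataset.test        # typically the same series, extended by prediction_length extra observations
dataset.metadata    # prediction_length, freq, ...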

One could have (names are just placeholders)
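dataset.train
dataset.test_input              # test series cut right before the forecast window
dataset.test_expected_output    # the last prediction_length observations of each test series
dataset.metadata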

Here test_expected_output is exactly prediction_length long. This would make things more explicit:

forecasts = model.predict(test_input)
metrics = evaluator(forecasts, test_expected_output)

and the gluonts.evaluation.backtest module would essentially become obsolete.

Edit: More thoughts on this. The change could be done in a backwards-compatible way, by keeping (but deprecating) the test property of datasets and defining the split fields in terms of test. Similarly, the gluonts.evaluation.backtest utilities could initially be kept but deprecated.
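For example, the split fields could be defined in terms of test roughly like this (just a sketch, the helper below is hypothetical):

def split_test_entry(entry, prediction_length):
    # keep everything up to the forecast window as model input,
    # and return the forecast window itself as the expected output
    input_entry = dict(entry)
    input_entry["target"] = entry["target"][:-prediction_length]
    expected_output = entry["target"][-prediction_length:]
    return input_entry, expected_output

test_input, test_expected_output = zip(
    *(split_test_entry(e, prediction_length) for e in dataset.test)
)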

jaheba commented 3 years ago

I think what's weird is that we have to stitch together input, predictions and the expected values.

Internally, I guess we would like to work with a format like this for test-data:

{"target": [1, 2, 3, 4], "true-values": [5, 6, 7]}

We could then add the prediction to the dict, basically treat it as a simple transformation:

{"target": [1, 2, 3, 4], "true-values": [5, 6, 7], "prediction": [5, 5, 8]}

That would simplify a lot of our evaluation code.
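Something like the following generator, as a sketch (none of this exists today, and the dict values are schematic):

def add_predictions(test_data, predictor):
    # hide "true-values" from the model, then attach its forecast to each entry
    inputs = [
        {k: v for k, v in entry.items() if k != "true-values"}
        for entry in test_data
    ]
    for entry, forecast in zip(test_data, predictor.predict(inputs)):
        yield {**entry, "prediction": forecast}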

lostella commented 3 years ago

I think what's weird is that we have to stitch together input, predictions and the expected values.

That's just zip!

jaheba commented 3 years ago

Yes, but why does the user have to do it?

lostella commented 3 years ago

What I don't like about putting the ground truth together with the rest is that the model may look into it. In

forecasts = model.predict(test_input)
metrics = evaluator(forecasts, test_expected_output)

the model cannot possibly look into it. Also, no zipping is done by the user; it's the evaluation code that does it.

jaheba commented 3 years ago

But more importantly, this would simplify:

https://github.com/awslabs/gluon-ts/blob/76fb746121e8b67c4b6b59db01f8ad682a3005e5/src/gluonts/evaluation/metrics.py#L62-L68

Currently, each metric function is called with specific arguments. Instead, we could pass an object containing all the information, and each metric then picks the fields it needs.
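Schematically, instead of a signature like mse(target, forecast), a metric could receive the combined dict and pick what it needs (a sketch, not the actual code at the linked lines):

import numpy as np

def mse(data: dict) -> float:
    # the metric selects the fields it cares about from the combined entry
    true_values = np.asarray(data["true-values"])
    prediction = np.asarray(data["prediction"])
    return float(np.mean((true_values - prediction) ** 2))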

jaheba commented 3 years ago
forecasts = model.predict(test_input)
metrics = evaluator(forecasts, test_expected_output)

For me, this interface looks like bad design. As a user, I have to manage test_input and then make sure that test_expected_output is aligned with it correctly.

Why can't we just do:

model.evaluate(test_data)

lostella commented 3 years ago

Why can't we just do:

model.evaluate(test_data)

Because the model's task is to produce predictions, not to evaluate them. You usually evaluate multiple models and pick the best: if each model does evaluation in its own way, then that's confusing.

lostella commented 3 years ago

For me this interface looks like bad design. As a user I have to manage the test_input and then ensure that the expected_output is aligned correctly.

I find it very explicit (== good) instead