lostella opened this issue 3 years ago
I think what's weird is that we have to stitch together input, predictions and the expected values.
Internally, I guess we would like to work with a format like this for test-data:
{"target": [1, 2, 3, 4], "true-values": [5, 6, 7]}
We could then add the prediction to the dict, basically treat it as a simple transformation:
{"target": [1, 2, 3, 4], "true-values": [5, 6, 7], "prediction": [5, 5, 8]}
That would simplify a lot of our evaluation code.
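A minimal sketch of that transformation (the field names and the add_prediction helper are placeholders I'm assuming for illustration, not existing GluonTS code):

```python
# Sketch only: assumes each test entry is a dict with "target" (the input window)
# and "true-values" (the held-out values to score against).
def add_prediction(entry: dict, predict_fn) -> dict:
    """Treat prediction as a plain transformation: read the input, add one field."""
    return {**entry, "prediction": predict_fn(entry["target"])}

entry = {"target": [1, 2, 3, 4], "true-values": [5, 6, 7]}
entry = add_prediction(entry, lambda target: [5, 5, 8])  # stand-in for a real model
# evaluation code can now compare entry["prediction"] with entry["true-values"]
```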
> I think what's weird is that we have to stitch together input, predictions and the expected values.

That's just zip!
Yes, but why does the user have to do it?
What I don't like about putting the ground truth together with the rest is that the model may look into it. In
forecasts = model.predict(test_input)
metrics = evaluator(forecasts, test_expected_output)
the model cannot possibly look into it. Also, no zipping is done by the user; it's the evaluation code that does it.
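To illustrate where that zipping would live (a sketch, not the actual GluonTS Evaluator):

```python
# The alignment happens inside the evaluation code, never on the user side.
def evaluate(forecasts, test_expected_output):
    errors = []
    for forecast, expected in zip(forecasts, test_expected_output):
        # absolute error as a stand-in for real metrics
        errors.append([abs(f - e) for f, e in zip(forecast, expected)])
    return errors
```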
But more importantly, this would simplify things: currently, each metric function is called with specific arguments. Instead, we could pass the object containing all the information and let each metric pick the pieces it needs.
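A sketch of what that could look like (metric names, field names, and signatures here are assumptions, not the current Evaluator API):

```python
import numpy as np

# Each metric gets the full record and picks out only what it needs.
def mae(data: dict) -> float:
    return float(np.mean(np.abs(np.asarray(data["prediction"]) - np.asarray(data["true-values"]))))

def scaled_error(data: dict) -> float:
    # also needs the historical target, which other metrics can simply ignore
    scale = float(np.mean(np.abs(np.diff(data["target"]))))
    return mae(data) / scale

record = {"target": [1, 2, 3, 4], "true-values": [5, 6, 7], "prediction": [5, 5, 8]}
results = {fn.__name__: fn(record) for fn in (mae, scaled_error)}
```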
> forecasts = model.predict(test_input)
> metrics = evaluator(forecasts, test_expected_output)
For me this interface looks like bad design. As a user I have to manage the test_input and then ensure that the expected_output is aligned correctly.
Why can't we just do:
model.evaluate(test_data)
> Why can't we just do:
> model.evaluate(test_data)
Because the model's task is that of giving predictions, not evaluating them. You usually evaluate multiple models and pick the best: if each model does it its own way, then that's confusing.
> For me this interface looks like bad design. As a user I have to manage the test_input and then ensure that the expected_output is aligned correctly.
I find it very explicit (== good) instead
Currently, datasets used for experiments are provided with a fixed training/test split and some metadata. The implicit assumption is that training data will be used to train the model, and that the model will be evaluated on the last `prediction_length` observations in the test series, where `prediction_length` is contained in the metadata.

We have seen this implicit assumption create uncertainty in users ("how should I use the test data? what does `make_evaluation_predictions` do under the hood? should I slice the final part? should I input the training data to the model, and compare the output with test?"), and it could be removed.

Instead of providing "experiment" data as

- `train`
- `test`

one could have (names are just placeholders)

- `train`
- `test_input`
- `test_expected_output`

where `test_expected_output` is exactly `prediction_length` long. This would make things more explicit, and the `gluonts.evaluation.backtest` module would essentially become obsolete.
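For example, usage could look roughly like this (a sketch based on the placeholder names above; `dataset.test_input`, `dataset.test_expected_output`, and the `estimator`/`evaluator` objects are not an existing API):

```python
predictor = estimator.train(dataset.train)

# the model only ever sees the input part of the test data
forecasts = predictor.predict(dataset.test_input)

# each expected-output entry is exactly prediction_length long, so no manual
# slicing or alignment is needed before computing metrics
metrics = evaluator(forecasts, dataset.test_expected_output)
```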
Edit: More thoughts on this. The change could be done in a backwards-compatible way, by keeping (but deprecating) the `test` property of datasets, and defining the split fields in terms of `test`. Similarly for the `gluonts.evaluation.backtest` utilities: they could be deprecated initially.
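For instance, the new split fields could be derived from the existing `test` entries roughly like this (a sketch assuming each test entry is a dict with a "target" array; not actual GluonTS code):

```python
# Derive the proposed split from the deprecated `test` entries.
def split_entry(entry: dict, prediction_length: int):
    target = entry["target"]
    test_input = {**entry, "target": target[:-prediction_length]}
    test_expected_output = target[-prediction_length:]
    return test_input, test_expected_output
```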