awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Our evaluations page is not reproducible and misleading. #834

Open AaronSpieler opened 4 years ago

AaronSpieler commented 4 years ago

Description

We have the https://github.com/awslabs/gluon-ts/tree/master/evaluations page where we claim that: "The goal is to make reproducibility and comparison easier by versioning the code producing dataset as well as the model and evaluation code." However, as is, that is not the case.

Problems

Suggestions

There are a few possibilities here:

geoalgo commented 4 years ago

Hi Aaron,

"Our evaluations page is not reproducible and misleading".

This statement is perhaps a bit strong: as you say, the goal is stated as making "reproducibility and comparison easier by versioning the code producing dataset as well as the model and evaluation code." For that statement to be wrong or misleading, it would mean that reproducibility and comparison would be easier without this script :-)

That being said, I agree with your suggestions: they make a lot of sense and I believe they would be very impactful!

Btw: 3-5 runs per dataset, per method, that's a lot. Perhaps one seed is enough, and the results can be tracked over time.
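For reference, pinning a single seed before a run would look roughly like the sketch below (a minimal sketch for the MXNet-based GluonTS of that era; the `SEED` value and the exact set of RNGs to pin are assumptions, not something the evaluations script prescribes):

```python
import random

import mxnet as mx
import numpy as np

SEED = 42  # arbitrary; any fixed value works as long as it is recorded

# Pin every RNG the training code might draw from, so a single run is
# repeatable as long as the model code and its dependencies are unchanged.
random.seed(SEED)
np.random.seed(SEED)
mx.random.seed(SEED)
```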

AaronSpieler commented 4 years ago

@geoalgo True, it might be worded a bit strongly.

However, I remember issues being raised before about the reproducibility of our results on the page: people couldn't reproduce them, and thought the deviation was large enough to open an issue about it.

While it might be true that we technically only claim that we want to make it easier, this is of course understood as making things more, not less, reproducible. Given the concerns I raised above about how new values are entered, how they were generated, and how little transparency there is about their origin and context, I would personally find it hard, given the table and, say, a local run, to draw any meaningful conclusion about the performance of a model, or about the reasons for possible deviations.

So the question remains: is it reproducible? From my experiments and understanding: it is not.

And given that it is not reproducible, and that there are various problems with the methodology as mentioned above, it is also misleading.

So even though we might not claim that it is reproducible, the fact remains that it is not, and additionally, it is misleading.

AaronSpieler commented 4 years ago

@geoalgo The problem is that, even given the same seed, if we make changes to the model, the seed cannot guarantee that we get the same relative performance. I.e. since the runs suffer from high variance, which they do, running a model with a single seed would only work if there are no changes in either the model or its dependencies.

We need 3-5 runs to report meaningful averages and deviations. Anyone who wants to compare their values to the values reported by us could then do a single run and decide for themselves what it means, given the mean and deviation we report.
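To illustrate, a multi-seed evaluation could look roughly like this (a sketch assuming the MXNet-based DeepAREstimator API of that time; the dataset choice, epoch count, and the `mean_wQuantileLoss` metric key are illustrative assumptions, not what the evaluations script actually does):

```python
import mxnet as mx
import numpy as np

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer  # older releases: gluonts.trainer

dataset = get_dataset("m4_hourly")

scores = []
for seed in range(5):  # 3-5 runs, as argued above
    np.random.seed(seed)
    mx.random.seed(seed)

    estimator = DeepAREstimator(
        freq=dataset.metadata.freq,
        prediction_length=dataset.metadata.prediction_length,
        trainer=Trainer(epochs=10),
    )
    predictor = estimator.train(dataset.train)

    # Generate forecasts on the held-out test range and score them.
    forecast_it, ts_it = make_evaluation_predictions(
        dataset=dataset.test, predictor=predictor, num_samples=100
    )
    agg_metrics, _ = Evaluator()(ts_it, forecast_it)
    scores.append(agg_metrics["mean_wQuantileLoss"])

# Report mean and deviation so a single local run can be put in context.
print(f"mean_wQuantileLoss: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```

Reporting both the mean and the spread is what makes a single local run interpretable: a user can see whether their deviation falls within the variance we observed or signals a real discrepancy.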