awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Our evaluations page is not reproducible and misleading. #834

Open AaronSpieler opened 4 years ago

AaronSpieler commented 4 years ago

Description

We have the https://github.com/awslabs/gluon-ts/tree/master/evaluations page where we claim that: "The goal is to make reproducibility and comparison easier by versioning the code producing dataset as well as the model and evaluation code." However, as is, that is not the case.

Problems

Suggestions

There are a few possibilities here:

geoalgo commented 4 years ago

Hi Aaron,

"Our evaluations page is not reproducible and misleading".

This statement is perhaps a bit strong: as you say, the goal is stated as making "reproducibility and comparison easier by versioning the code producing dataset as well as the model and evaluation code." For that statement to be wrong or misleading, it would mean that reproducibility and comparison would be easier without this script :-)

That being said, I agree with your suggestions: they make a lot of sense and I believe they would be very impactful!

Btw: 3-5 runs per dataset, per method, that's a lot. Perhaps one seed is enough, and the results can be tracked over time.
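For reference, pinning a single seed before a run would look roughly like the sketch below (a minimal sketch for the MXNet-based GluonTS of that era; the `SEED` value and the exact set of RNGs to pin are assumptions, not something the evaluations script prescribes):

```python
import random

import mxnet as mx
import numpy as np

SEED = 42  # arbitrary; any fixed value works as long as it is recorded

# Pin every RNG the training code might draw from, so a single run is
# repeatable as long as the model code and its dependencies are unchanged.
random.seed(SEED)
np.random.seed(SEED)
mx.random.seed(SEED)
```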

AaronSpieler commented 4 years ago

@geoalgo True, it might be worded a bit strongly.

However, I remember issues being raised before about the reproducibility of our results on the page: people couldn't reproduce them, and thought the deviation was large enough to open an issue about it.

While it might be true that we technically only claim that we want to make it easier, this is of course understood as making things more, not less, reproducible. Given the concerns I raised above about how new values are entered, how they were generated, and how little transparency there is about their origin and context, I would personally find it hard, given the table and, say, a local run, to draw any meaningful conclusion about the performance of a model, or about the reasons for possible deviations.

So the question remains: is it reproducible? From my experiments and understanding: it is not.

And given that it is not reproducible, and that there are various problems with the methodology as mentioned above, it is also misleading.

So even though we might not claim that it is reproducible, the fact remains that it is not, and additionally, it is misleading.

AaronSpieler commented 4 years ago

@geoalgo The problem is that, even given the same seed, if we make changes to the model, the seed cannot guarantee that we get the same relative performance. I.e. since the runs suffer from high variance, which they do, running a model with a single seed would only work if there are no changes in either the model or its dependencies.

We need 3-5 runs to report meaningful averages and deviations. Anyone who wants to compare their values to the values reported by us could then do a single run and decide for themselves what it means, given the mean and deviation we report.
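To illustrate, a multi-seed evaluation could look roughly like this (a sketch assuming the MXNet-based DeepAREstimator API of that time; the dataset choice, epoch count, and the `mean_wQuantileLoss` metric key are illustrative assumptions, not what the evaluations script actually does):

```python
import mxnet as mx
import numpy as np

from gluonts.dataset.repository.datasets import get_dataset
from gluonts.evaluation import Evaluator
from gluonts.evaluation.backtest import make_evaluation_predictions
from gluonts.model.deepar import DeepAREstimator
from gluonts.mx.trainer import Trainer  # older releases: gluonts.trainer

dataset = get_dataset("m4_hourly")

scores = []
for seed in range(5):  # 3-5 runs, as argued above
    np.random.seed(seed)
    mx.random.seed(seed)

    estimator = DeepAREstimator(
        freq=dataset.metadata.freq,
        prediction_length=dataset.metadata.prediction_length,
        trainer=Trainer(epochs=10),
    )
    predictor = estimator.train(dataset.train)

    # Generate forecasts on the held-out test range and score them.
    forecast_it, ts_it = make_evaluation_predictions(
        dataset=dataset.test, predictor=predictor, num_samples=100
    )
    agg_metrics, _ = Evaluator()(ts_it, forecast_it)
    scores.append(agg_metrics["mean_wQuantileLoss"])

# Report mean and deviation so a single local run can be put in context.
print(f"mean_wQuantileLoss: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```

Reporting both the mean and the spread is what makes a single local run interpretable: a user can see whether their deviation falls within the variance we observed or signals a real discrepancy.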