awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Validation-based early stopping #1184

Open DayanSiddiquiNXD opened 3 years ago

DayanSiddiquiNXD commented 3 years ago

https://en.wikipedia.org/wiki/Early_stopping#Validation-based_early_stopping

Early stopping on the basis of validation loss: I have been looking for this in GluonTS but have not been able to find it. I have found learning rate (patience) based early stopping (https://github.com/awslabs/gluon-ts/issues/555 and https://github.com/awslabs/gluon-ts/pull/701), which is also great, but I think validation-loss-based early stopping would be a valuable addition.

kaijennissen commented 3 years ago

Couldn't this be achieved by setting patience=1 and learning_rate=minimum_learning_rate in the Trainer?
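
Something like this, I think (a sketch only, not tested; with patience=1 and the learning rate already at its minimum, the first epoch without improvement should end training):

```python
from gluonts.mx.trainer import Trainer

# Sketch: start the learning rate at its minimum so the first patience
# violation cannot decay it any further and training stops.
trainer = Trainer(
    epochs=200,                  # upper bound; the implicit early stopping should trigger first
    learning_rate=1e-3,
    minimum_learning_rate=1e-3,  # equal to learning_rate
    patience=1,                  # one epoch without improvement triggers the reduction
)
```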

DayanSiddiquiNXD commented 3 years ago

@kaijennissen it depends on which loss the patience window is applied to. I'm not too familiar with the source code, so I can't tell. But if the check in learning_rate_scheduler.py that compares the time since the loss last improved against the patience window also tests the validation loss, and stops when the validation loss has not improved, then yes, that would implement validation-based early stopping.

Can anyone confirm that it does?

kaijennissen commented 3 years ago

https://github.com/awslabs/gluon-ts/blob/7dabd947c961954c5c11e37cdc373950b930761c/src/gluonts/mx/trainer/_base.py#L364-L369

and

https://github.com/awslabs/gluon-ts/blob/7dabd947c961954c5c11e37cdc373950b930761c/src/gluonts/mx/trainer/_base.py#L380 in combination with

https://github.com/awslabs/gluon-ts/blob/e52864f7ee5d173dac38e7a984b9ea615397e2f2/src/gluonts/mx/trainer/learning_rate_scheduler.py#L124-L129

But I would be happy if someone else could confirm.

DayanSiddiquiNXD commented 3 years ago

@kaijennissen another problem with the LR approach is latching onto a local minimum. The validation loss curve over epochs is generally not as smooth as the theory suggests, so it will have local minima, and the LR approach will latch onto the first one. True validation-based early stopping would have a callback mechanism that allows the model to train well into overfit territory and then revert to the weights that produced the global optimum. This issue (https://github.com/awslabs/gluon-ts/issues/706) shows that callbacks aren't implemented, so I'm guessing that isn't possible right now.

Increasing the patience window will not really work either: while it decreases the chance of latching onto a local optimum, it will (a) still miss the global optimum if a later local optimum makes the model think it has "improved", and (b), without a callback mechanism, only be able to return the weights from epoch (optimal epoch + patience window), which is already overfit; as we increase the patience window to reduce the probability of capturing a local optimum, we overshoot the optimal set of weights by a larger margin.

While the LR approach will work in a pinch (I'm going to use it for my project, so thanks for highlighting it for me), I think this issue should remain open so that real callback-based validation-loss early stopping can be implemented.
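
For illustration, this is the behaviour I mean, as a framework-agnostic sketch (train_one_epoch, validation_loss and the weight getters/setters are hypothetical placeholders, not GluonTS API):

```python
import copy

def fit_with_early_stopping(model, train_data, val_data, max_epochs=200, patience=10):
    """Train past the optimum, then restore the best-validation weights."""
    best_loss = float("inf")
    best_weights = None
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)           # hypothetical helper
        loss = validation_loss(model, val_data)      # hypothetical helper

        if loss < best_loss:
            best_loss = loss
            best_weights = copy.deepcopy(model.get_weights())  # snapshot best so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                # allowed to overfit for up to `patience` epochs

    model.set_weights(best_weights)                  # revert to the best epoch, not the last one
    return model
```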

PascalIversen commented 3 years ago

#1168 implements callbacks. It is not merged yet, but you can pull from my branch. This example explains how custom callbacks can be implemented on that branch. As an example, I implemented an early stopping callback which monitors a metric of interest; maybe that is useful to you.
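
Roughly, such a callback could look like this (the hook name, its signature, and the stop-by-returning-False convention are assumptions here; the example linked above shows the actual interface on that branch):

```python
class ValidationEarlyStopping:
    """Sketch of an early stopping callback: stop once the monitored
    validation loss has not improved for `patience` epochs."""

    def __init__(self, patience: int = 10):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_since_best = 0

    # assumed hook, called after each pass over the validation data
    def on_validation_epoch_end(self, epoch_no, epoch_loss, training_network, trainer) -> bool:
        if epoch_loss < self.best_loss:
            self.best_loss = epoch_loss
            self.epochs_since_best = 0
        else:
            self.epochs_since_best += 1
        # assumed convention: returning False asks the training loop to stop
        return self.epochs_since_best < self.patience
```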

DayanSiddiquiNXD commented 3 years ago

@PascalIversen that is definitely very helpful, thanks a lot. I'll see if I can use it to add true validation-based early stopping to my model, and if I can, then I guess this issue can be closed.

DAT-FYAYC commented 3 years ago

@DayanSiddiquiNXD did you get true validation-based early stopping running with callbacks yet? It would be great to hear whether this is working.

bradyneal commented 3 years ago

@DayanSiddiquiNXD or any maintainers, any chance you got early stopping implemented? It is a bit odd that there isn't already a simple flag for this, no?

DayanSiddiquiNXD commented 3 years ago

@davidtiefenthaler yes, I did. I followed @PascalIversen's advice to pull from his branch. The only thing I needed to change in the source code was the import of utils, since there were two places to import it from and the branch was importing from the wrong one. But this was a while ago, so it may have been fixed by now.

lostella commented 3 years ago

@bradyneal an early stopping mechanism already exists in the Trainer class, see here. It is still a bit too implicit, but essentially the learning rate reduction mechanism stops the training loop as soon as the learning rate goes below minimum_learning_rate. So one can play with the trainer options there to tune how aggressive early stopping should be.

There's PR #1168 that proposes a more explicit set of callbacks, with which one can customize the stopping condition more easily. We hope to get that merged soon.
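
For example (values are illustrative; a smaller patience, a smaller decay factor, and a higher minimum_learning_rate all make the implicit early stopping more aggressive):

```python
from gluonts.mx.trainer import Trainer

trainer = Trainer(
    epochs=300,
    learning_rate=1e-3,
    learning_rate_decay_factor=0.5,  # halve the learning rate on each patience violation
    patience=10,                     # epochs without improvement before each reduction
    minimum_learning_rate=5e-5,      # training stops once the rate would fall below this
)
```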