Justin1904 / TensorFusionNetworks

PyTorch implementation of Tensor Fusion Networks for multimodal sentiment analysis.

training process #4

Closed amirim closed 6 years ago

amirim commented 6 years ago

I ran the demo according to your instructions on Ubuntu: `python train.py --epochs 100 --patience 10`

and the results are:

```
Epoch 21 complete! Average Training loss: 0.358191079584
Validation loss is: 1.0758594363
Validation binary accuracy is: 0.685589519651
MAE on test set is 1.08729708925
Binary accuracy on test set is 0.69970845481
Precision on test set is 0.675958188153
Recall on test set is 0.631921824104
F1 score on test set is 0.653198653199
Seven-class accuracy on test set is 0.316326530612
Correlation w.r.t human evaluation on test set is 0.544894550488
```

So my question is: why did the process end after 21 epochs and not 100?

Justin1904 commented 6 years ago

It is common practice to set an early stopping criterion based on the validation set (you'll be familiar with this if you've used Keras before, which has a built-in callback for it). Concretely, during training we not only monitor the training loss but also track performance on the validation set. When the training loss keeps going down while the validation loss plateaus or rises, the model is likely starting to overfit. However, because mini-batch training introduces randomness, it would be reckless to stop immediately the first time the validation loss fails to decrease. Instead, the `--patience` argument defines how many epochs to wait without improvement before actually stopping training early.

In your case, the validation loss must have reached its lowest point at epoch 11 and never dropped below that value afterwards. Since `--patience` is set to 10, training continues for 10 more epochs, but in those 10 additional epochs the validation loss never improved on the previous best, so training stops early at epoch 21.
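For reference, here is a minimal sketch of the patience-based early stopping logic described above. It is not the actual code in `train.py`; `train_one_epoch` and `evaluate` are placeholder functions standing in for whatever runs one training pass and computes the validation loss, and the model is assumed to be a PyTorch `nn.Module`.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=10):
    """Train until validation loss stops improving for `patience` epochs."""
    best_val_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())  # checkpoint of the best model so far
    epochs_without_improvement = 0

    for epoch in range(1, max_epochs + 1):
        train_loss = train_one_epoch(model)   # placeholder: one pass over the training data
        val_loss = evaluate(model)            # placeholder: loss on the validation set

        if val_loss < best_val_loss:
            # Validation loss improved: save this checkpoint and reset the patience counter.
            best_val_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            # No improvement this epoch; stop once the patience budget is exhausted.
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                print(f"Early stopping at epoch {epoch} "
                      f"(best validation loss {best_val_loss:.4f})")
                break

    model.load_state_dict(best_state)  # restore the best checkpoint before returning
    return model
```

With `patience=10` and the best validation loss occurring at epoch 11, the counter reaches 10 at epoch 21, which is exactly the behavior reported above.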

amirim commented 6 years ago

Understood, many thanks for the detailed explanation.