d2l-ai / d2l-en

Interactive deep learning book with multi-framework code, math, and discussions. Adopted at 500 universities from 70 countries including Stanford, MIT, Harvard, and Cambridge.
https://D2L.ai

Sudden drops in plots of loss/accuracy with extended training of a neural network #1123

Closed. NishantTharani closed this issue 4 years ago.

NishantTharani commented 4 years ago

Following along with the concise implementation of multilayer perceptrons, I then tried to train a neural network with one extra hidden layer, for 50 epochs instead of 10. The resulting plot of training loss / train acc / test acc exhibits sudden drops:

[Plot: training loss, train acc, and test acc over 50 epochs, showing a sudden drop]

It does not always look like this - sometimes there are no drops and sometimes there is a drop and then a recovery followed by another drop, etc:

[Plot: another run, showing a drop, a recovery, and then another drop]

Comments by @AnirudhDagar on a forum post I made about this indicate that it could be an issue related to the plot function.

Here is a Jupyter notebook containing the code I ran: https://github.com/NishantTharani/GitSharing/blob/master/concise_multilayer_perceptrons.ipynb
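Roughly, the setup boils down to something like the following (a minimal sketch assuming the PyTorch version of the chapter and the d2l helper names used at the time; the width of the extra hidden layer here is only illustrative, the notebook above has the exact code):

```python
import torch
from torch import nn
from d2l import torch as d2l

# Concise MLP from the chapter, with one extra hidden layer added
net = nn.Sequential(nn.Flatten(),
                    nn.Linear(784, 256), nn.ReLU(),
                    nn.Linear(256, 256), nn.ReLU(),  # the additional hidden layer
                    nn.Linear(256, 10))

def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.normal_(m.weight, std=0.01)

net.apply(init_weights)

# Chapter defaults at the time: lr=0.5, batch_size=256; epochs raised from 10 to 50
batch_size, lr, num_epochs = 256, 0.5, 50
loss = nn.CrossEntropyLoss()
trainer = torch.optim.SGD(net.parameters(), lr=lr)

train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)
d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
```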

AnirudhDagar commented 4 years ago

Hi @NishantTharani, I was able to verify the issue and reproduce your results. The sudden drops occur because the training loss becomes NaN after a few epochs. You can fix this easily by using a smaller learning rate, which keeps the loss under control. A good default for SGD is lr=0.01 or 0.05. Please try these and report back your results. :)
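Concretely, against the chapter's PyTorch code the change is just the learning rate passed to the optimizer, something like:

```python
# The chapter constructs the optimizer with lr=0.5; lowering it is the suggested fix.
trainer = torch.optim.SGD(net.parameters(), lr=0.05)  # or lr=0.01
```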

astonzhang commented 4 years ago

@AnirudhDagar

> 1141, sum may cause the result wrong.

This result was obtained after adding one additional layer without re-tuning hyperparameters. How could sum cause this result? Could you please be clearer? Thanks.

astonzhang commented 4 years ago

@StevenJokes see my comments in https://github.com/d2l-ai/d2l-en/pull/1176

AnirudhDagar commented 4 years ago

@goldmermaid and I discussed this issue, and she suggested it is probably due to the loss becoming NaN. I later verified that with a reduced learning rate, as can be seen in my earlier comment. This probably hints at an exploding gradients issue. What do you think, @astonzhang? I don't understand what Steven is suggesting.
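For anyone who wants to check the NaN / exploding-gradient hypothesis directly, here is a rough sketch of a manual training epoch (illustrative only, not the chapter's d2l.train_ch3 helper; the max_norm value is arbitrary):

```python
import torch

def train_epoch_clipped(net, train_iter, loss, trainer, max_norm=1.0):
    """One manual training epoch that flags NaN losses and clips gradients.
    Illustrative only; the chapter itself trains with d2l.train_ch3."""
    for X, y in train_iter:
        trainer.zero_grad()
        l = loss(net(X), y)
        if torch.isnan(l):
            print('training loss became NaN on this batch')
        l.backward()
        # Cap the gradient norm so a single bad step cannot blow up the weights
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=max_norm)
        trainer.step()
```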

astonzhang commented 4 years ago

@AnirudhDagar Thanks for checking. When modifying architectures, hyperparameters (e.g., lr) may need to be re-tuned.
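For example, a quick (hypothetical) way to re-tune just the learning rate, reusing the names from the sketch in the first comment:

```python
# Re-create the optimizer for each candidate learning rate and compare the curves.
# net, init_weights, loss, train_iter, test_iter, and num_epochs come from the
# sketch above; the candidate values are illustrative.
for lr in (0.5, 0.1, 0.05, 0.01):
    net.apply(init_weights)  # re-initialize weights for a fair comparison
    trainer = torch.optim.SGD(net.parameters(), lr=lr)
    d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, trainer)
```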

AnirudhDagar commented 4 years ago

@astonzhang Should we update the code in the chapter to use lr=0.05 instead of lr=0.5? Or should we leave this for readers to figure out?

astonzhang commented 4 years ago

> Should we update the code in the chapter to use lr=0.05 instead of lr=0.5? Or should we leave this for readers to figure out?

When you changed it to 0.05, what acc did you get?

astonzhang commented 4 years ago

@AnirudhDagar nvm, I just tested it and modified it to 0.1.

NishantTharani commented 4 years ago

Hi @AnirudhDagar, sorry for the very late reply, and thank you for investigating. I tried to reproduce it, but for some reason couldn't step through to a point where the training loss became NaN.

In any case, I tried changing the learning rate to 0.05 and the problem went away, so I guess that's it.