YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Epoch: [4][160156/161048] training diverged... #102

Open xiaoli1996 opened 1 year ago

xiaoli1996 commented 1 year ago

Hi Yuan Gong, great job! I used the same hyperparameters as your GitHub code, but at "Epoch: [4][160156/161048]" training reports "Train Loss is nan".

The results of the first 3 epochs are 0.415, 0.439, 0.447, compared with the results given in your log: 0.415, 0.439, 0.448, 0.449, 0.449.

My torch version is 2.0.0. Why does this happen?

xiaoli1996 commented 1 year ago

(image attachment)

YuanGongND commented 1 year ago

hi there,

The NaN loss can be caused by a numerical overflow/underflow, but it is hard for me to identify the exact reason. It might be related to the PyTorch version and the hardware.
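To make the overflow-to-NaN path concrete, here is a minimal sketch in plain Python (no PyTorch needed) of how an overflowed value turns into NaN, and how a per-step check can find the first bad step. The helper `first_non_finite` and the sample loss values are hypothetical illustrations, not part of the AST code or the training log in this issue.

```python
import math


def first_non_finite(losses):
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for step, loss in enumerate(losses):
        if not math.isfinite(loss):
            return step
    return None


# Overflow in float64: the product exceeds the largest finite value and becomes inf.
overflowed = 1e308 * 10.0
# inf - inf is undefined, so IEEE 754 arithmetic yields NaN.
nan_value = overflowed - overflowed

# Hypothetical per-step loss values; the last one stands in for a diverged step.
losses = [0.69, 0.41, 0.39, nan_value]
bad_step = first_non_finite(losses)
print(bad_step)
```

In a PyTorch training loop, the analogous check would presumably be `torch.isfinite(loss)` right after the loss is computed, so the offending batch can be logged before the NaN spreads into the weights.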

You could try two workarounds:

-Yuan

xiaoli1996 commented 1 year ago

Thanks for the suggestion. I will rerun it with a lower version of torch.