Epoch: [4][160156/161048] training diverged...

YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".

BSD 3-Clause "New" or "Revised" License

Epoch: [4][160156/161048] training diverged... #102

Open xiaoli1996 opened 1 year ago

xiaoli1996 commented 1 year ago

Hi Yuan Gong, great job! I used the same hyperparameters as your GitHub code, but at "Epoch: [4][160156/161048]" training reported "Train Loss is nan".

The results of my first 3 epochs are 0.415, 0.439, 0.447, compared with the results given in your log: 0.415, 0.439, 0.448, 0.449, 0.449.

My torch version is 2.0.0. Why does this happen?

YuanGongND commented 1 year ago

hi there,

The nan error can be due to an overflow/underflow - it is hard for me to identify the exact reason. It might be related to pytorch and hardware.
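To illustrate the overflow case mentioned above, here is a minimal sketch in plain Python (not taken from the AST code base). It shows how a value that overflows to `inf` turns into `nan` once it is subtracted from itself, and how a training loop could check the loss each step so the failure is caught as soon as it appears. The helper name `check_finite_loss` and its arguments are hypothetical.

```python
import math


def check_finite_loss(loss_value, epoch, step):
    """Return True if the loss is finite; otherwise report where it broke.

    Hypothetical helper: call it once per step with the scalar loss
    (for a PyTorch tensor, that would be loss.item()).
    """
    if math.isfinite(loss_value):
        return True
    print(f"Epoch: [{epoch}][{step}] non-finite loss: {loss_value}")
    return False


# Float overflow: the product exceeds the largest double and becomes inf.
big = 1e308 * 10

# inf - inf has no defined value, so the result is nan.
nan_value = big - big

# A normal loss passes the check; the nan value is reported.
ok = check_finite_loss(0.5, epoch=4, step=1)
bad = check_finite_loss(nan_value, epoch=4, step=160156)
```

Stopping at the first non-finite loss keeps the last good checkpoint usable and narrows down which batch triggered the problem.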

You could try two workarounds:

1. Run the experiment again and see whether the error recurs.
2. We used lower torch and torchaudio versions in 2021. Please see https://github.com/YuanGongND/ast/blob/master/requirements.txt; you could try creating a virtual environment with our versions.
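Before rebuilding the environment, it can help to see which installed packages differ from the pinned ones. The following is a rough sketch using only the standard library. It assumes the requirements file uses `name==version` lines, and the function names `parse_pins` and `find_mismatches` are hypothetical, not part of the repository.

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Collect exact pins (name==version) from requirements-style text."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if "==" not in line:
            continue  # skip blank lines and non-exact specifiers
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def find_mismatches(pins):
    """Map each package whose installed version differs from its pin
    to a (pinned, installed) tuple; installed is None if it is absent."""
    mismatches = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches
```

Calling `find_mismatches(parse_pins(open("requirements.txt").read()))` from the repository root would list the differences. The comparison is a plain string match, so a build with a local suffix (for example a CUDA tag after the version number) is also reported as a mismatch.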

-Yuan

xiaoli1996 commented 1 year ago

Thanks for the suggestion. I will run it with a lower version of torch.