NervanaSystems / deepspeech

DeepSpeech neon implementation
Apache License 2.0

nan cost #37

Open basant-kumar opened 7 years ago

basant-kumar commented 7 years ago

Hi, I'm getting a NaN cost after resuming training from the pre-trained model (librispeech_16_epochs.prm). The cost becomes NaN after epoch 16/17, and the testing results (run after each epoch) are null.

OS: Ubuntu 16.04
GPU: Nvidia Titan X Pascal (12 GB RAM)
Neon: version 1.9.0

tyler-nervana commented 7 years ago

Could you share a few more details about your setup? We haven't seen this behavior. What command are you running to continue training? Which dataset are you using? Is there anything different about your data compared to the LibriSpeech dataset?

gardenia22 commented 7 years ago

I am getting the same problem. My audio data are in WAV format rather than FLAC. Is this a problem? The following is my command:

```
python train.py --manifest train:data/train_1700hour.csv --manifest val:data/dev_1700hour.csv -e 20 -z 12 -s model/ds2_1700hour_20_epochs.prm --model_file model/librispeech_16_epochs.prm
```

gardenia22 commented 7 years ago

My transcription files contain '\n' characters, which leads to the NaN cost problem.

tyler-nervana commented 7 years ago

Thanks for the quick update. Currently anything in the transcript files is treated as a character, including "\n".
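
Since the loader treats every byte of a transcript as a label character, stripping trailing newlines before training avoids the blow-up. A minimal cleanup sketch, assuming one transcript text file per utterance (the `data/transcripts/*.txt` glob is a placeholder, not a path from this repo):

```python
# Minimal cleanup sketch: strip trailing newlines/whitespace from each
# transcript file so they are not treated as label characters.
# The glob pattern is a placeholder for wherever your manifest points.
import glob

for path in glob.glob("data/transcripts/*.txt"):
    with open(path) as f:
        text = f.read()
    cleaned = text.strip()
    if cleaned != text:
        with open(path, "w") as f:
            f.write(cleaned)
```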

pankaj2701 commented 7 years ago

I also hit the same problem when .wav files are used. When I converted the files to FLAC, the NaN cost problem did not appear.
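
A sketch of that conversion using the soundfile library (`pip install soundfile`); the `data/audio` directory is a placeholder for your own layout:

```python
# Sketch: batch-convert WAV audio to FLAC with the soundfile library.
# The data/audio directory is an assumed example path.
import glob
import os

import soundfile as sf

for wav_path in glob.glob("data/audio/*.wav"):
    audio, rate = sf.read(wav_path)
    flac_path = os.path.splitext(wav_path)[0] + ".flac"
    sf.write(flac_path, audio, rate)  # output format inferred from .flac extension
```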

tyler-nervana commented 7 years ago

Thanks for noticing the difficulty with .wav files. We'll take a look.

Drea1989 commented 7 years ago

Hello, I'm writing here because I've run into the NaN cost problem as well. I am using Neon 2.0 with Python 2.7 on Ubuntu 16.04, with a GTX 1080 as the backend.

In my case I am using the LibriSpeech train-other-500 set, and after 50-60% of the epoch the cost becomes NaN. I have tried training the model using only the other LibriSpeech packages, and it trains as expected. Any thoughts on this?

Drea1989 commented 6 years ago

I was able to fix the issue by dropping the learning rate by two orders of magnitude. The issue was apparently due to an infinite cost caused by a prediction being too confident about a very wrong value.
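
For reference, a sketch of what that change looks like with neon's GradientDescentMomentum optimizer. The baseline values here are illustrative, not the settings hard-coded in this repo's train.py, and the gradient-norm clip is an extra guard against the same blow-up, not part of the original fix:

```python
# Illustrative only: base_lr and momentum_coef are made-up baseline values,
# not the settings shipped in this repo's train.py.
from neon.optimizers import GradientDescentMomentum

base_lr = 2e-3
opt = GradientDescentMomentum(
    learning_rate=base_lr / 100.0,  # two orders of magnitude lower
    momentum_coef=0.99,
    gradient_clip_norm=400.0,       # optional: bounds exploding gradients
)
```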