baidu-research / ba-dls-deepspeech


Loss value becomes NaN after 7 epochs in training phase #28

Open faruk-ahmad opened 7 years ago

faruk-ahmad commented 7 years ago

We are using this implementation to train our own model. We preprocessed the dataset with the provided scripts, but after 7 epochs of training the loss becomes 'nan'. What could be the possible cause? Here is the last part of the training log file:

2017-08-14 17:43:18,105 INFO (main) Epoch: 6, Iteration: 50, Loss: 79.07887268066406
2017-08-14 17:44:07,976 INFO (data_generator) Iters: 6
2017-08-14 17:44:25,415 INFO (utils) Checkpointing model to: ./model/
2017-08-14 17:44:25,805 INFO (data_generator) Iters: 54
2017-08-14 17:44:37,734 INFO (main) Epoch: 7, Iteration: 0, Loss: 86.4606704711914
2017-08-14 17:47:57,033 INFO (main) Epoch: 7, Iteration: 10, Loss: 79.75791931152344
2017-08-14 17:51:46,978 INFO (main) Epoch: 7, Iteration: 20, Loss: 81.86383819580078
2017-08-14 17:56:00,494 INFO (main) Epoch: 7, Iteration: 30, Loss: 83.92363739013672
2017-08-14 18:00:55,395 INFO (main) Epoch: 7, Iteration: 40, Loss: 71.31178283691406
2017-08-14 18:06:30,210 INFO (main) Epoch: 7, Iteration: 50, Loss: 85.3790054321289
2017-08-14 18:08:03,423 INFO (data_generator) Iters: 6
2017-08-14 18:08:27,113 INFO (utils) Checkpointing model to: ./model/
2017-08-14 18:08:27,578 INFO (data_generator) Iters: 54
2017-08-14 18:08:57,878 INFO (main) Epoch: 8, Iteration: 0, Loss: 61.189476013183594
2017-08-14 18:14:00,523 INFO (main) Epoch: 8, Iteration: 10, Loss: 98.21914672851562
2017-08-14 18:18:31,384 INFO (main) Epoch: 8, Iteration: 20, Loss: 84.95768819580078
2017-08-14 18:23:51,395 INFO (main) Epoch: 8, Iteration: 30, Loss: nan

N.B. We are training on a CPU machine (Core i5, 32 GB RAM).

Any help would be appreciated. Thanks in advance.
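In case it helps narrow this down, here is a minimal sketch of the kind of guards we are considering adding around the training loop. This is generic NumPy code, not this repository's API; the helper names (clip_gradients, loss_is_finite, label_fits_ctc) are our own, and the thresholds are illustrative. It covers the usual suspects for NaN CTC loss: exploding recurrent gradients, a non-finite loss value being allowed to update the weights, and utterances whose transcripts are longer than the network's output sequence.

```python
import numpy as np

def clip_gradients(grads, max_norm=100.0):
    # Rescale a list of gradient arrays so their global L2 norm is at most
    # max_norm; exploding RNN gradients are a common source of NaN loss.
    total_norm = np.sqrt(sum(float(np.sum(np.square(g))) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads

def loss_is_finite(loss_value):
    # Return False for NaN/inf so the caller can skip the weight update
    # (or reload the last checkpoint) instead of corrupting the model.
    return bool(np.isfinite(loss_value))

def label_fits_ctc(output_time_steps, label_length):
    # CTC loss goes to inf (and its gradient to NaN) whenever the target
    # transcript is longer than the output sequence for that utterance,
    # so such utterances should be filtered out of the minibatches.
    return label_length <= output_time_steps

# Example usage with illustrative values: drop bad utterances before
# batching, and skip any update whose loss is not finite.
assert label_fits_ctc(output_time_steps=200, label_length=35)
assert loss_is_finite(85.379)
```

Lowering the learning rate after a few epochs is another option we are looking at, since the loss curve plateaus around the point where the NaN appears.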

zhangdapeng1207 commented 7 years ago

I have the same problem. Did you figure out how to solve it? Thanks!