NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Jasper training from scratch `invalid value encountered in true_divide` #325

Closed darraghdog closed 5 years ago

darraghdog commented 5 years ago

Describe the bug

Training the Jasper model from scratch leads to NaN values after roughly 5,000 steps. I tried two separate runs, and both failed at about the same step count. Mixed precision is turned off for these runs.

*** Epoch 0, global step 5210: ***     Train loss: 68.4540
time per step = 0:00:2.114
/opt/OpenSeq2Seq/open_seq2seq/data/speech2text/speech_utils.py:180: RuntimeWarning: invalid value encountered in true_divide
  features = (features - m) / s
*** Epoch 0, global step 5220: ***     Train loss: 202.0522
time per step = 0:00:2.173
*** Epoch 0, global step 5230: ***     Train loss: 150.4963
time per step = 0:00:2.152

Log file: https://drive.google.com/file/d/1ephybwlzUsmw4fsZCss99RtpQdXeZhmv/view?usp=sharing
Config: https://drive.google.com/file/d/13Hk3BagDqM2f0fI9Kann-zF1GeRDrJtD/view?usp=sharing

System Information

$ nvidia-smi
Sun Dec 23 12:10:54 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   40C    P0    70W / 300W |  15732MiB / 16130MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
.... etc.
$ cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
.... etc.

borisgin commented 5 years ago

Looks like two related issues:

1) The NaN in the summary histogram can be caused by a high learning rate. Can you try reducing it by 4x, please?
2) The bug in speech pre-processing is probably related to a very short speech sequence in your dataset:

/opt/OpenSeq2Seq/open_seq2seq/data/speech2text/speech_utils.py:180: RuntimeWarning: invalid value encountered in true_divide
  features = (features - m) / s

BUG 1:
"tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: Loss_Optimization/variables/ForwardPass/w2l_encoder/conv11/bn/gamma_0 [[Node: Loss_Optimization/variables/ForwardPass/w2l_encoder/conv11/bn/gamma_0 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss_Optimization/variables/ForwardPass/w2l_encoder/conv11/bn/gamma_0/tag, ForwardPass/w2l_encoder/conv11/bn/gamma/read/_3947)]] [[Node: ForwardPass/w2l_encoder/conv33/res/kernel/read/_3910 = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_816_ForwardPass/w2l_encoder/conv33/res/kernel/read", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

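The `true_divide` warning in point 2 is what NumPy emits when `s` is zero, which happens if a clip is so short (or so constant) that its features have no variance. A minimal sketch of a guarded normalization follows; the function name, the per-feature `axis=0` statistics, and the `eps` value are assumptions for illustration, not the code in `speech_utils.py`.

```python
import numpy as np


def normalize_features(features, eps=1e-5):
    """Mean/variance-normalize features of shape (time, num_features).

    Hypothetical helper: `eps` and the axis choice are illustrative
    assumptions, not values taken from OpenSeq2Seq.
    """
    m = np.mean(features, axis=0)
    s = np.std(features, axis=0)
    # A single-frame or constant clip gives s == 0, so (features - m) / s
    # becomes 0 / 0 = NaN and triggers the RuntimeWarning. Clamp s instead.
    s = np.maximum(s, eps)
    return (features - m) / s


# A one-frame "utterance": the unguarded version would return NaN here.
short_clip = np.ones((1, 4))
out = normalize_features(short_clip)
```

Clamping only hides the symptom; dropping such clips from the dataset is likely still the better fix.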

darraghdog commented 5 years ago

Thanks, I am rerunning this now. Bug 1 has not reappeared yet. Bug 2 appeared after I removed the smallest WAV files, so I am now removing any files whose transcription text has fewer than 3 characters and running again. There were a few single-letter transcriptions, which I imagine were the problem. Also, I assumed mixed precision training is not possible with the V100, so I have it turned off.
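For reference, the transcript filter described above could be sketched roughly as follows. It assumes a manifest CSV with `wav_filename`, `wav_filesize`, and `transcript` columns; the column names, the sample rows, and the helper name are illustrative assumptions, not taken from the attached config.

```python
import csv
import io


def filter_short_transcripts(rows, min_chars=3):
    """Keep only rows whose transcript has at least `min_chars` characters."""
    return [row for row in rows if len(row["transcript"].strip()) >= min_chars]


# Made-up in-memory manifest standing in for the real training CSV.
sample = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,32044,a\n"
    "b.wav,96044,hello world\n"
)
rows = list(csv.DictReader(sample))
kept = filter_short_transcripts(rows)
```

The surviving rows can then be written back out with `csv.DictWriter` and used as the training manifest.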

borisgin commented 5 years ago

You can use mixed precision with V100 :)
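As a rough sketch, enabling it is a config change along these lines. The `dtype` and `loss_scaling` keys are quoted from memory of the OpenSeq2Seq mixed-precision documentation, so treat them as assumptions and check them against the version you have installed; every other setting in the config stays as it is.

```python
# Fragment of an OpenSeq2Seq-style Python config (assumed key names).
base_params = {
    # ... all other settings unchanged ...
    "dtype": "mixed",           # assumed switch for mixed-precision training
    "loss_scaling": "Backoff",  # assumed name of a dynamic loss-scaling mode
}
```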

darraghdog commented 5 years ago

Appears to be running smoothly now. Made a few changes,

Zipped log and config attached for reference.
Thank you!

dh_jasper_241218a.out.gz dh_jasper_241218a.py.gz