Jasper training from scratch `invalid value encountered in true_divide`

darraghdog commented 5 years ago

Describe the bug

Training the Jasper model from scratch leads to NaN values after roughly 5000 steps. I tried this on two different runs, and both failed at about the same step count. Mixed precision is turned off here.

*** Epoch 0, global step 5210: ***     Train loss: 68.4540
time per step = 0:00:2.114
/opt/OpenSeq2Seq/open_seq2seq/data/speech2text/speech_utils.py:180: RuntimeWarning: invalid value encountered in true_divide
  features = (features - m) / s
*** Epoch 0, global step 5220: ***     Train loss: 202.0522
time per step = 0:00:2.173
*** Epoch 0, global step 5230: ***     Train loss: 150.4963
time per step = 0:00:2.152

LogFile: https://drive.google.com/file/d/1ephybwlzUsmw4fsZCss99RtpQdXeZhmv/view?usp=sharing
Config: https://drive.google.com/file/d/13Hk3BagDqM2f0fI9Kann-zF1GeRDrJtD/view?usp=sharing

System Information

$ nvidia-smi
Sun Dec 23 12:10:54 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   40C    P0    70W / 300W |  15732MiB / 16130MiB |    100%   E. Process |
+-------------------------------+----------------------+----------------------+
.... etc.

$ cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
.... etc.

borisgin commented 5 years ago

Looks like two related issues:
1) The NaN in the summary histogram can be caused by a high learning rate. Can you try reducing it by 4x, please?
2) The bug in speech pre-processing is probably related to a very short speech sequence in your dataset:
/opt/OpenSeq2Seq/open_seq2seq/data/speech2text/speech_utils.py:180: RuntimeWarning: invalid value encountered in true_divide
  features = (features - m) / s

BUG 1:
"tensorflow.python.framework.errors_impl.InvalidArgumentError: Nan in summary histogram for: Loss_Optimization/variables/ForwardPass/w2l_encoder/conv11/bn/gamma_0 [[Node: Loss_Optimization/variables/ForwardPass/w2l_encoder/conv11/bn/gamma_0 = HistogramSummary[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Loss_Optimization/variables/ForwardPass/w2l_encoder/conv11/bn/gamma_0/tag, ForwardPass/w2l_encoder/conv11/bn/gamma/read/_3947)]] [[Node: ForwardPass/w2l_encoder/conv33/res/kernel/read/_3910 = _SendT=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_816_ForwardPass/w2l_encoder/conv33/res/kernel/read", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
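For reference, the `true_divide` warning from the pre-processing step can be reproduced in isolation. The sketch below is not the OpenSeq2Seq code: it assumes `m` and `s` are the per-feature mean and standard deviation taken over the time axis, and `normalize_features` and `eps` are hypothetical names. With a single frame, the standard deviation is exactly zero, so the division becomes 0/0 and yields NaN. Guarding the divisor is one possible workaround.

```python
import numpy as np


def normalize_features(features, eps=1e-5):
    """Mean/std normalization over the time axis with a guarded divisor.

    Hypothetical helper: `features` is assumed to have shape
    (num_frames, num_features). This is a sketch, not the library's code.
    """
    m = features.mean(axis=0)
    s = features.std(axis=0)
    # Replace (near-)zero standard deviations so that 0/0 cannot occur.
    s = np.where(s < eps, 1.0, s)
    return (features - m) / s


# A clip so short that it yields a single feature frame.
one_frame = np.array([[0.3, -1.2, 2.0]])

# Unguarded version: the numerator and the std are both zero, so every
# entry is 0/0. The warning is silenced here only to keep the output clean.
with np.errstate(invalid="ignore", divide="ignore"):
    unguarded = (one_frame - one_frame.mean(axis=0)) / one_frame.std(axis=0)

guarded = normalize_features(one_frame)

print(np.isnan(unguarded).all())  # every unguarded entry is NaN
print(guarded)                    # guarded entries are finite zeros
```

Whether such a clip should be normalized at all or simply dropped from the dataset is a separate choice; the guard only keeps NaN values out of the batch.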

darraghdog commented 5 years ago

Thanks, I am rerunning this now. Bug 1 has not reappeared yet. Bug 2 appeared even after I removed the smallest wav files, so I am now removing any files whose transcription text has fewer than 3 characters and running again. There were a few single-letter transcriptions, which I imagine were the problem. Also, I am assuming mixed precision training is not possible with the V100, so I have it turned off.
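A minimal sketch of the transcript-length filter described above. It assumes the dataset is a CSV manifest with `wav_filename`, `wav_filesize`, and `transcript` columns; the sample rows, `filter_rows`, and `MIN_TRANSCRIPT_CHARS` are made up for illustration and are not taken from the actual dataset.

```python
import csv
import io

# Threshold from the comment above: drop transcripts under 3 characters.
MIN_TRANSCRIPT_CHARS = 3


def filter_rows(rows, min_chars=MIN_TRANSCRIPT_CHARS):
    """Keep only rows whose stripped transcript has at least min_chars characters."""
    return [r for r in rows if len(r["transcript"].strip()) >= min_chars]


# Made-up manifest contents standing in for the real CSV file.
sample_csv = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,32044,hello world\n"
    "b.wav,1644,a\n"
    "c.wav,48044,no\n"
    "d.wav,64044,yes\n"
)

rows = list(csv.DictReader(sample_csv))
kept = filter_rows(rows)

print([r["wav_filename"] for r in kept])
```

With a real manifest, the same function could be applied to rows read from the file, and the kept rows written back out with `csv.DictWriter`.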

borisgin commented 5 years ago

You can use mixed precision with V100 :)

darraghdog commented 5 years ago

Appears to be running smoothly now. I made a few changes:

dropped LR from .05 to .01
removed transcriptions with under 3 characters
switched to mixed precision
dropped batch size from 64 to 32 (not sure whether the larger batches were causing problems)
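As a rough illustration, the config-side changes could look like the sketch below. The key names (`batch_size_per_gpu`, `dtype`, `lr_policy_params`) are assumed to match a typical OpenSeq2Seq Python config, and `original_params` is a hypothetical stand-in rather than the attached config; the string `"float32"` stands in for the TensorFlow dtype a real config would use.

```python
import copy

# Hypothetical stand-in for the relevant part of the training config.
# The values mirror the settings mentioned in this thread.
original_params = {
    "batch_size_per_gpu": 64,
    "dtype": "float32",  # placeholder for the non-mixed dtype setting
    "lr_policy_params": {"learning_rate": 0.05},
}


def apply_stability_changes(params):
    """Return a copy of params with the three config changes from this thread."""
    new_params = copy.deepcopy(params)
    new_params["lr_policy_params"]["learning_rate"] = 0.01  # was 0.05
    new_params["dtype"] = "mixed"                           # mixed precision on
    new_params["batch_size_per_gpu"] = 32                   # was 64
    return new_params


updated_params = apply_stability_changes(original_params)
print(updated_params)
```

The transcript filtering is a dataset change rather than a config change, so it is not part of this dictionary.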

Zipped log and config attached for reference.
Thank you!

dh_jasper_241218a.out.gz dh_jasper_241218a.py.gz

NVIDIA / OpenSeq2Seq

Jasper training from scratch `invalid value encountered in true_divide` #325