flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Error rate explosion + Loss is NaN value on full LibriSpeech train #237

Closed mrjj closed 3 years ago

mrjj commented 5 years ago

I'm experiencing a crash when training on the full LibriSpeech set, with the signature shown below, together with an error rate explosion.

terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
terminate called after throwing an instance of 'std::runtime_error'
  what():  Loss is NaN value
  what():  Loss is NaN value
  what():  Loss is NaN value
terminate called after throwing an instance of 'std::runtime_error'
  what():  Loss is NaN value
*** Aborted at 1552556820 (unix time) try "date -d @1552556820" if you are using GNU date ***
*** Aborted at 1552556820 (unix time) try "date -d @1552556820" if you are using GNU date ***
PC: @     0x7f22b1648428 gsignal
*** SIGABRT (@0xa) received by PID 10 (TID 0x7f230dd6b780) from PID 10; stack trace: ***
PC: @     0x7f4655809428 gsignal
*** SIGABRT (@0x8) received by PID 8 (TID 0x7f46b1f2c780) from PID 8; stack trace: ***
*** Aborted at 1552556820 (unix time) try "date -d @1552556820" if you are using GNU date ***
PC: @     0x7f3c3713c428 gsignal

The error rate goes out of sane bounds before the crash:

...
I0314 00:23:51.337224     8 Train.cpp:279] epoch:        1 | lr: 0.100000 | lrcriterion: 0.001000 | runtime: 00:01:48 | bch(ms): 3383.80 | smp(ms): 0.47 | fwd(ms): 836.97 | crit-fwd(ms): 145.65 | bwd(ms): 24$3.71 | optim(ms): 36.89 | loss: 1840.01942 | train-TER: 89.08 | data/dev-other-TER: 98.18 | data/dev-clean-TER: 98.31 | avg-isz: 1273 | avg-tsz: 208 | max-tsz: 296 | hrs:    1.81 | thrpt(sec/sec): 60.22
I0314 00:27:28.415665     8 Train.cpp:279] epoch:        1 | lr: 0.100000 | lrcriterion: 0.001000 | runtime: 00:01:43 | bch(ms): 3246.98 | smp(ms): 0.46 | fwd(ms): 794.36 | crit-fwd(ms): 129.04 | bwd(ms): 23$1.11 | optim(ms): 36.85 | loss: 1608.08552 | train-TER: 89.28 | data/dev-other-TER: 97.60 | data/dev-clean-TER: 97.81 | avg-isz: 1195 | avg-tsz: 201 | max-tsz: 288 | hrs:    1.70 | thrpt(sec/sec): 58.92
I0314 00:31:08.952953     8 Train.cpp:279] epoch:        1 | lr: 0.100000 | lrcriterion: 0.001000 | runtime: 00:01:47 | bch(ms): 3351.19 | smp(ms): 0.48 | fwd(ms): 827.74 | crit-fwd(ms): 142.27 | bwd(ms): 24$0.72 | optim(ms): 36.86 | loss: 1443.55908 | train-TER: 88.76 | data/dev-other-TER: 98.65 | data/dev-clean-TER: 98.68 | avg-isz: 1258 | avg-tsz: 211 | max-tsz: 298 | hrs:    1.79 | thrpt(sec/sec): 60.08
I0314 00:31:59.934522     8 Train.cpp:581] Finished LinSeg
I0314 00:31:59.935485     8 Train.cpp:481] Shuffling trainset
I0314 00:32:00.222002     8 Train.cpp:488] Epoch 2 started!
I0314 00:35:16.039367     8 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.001000 | runtime: 00:02:13 | bch(ms): 3185.40 | smp(ms): 8.93 | fwd(ms): 779.28 | crit-fwd(ms): 91.24 | bwd(ms): 233$.29 | optim(ms): 40.63 | loss: 2347.57987 | train-TER: 88.87 | data/dev-other-TER: 98.93 | data/dev-clean-TER: 98.91 | avg-isz: 1262 | avg-tsz: 212 | max-tsz: 295 | hrs:    2.36 | thrpt(sec/sec): 63.39
I0314 00:38:48.047472     8 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.001000 | runtime: 00:01:38 | bch(ms): 3087.08 | smp(ms): 0.44 | fwd(ms): 704.64 | crit-fwd(ms): 37.42 | bwd(ms): 232$.15 | optim(ms): 36.97 | loss: 1812.51870 | train-TER: 87.76 | data/dev-other-TER: 97.76 | data/dev-clean-TER: 97.99 | avg-isz: 1194 | avg-tsz: 201 | max-tsz: 288 | hrs:    1.70 | thrpt(sec/sec): 61.90
I0314 00:42:21.494837     8 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.001000 | runtime: 00:01:40 | bch(ms): 3133.05 | smp(ms): 0.43 | fwd(ms): 708.94 | crit-fwd(ms): 38.62 | bwd(ms): 236$.59 | optim(ms): 36.90 | loss: 1199.48142 | train-TER: 86.92 | data/dev-other-TER: 97.64 | data/dev-clean-TER: 97.88 | avg-isz: 1207 | avg-tsz: 205 | max-tsz: 300 | hrs:    1.72 | thrpt(sec/sec): 61.69

...

I0314 09:35:30.338338     8 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.001000 | runtime: 00:01:45 | bch(ms): 3303.15 | smp(ms): 0.48 | fwd(ms): 731.01 | crit-fwd(ms): 34.45 | bwd(ms): 2509.17 | optim(ms): 36.95 | loss: 788709507072.00000 | train-TER: 99.99 | data/dev-other-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1294 | avg-tsz: 215 | max-tsz: 297 | hrs:    1.84 | thrpt(sec/sec): 62.70
I0314 09:39:07.508718     8 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.001000 | runtime: 00:01:43 | bch(ms): 3228.92 | smp(ms): 0.45 | fwd(ms): 711.75 | crit-fwd(ms): 33.16 | bwd(ms): 2455.49 | optim(ms): 36.92 | loss: 6209667072.00000 | train-TER: 99.98 | data/dev-other-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1238 | avg-tsz: 208 | max-tsz: 298 | hrs:    1.76 | thrpt(sec/sec): 61.38
I0314 09:42:48.864497     8 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.001000 | runtime: 00:01:47 | bch(ms): 3359.63 | smp(ms): 0.51 | fwd(ms): 742.42 | crit-fwd(ms): 36.56 | bwd(ms): 2553.50 | optim(ms): 36.88 | loss: 83114328064.00000 | train-TER: 99.99 | data/dev-other-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1323 | avg-tsz: 223 | max-tsz: 296 | hrs:    1.88 | thrpt(sec/sec): 63.05
I0314 09:46:22.526621     8 Train.cpp:279] epoch:        2 | lr: 0.100000 | lrcriterion: 0.001000 | runtime: 00:01:40 | bch(ms): 3137.23 | smp(ms): 0.44 | fwd(ms): 692.19 | crit-fwd(ms): 30.39 | bwd(ms): 238$.46 | optim(ms): 36.89 | loss: 3072675807232.00000 | train-TER: 100.00 | data/dev-other-TER: 100.00 | data/dev-clean-TER: 100.00 | avg-isz: 1186 | avg-tsz: 199 | max-tsz: 297 | hrs:    1.69 | thrpt(sec/sec): 60.53
terminate called after throwing an instance of 'std::runtime_error'
...

System summary:

$ docker -v ; nvidia-container-runtime -v ; cat /etc/redhat-release ; rpm -qf /etc/redhat-release ; yum list installed | grep "nvidia\|cuda"
Docker version 18.09.3, build 774a1f4
runc version 1.0.0-rc6+dev
commit: 12f6a991201fdb8f82579582d5e00e28fba06d0a-dirty
spec: 1.0.1-dev
CentOS Linux release 7.6.1810 (Core)
centos-release-7-6.1810.2.el7.centos.x86_64
cuda-repo-rhel7-9-2-local.x86_64     9.2.148-1                      installed
libnvidia-container-tools.x86_64     1.0.1-1                        @libnvidia-container
libnvidia-container1.x86_64          1.0.1-1                        @libnvidia-container
nvidia-container-runtime.x86_64      2.0.0-1.docker18.09.3          @nvidia-container-runtime
nvidia-container-runtime-hook.x86_64 1.4.0-2                        @nvidia-container-runtime
nvidia-docker2.noarch                2.0.3-1.docker18.09.3.ce       @nvidia-docker

nvidia-smi (no training currently running, and during training it seems healthy)

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 396.37                 Driver Version: 396.37                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX TIT...  Off  | 00000000:04:00.0 Off |                  N/A |
| 22%   33C    P0    69W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX TIT...  Off  | 00000000:08:00.0 Off |                  N/A |
| 22%   36C    P0    70W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  GeForce GTX TIT...  Off  | 00000000:84:00.0 Off |                  N/A |
| 22%   35C    P0    68W / 250W |      0MiB / 12212MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  GeForce GTX TIT...  Off  | 00000000:88:00.0 Off |                  N/A |
| 22%   34C    P0    64W / 250W |      0MiB / 12212MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The Docker image is the one released 6 weeks ago:

$ docker images | grep wav2letter
wav2letter/wav2letter                        cuda-latest                    408735bacff3        6 weeks ago         8.86GB

I see you've since published an update (presumably because the Intel apt repo finally recovered), and I plan to try one more time with the new cuda-6369490 image.

Before that I tried changing many things in the config and environment, but on the second (and sometimes third) epoch of LibriSpeech training the model becomes unstable after a while. I also had no luck resuming training from the start of the second epoch; it goes unstable immediately.

The lr/lrcrit values I've tried are 0.6/0.006, the same halved to 0.3/0.003 (the hardware setup has 4 devices), and the same reduced 10x and 100x, all with the same result.
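For reference, the overrides were passed roughly like this (a sketch only: the Train binary location and train.cfg are placeholders, and the --lr/--lrcrit flag names are assumed from the options logged above):

# Sketch: rerun training with reduced learning rates on top of the base flags file.
# ./Train and train.cfg stand in for the actual binary path and configuration.
./Train train --flagsfile=train.cfg --lr=0.3 --lrcrit=0.003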

I also made several runs while monitoring with nvidia-smi (6 and 3 measurements per minute) and found no sign of running out of memory.
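The monitoring was nothing more elaborate than a periodic nvidia-smi query, roughly like this (interval and output file are illustrative):

# Record GPU memory and utilization every 10 seconds while training runs.
nvidia-smi --query-gpu=timestamp,index,memory.used,memory.total,utilization.gpu \
  --format=csv -l 10 >> gpu_usage.csv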

Running the containers with host memory addressing gives the same result. The containers use the Docker Hub image because Intel's apt repo broke recently, so I currently can't rebuild my own image while MKL delivery is blocked.

There is no exact failure point; I plan to try to make it reproducible by removing training-sample shuffling, and to try other hardware as well.

I didn't save the log, but I remember that both the CPU and CUDA backends finished successfully on the tutorial's clean LibriSpeech dataset.

Here are snapshots of the latest failed run: its configuration, logs, models, and kernel dumps after the crash: https://drive.google.com/drive/folders/1e8VaIpFEvqWiLdGeaqLTBEb52BkwK-me?usp=sharing

I can provide more snapshots, logs, and dumps if that helps the investigation.

This issue has a fingerprint close to the following ones: https://github.com/facebookresearch/wav2letter/issues/127 https://github.com/facebookresearch/wav2letter/issues/128

And it does not seem to be related to this one: https://github.com/facebookresearch/wav2letter/issues/153

mrjj commented 5 years ago

The problem did not go away with the new container version; last successful step:

I0316 13:39:32.846099     8 Train.cpp:292] epoch:        2 | lr: 0.300000 | lrcriterion: 0.003000 | runtime: 00:01:41 | bch(ms): 3182.75 | smp(ms): 0.45 | fwd(ms): 670.18 | crit-fwd(ms): 24.00 | bwd(ms): 2450.74 | optim(ms): 36.94 | loss: 506908.10449 | train-TER: 283.04 | data/dev-other-TER: 98.34 | data/dev-clean-TER: 97.21 | avg-isz: 1149 | avg-tsz: 196 | max-tsz: 292 | hrs:    1.63 | thrpt(sec/sec): 57.76

The loss went out of sane bounds and started growing at the end of epoch 1.

mrjj commented 5 years ago

I'm now fairly sure the root cause is inside the training process itself; the NaN crash is just the unhandled consequence of the loss blowing up.

Still trying to resolve this. I'm quite surprised by the CTC criterion, which I expected to be straightforward, without unpredictable oscillations in the training process. It is also giving me quite odd TER behaviour (see both runs); this skewed convergence seems strange to me. I suspected OpenMPI, but trying all stable 1.x.x versions changed nothing (apart from a persistent bug with Docker overlay2 mounts, which is unrelated).

NVIDIA's OpenSeq2Seq, which relies on CTC only (https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition/wave2letter.html), works fine in practice and is not particularly sensitive to learning rates.

I've checksummed the LibriSpeech dataset I'm using and it seems fine, and I also re-checked the audio-import library for side effects. So for now I've ruled out corrupted data samples as a cause that could destabilize the model without an immediate crash.
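The checksum pass was a plain file-level comparison against a known-good copy, roughly as below (paths and the reference manifest are placeholders):

# Hash every audio file in the prepared dataset and compare against a manifest
# generated the same way from a known-good copy (paths are placeholders).
find /data/librispeech -type f -name '*.flac' -print0 | sort -z \
  | xargs -0 md5sum > librispeech.md5
diff librispeech.md5 librispeech-reference.md5 && echo "dataset OK"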

Now I'm brute-forcing the Cartesian product of all the MKL, stable ArrayFire, and CUDA versions that could possibly work together. It affects the results, but I still get either the criterion explosion with the NaN crash right after it, or a stale model hovering around 99-100 TER for several epochs that then may or may not explode. The best idea I have is a hyperparameter search against your vanilla criterion: the criterion scale in the 10^1 to 10^-2 range, both learning rates around 10^-1 to 10^-4, and the batching parameters (a sketch of such a sweep follows). But I'm not sure any positive result would hold up across different datasets, and I simply don't have enough hardware for that kind of experiment.
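Such a sweep would look roughly like the loop below (purely illustrative: the binary, flags file, value grid, and --rundir naming are all placeholders):

# Hypothetical coarse grid over both learning rates; each run writes to its own
# run directory so the training logs can be compared afterwards.
for lr in 0.1 0.01 0.001 0.0001; do
  for lrcrit in 0.01 0.001 0.0001; do
    ./Train train --flagsfile=train.cfg \
      --lr="$lr" --lrcrit="$lrcrit" \
      --rundir="runs/lr_${lr}_lrcrit_${lrcrit}" || true
  done
done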

The small tutorial LibriSpeech subset trains fine, which tells me the setup generally works. But even on clean LibriSpeech it fails.

The last thing I plan to try is NVIDIA's trick of synthesizing datasets dynamically, so that instead of searching the criterion hyperparameter space I can search over dataset volume alone: start from a small set, find the volume threshold at which the model goes unstable, then adjust hyperparameters around that bifurcation point to understand the local parameter landscape and at what point the model goes unstable again, getting more information about the slope of the model parameters.

I0321 11:22:59.564994     8 Train.cpp:292] epoch:        2 | lr: 0.300000 | lrcriterion: 0.003000 | runtime: 00:01:34 | bch(ms): 2960.28 | smp(ms): 0.37 | fwd(ms): 715.67 | crit-fwd(ms): 33.77 | bwd(ms): 2214.23 | optim(ms): 27.38 | loss:        inf | train-TER: 244.40 | data/dev-clean-TER: 89.55 | avg-isz: 1239 | avg-tsz: 211 | max-tsz: 291 | hrs:    1.76 | thrpt(sec/sec): 66.98
I0321 11:25:32.822854     8 Train.cpp:292] epoch:        2 | lr: 0.300000 | lrcriterion: 0.003000 | runtime: 00:01:37 | bch(ms): 3043.15 | smp(ms): 0.38 | fwd(ms): 735.99 | crit-fwd(ms): 35.59 | bwd(ms): 2276.39 | optim(ms): 27.40 | loss:        inf | train-TER: 259.31 | data/dev-clean-TER: 88.26 | avg-isz: 1298 | avg-tsz: 218 | max-tsz: 297 | hrs:    1.85 | thrpt(sec/sec): 68.26
I0321 11:28:04.120324     8 Train.cpp:292] epoch:        2 | lr: 0.300000 | lrcriterion: 0.003000 | runtime: 00:01:35 | bch(ms): 2983.67 | smp(ms): 0.36 | fwd(ms): 721.02 | crit-fwd(ms): 33.59 | bwd(ms): 2231.52 | optim(ms): 27.45 | loss:        inf | train-TER: 305.40 | data/dev-clean-TER: 87.72 | avg-isz: 1256 | avg-tsz: 213 | max-tsz: 311 | hrs:    1.79 | thrpt(sec/sec): 67.40
I0321 11:30:34.026384     8 Train.cpp:292] epoch:        2 | lr: 0.300000 | lrcriterion: 0.003000 | runtime: 00:01:34 | bch(ms): 2941.95 | smp(ms): 0.36 | fwd(ms): 707.78 | crit-fwd(ms): 33.03 | bwd(ms): 2203.26 | optim(ms): 27.39 | loss:        inf | train-TER: 308.82 | data/dev-clean-TER: 87.16 | avg-isz: 1220 | avg-tsz: 210 | max-tsz: 299 | hrs:    1.74 | thrpt(sec/sec): 66.40
I0321 11:33:05.637965     8 Train.cpp:292] epoch:        2 | lr: 0.300000 | lrcriterion: 0.003000 | runtime: 00:01:35 | bch(ms): 2996.50 | smp(ms): 0.37 | fwd(ms): 721.74 | crit-fwd(ms): 32.58 | bwd(ms): 2243.51 | optim(ms): 27.39 | loss:        inf | train-TER: 337.44 | data/dev-clean-TER: 87.09 | avg-isz: 1265 | avg-tsz: 212 | max-tsz: 290 | hrs:    1.80 | thrpt(sec/sec): 67.56
I0321 11:35:42.221000     8 Train.cpp:292] epoch:        2 | lr: 0.300000 | lrcriterion: 0.003000 | runtime: 00:01:40 | bch(ms): 3149.99 | smp(ms): 0.39 | fwd(ms): 763.84 | crit-fwd(ms): 38.10 | bwd(ms): 2354.23 | optim(ms): 27.45 | loss:        inf | train-TER: 334.83 | data/dev-clean-TER: 86.60 | avg-isz: 1383 | avg-tsz: 234 | max-tsz: 309 | hrs:    1.97 | thrpt(sec/sec): 70.27
lunixbochs commented 5 years ago

It took me around 8 tries and some luck to train librispeech.

Leave your lr on 0.6 and lrcriterion on 0.006, set your reportiters to something much higher like 500 (like 20 minutes instead of 1 minute), and restart anytime it hits epoch 2 without improving. Once I had a good epoch 2 librispeech model, I was able to quickly train from there on many machines. I assume there's too much chance involved in successfully getting over the initial training hurdle with these parameters and this dataset?
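In flags-file terms that advice boils down to roughly this (a sketch, assuming the gflags-style file passed via --flagsfile; only these three values change):

# Sketch: the settings suggested above, written as gflags lines for train.cfg
# (flag names assumed from the wav2letter++ options).
cat >> train.cfg <<'EOF'
--lr=0.6
--lrcrit=0.006
--reportiters=500
EOF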

I didn't have any problems like this with the Common Voice dataset, though the final model was much worse (I assume because the training set was so small). I still had problems with librispeech when training on just a subset, e.g. train-clean-100.

If you want I can also send you my starter epoch2 model and you can work from there.

lucgeo commented 5 years ago

Hi,

@lunixbochs: Can you please share your epoch 2 model? I tried training multiple times but it didn't converge; I would really appreciate it if you could share it.

lunixbochs commented 5 years ago

I uploaded some of my models, including an epoch2 librispeech model: https://talonvoice.com/research/ Please consider donating to support my work if you find this useful.

If someone ends up with a pure librispeech model under TER-clean 2.64, please send it my way. I know the paper claims TER-clean 2.3 with this architecture.

andresy commented 5 years ago

@mrjj In your 001_config file you are using ASG and not CTC. Could you share the command line (or train.cfg) file you are using, to make sure I'm not missing anything?

Also, we recently blacklisted some cuDNN algorithm which was leading to divergence on our hardware - did you check with a recent version of flashlight?

andresy commented 5 years ago

Another issue we found was related to GLU - ArrayFire 3.6.2 or this flashlight commit fixes it.
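For anyone rebuilding from source, pinning ArrayFire to that release looks roughly like this (a sketch; the CMake options depend on the local CUDA setup):

# Check out the ArrayFire 3.6.2 release and rebuild before recompiling
# flashlight/wav2letter against it (build options are setup-dependent).
git clone --recursive --branch v3.6.2 https://github.com/arrayfire/arrayfire.git
cd arrayfire && mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j"$(nproc)" && sudo make install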

mrjj commented 5 years ago

@mrjj In your 001_config file you are using ASG and not CTC. Could you share the command line (or train.cfg) file you are using, to make sure I'm not missing anything?

Also, we recently blacklisted some cuDNN algorithm which was leading to divergence on our hardware - did you check with a recent version of flashlight?

Yes, the run that does not converge and ends with the unhandled loss explosion uses ASG, as in the provided config and the bug title.

In the next comment I describe my experience looking for a workaround with CTC, which also did not converge. The CTC config is not provided; I'll try to find it in the logs for that run.

The CUDA-backend problem you resolved in ArrayFire above looks like a possible explanation for my case: when I adjust the configuration, changing only lr according to the difference between your reference setup and my hardware capacity and leaving everything else intact, I get dramatically different results. That looks to me like a difference at the GPU architecture + driver level.

I assume there's too much chance involved in successfully getting over the initial training hurdle with these parameters and this dataset?

Thank you for sharing the solution you found. I suppose that, given FB's common practice of tuning online systems with Bayesian and other forms of hyperparameter optimization, it is quite acceptable to have systems that are highly sensitive to this kind of parameter change - they will be auto-tuned before going GA anyway.

For my example run I used frequent reporting to pinpoint where the model goes unstable, since the process is not instantaneous. An "everything is fine ... crash" behaviour could have many more possible causes.

I've also noticed that the reportiters value affects how training goes when it is kept higher, and that's another place where I suspect some bug(s).

The major issue for me is that LibriSpeech is a good but quite small and limited dataset. It's fine for synthetic training or training focused on stop-word detection, but general transcription solutions require significantly larger datasets. Feeding datasets roughly 10 times larger (Tacotron2 synthesis + noise mixing + reverb augmentation) has so far given me little hope of convergence.

I'll try the new revision with the patched ArrayFire. Thank you!

rajeevbaalwan commented 4 years ago

Hi @mrjj, I'm also getting the same issue when training on the 100-hour subset of LibriSpeech using CTC loss. Were you able to train successfully and get results? If so, how did you arrive at the params like lr/lrcrit that worked for you?

tlikhomanenko commented 3 years ago

Please have a look at the SOTA models and the new codebase in flashlight. Closing as this is an old issue.

BernardoOlisan commented 2 years ago

@rajeevbaalwan did you solve it? I'm having the same problem: I'm using the 100-hour LibriSpeech subset and getting loss: NaN, but with another dataset I don't get it. :/