flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Loss exploded after epoch 84 #220

Closed: keshawnhsieh closed this issue 5 years ago

keshawnhsieh commented 5 years ago

I was running the training demo provided in the tutorial, which trains on train-clean-100 and evaluates on dev-clean and test-clean. I ran it on a single Titan Xp card and found that the loss exploded to inf at epoch 84, with train-TER jumping from about 2% to over 50%. What could be causing this?

epoch:       82 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:04:26 | bch(ms): 37.37 | smp(ms): 0.21 | fwd(ms): 29.35 | crit-fwd(ms): 26.63 | bwd(ms): 6.37 | optim(ms): 0.81 | loss:    0.78993 | train-TER:  2.19 | data/dev-clean-TER: 18.55 | avg-isz: 1267 | avg-tsz: 213 | max-tsz: 400 | hrs:  100.47 | thrpt(sec/sec): 1356.32
epoch:       83 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:04:25 | bch(ms): 37.25 | smp(ms): 0.22 | fwd(ms): 29.20 | crit-fwd(ms): 26.49 | bwd(ms): 6.38 | optim(ms): 0.81 | loss:    0.87373 | train-TER:  2.31 | data/dev-clean-TER: 18.87 | avg-isz: 1267 | avg-tsz: 213 | max-tsz: 400 | hrs:  100.47 | thrpt(sec/sec): 1360.96
epoch:       84 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:04:02 | bch(ms): 33.93 | smp(ms): 0.21 | fwd(ms): 25.96 | crit-fwd(ms): 23.24 | bwd(ms): 6.38 | optim(ms): 0.82 | loss:        inf | train-TER: 52.35 | data/dev-clean-TER: 70.78 | avg-isz: 1267 | avg-tsz: 213 | max-tsz: 400 | hrs:  100.47 | thrpt(sec/sec): 1494.12
epoch:       85 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:04:07 | bch(ms): 34.69 | smp(ms): 0.21 | fwd(ms): 26.75 | crit-fwd(ms): 24.03 | bwd(ms): 6.38 | optim(ms): 0.82 | loss:        inf | train-TER: 58.40 | data/dev-clean-TER: 19.24 | avg-isz: 1267 | avg-tsz: 213 | max-tsz: 400 | hrs:  100.47 | thrpt(sec/sec): 1461.48
epoch:       86 | lr: 0.100000 | lrcriterion: 0.000000 | runtime: 00:04:08 | bch(ms): 34.78 | smp(ms): 0.22 | fwd(ms): 26.80 | crit-fwd(ms): 24.09 | bwd(ms): 6.38 | optim(ms): 0.82 | loss:        inf | train-TER: 34.92 | data/dev-clean-TER: 88.51 | avg-isz: 1267 | avg-tsz: 213 | max-tsz: 400 | hrs:  100.47 | thrpt(sec/sec): 1457.62

AdrienDuff commented 5 years ago

This looks similar to https://github.com/facebookresearch/wav2letter/issues/168. The provided flags are tuned for 8 GPUs and you are using only 1, so divide your lr by 8 and it should work.
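
For reference, the arithmetic behind that suggestion can be sketched as below. This is only an illustration of scaling the learning rate in proportion to the GPU count; the helper `scale_lr` and its parameter names are hypothetical and not part of wav2letter, and the starting value 0.1 is taken from the `lr` column of the log above.

```python
def scale_lr(reference_lr: float, reference_gpus: int, actual_gpus: int) -> float:
    """Scale a learning rate linearly with the number of GPUs.

    Hypothetical helper for illustration; not a wav2letter API.
    """
    if reference_gpus <= 0 or actual_gpus <= 0:
        raise ValueError("GPU counts must be positive")
    return reference_lr * actual_gpus / reference_gpus


# lr of 0.1 from the log, assumed to be tuned for 8 GPUs, now run on 1 GPU
new_lr = scale_lr(0.1, reference_gpus=8, actual_gpus=1)
print(f"{new_lr:.4f}")  # prints 0.0125
```

So on a single card the run would use an lr of roughly 0.0125 instead of 0.1. Whether that alone is enough to keep the loss finite is something to confirm by rerunning the training.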