flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

not getting enough throughput in training #341

Closed: SY-nc closed this issue 5 years ago

SY-nc commented 5 years ago

I have two neural network architectures: one from the tutorials directory, and a second one that I made by removing some layers from the one in recipes/librispeech/configs/conv_glu. The second one looks like this (a short note on the arch-line format follows the listing):

V -1 1 NFEAT 0
WN 3 C NFEAT 400 13 1 170
GLU 2
DO 0.2
WN 3 C 200 440 14 1 0
GLU 2
DO 0.214
WN 3 C 220 484 15 1 0
GLU 2
DO 0.22898
WN 3 C 242 532 16 1 0
GLU 2
DO 0.2450086
WN 3 C 266 584 17 1 0
GLU 2
DO 0.262159202
WN 3 C 292 642 18 1 0
GLU 2
DO 0.28051034614
WN 3 C 321 706 19 1 0
GLU 2
DO 0.30014607037
WN 3 C 353 776 20 1 0
GLU 2
DO 0.321156295296
WN 3 C 388 852 21 1 0
GLU 2
DO 0.343637235966
RO 2 0 3 1
WN 0 L 426 852
GLU 0
DO 0.343637235966
WN 0 L 426 NLABEL
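
For context, my reading of the arch-line format (based on the wav2letter architecture docs; the annotations below are my own interpretation and may not be exact) is:

V -1 1 NFEAT 0              # View: reshape the input features (NFEAT = number of input feature bins)
WN 3 C NFEAT 400 13 1 170   # weight-normalized 1-D conv: NFEAT -> 400 channels, kernel 13, stride 1, padding 170
GLU 2                       # gated linear unit, halving 400 -> 200 channels
DO 0.2                      # dropout with p = 0.2
RO 2 0 3 1                  # reorder dimensions before the fully connected part
WN 0 L 426 852              # weight-normalized linear layer: 426 -> 852
WN 0 L 426 NLABEL           # final linear layer mapping to the output tokens (NLABEL)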

I use the same flags file for both of them, with the following config:

--input=wav
--tokens=data/tokens.txt
--criterion=asg
--lr=0.6
--lrcrit=0.006
--linseg=1
#--momentum=0.8
--maxgradnorm=0.2
--replabel=2
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=6
--batchsize=1
--transdiag=4
--iter=2500
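
For reference, I launch both runs the same way, roughly as below (the binary path is from my build and the remaining paths are supplied via the usual flags, so treat this only as a sketch):

# from the wav2letter++ build directory; train.cfg is the flags file above,
# with the architecture and data paths passed through their respective flags
./Train train --flagsfile=train.cfg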

I had to comment out the momentum flag, since setting it to any non-zero value was giving an out-of-memory error (presumably because SGD with momentum keeps an extra buffer per parameter).

To compare these models, I started training on a single audio file.

For the neural network from the recipes, I got the following output at epoch 1176:

I0702 14:30:14.080596  6403 Train.cpp:500] Epoch 1176 started!
I0702 14:30:18.234652  6403 Train.cpp:296] epoch:     1176 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:02 | bch(ms): 2854.00 | smp(ms): 5.32 | fwd(ms): 1294.58 | crit-fwd(ms): 12.62 | bwd(ms): 1502.03 | optim(ms): 46.85 | loss:    4.00269 | train-TER: 16.35 | train-WER: 23.81 | data/dev-clean-loss:    3.65063 | data/dev-clean-TER: 15.38 | data/dev-clean-WER: 19.05 | avg-isz: 407 | avg-tsz: 106 | max-tsz: 106 | hrs:    0.00 | thrpt(sec/sec): 1.43

Even in subsequent epochs, the loss did not decrease any further. For the network from the tutorial, all the losses and error rates had dropped to zero by the 730th epoch:

I0702 14:49:03.621923 21249 Train.cpp:500] Epoch 730 started!
I0702 14:49:03.734987 21249 Train.cpp:296] epoch:      730 | lr: 0.600000 | lrcriterion: 0.006000 | runtime: 00:00:00 | bch(ms): 81.56 | smp(ms): 1.65 | fwd(ms): 34.98 | crit-fwd(ms): 6.77 | bwd(ms): 35.26 | optim(ms): 7.71 | loss:    0.00061 | train-TER:  0.00 | train-WER:  0.00 | data/dev-clean-loss:    0.00024 | data/dev-clean-TER:  0.00 | data/dev-clean-WER:  0.00 | avg-isz: 407 | avg-tsz: 106 | max-tsz: 106 | hrs:    0.00 | thrpt(sec/sec): 49.90

Clearly, the difference in throughput and error rates is quite large.

Since I'm training on a single GPU, I later reduced my lr and lrcrit by a factor of 10.

With lr=0.1 and lrcrit=0.001, the bigger network gave a better loss (23) than the smaller one (31) after 500 epochs.
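
Concretely, the change to the flags file for these runs was just the learning rates (everything else stayed as above):

--lr=0.1
--lrcrit=0.001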

But isn't 500 epochs too many to converge on a single audio file?

Any other suggestions that might help the model converge to zero loss in less time?

I really want to use the bigger network.

SY-nc commented 5 years ago

Playing with the learning rate and maxgradnorm parameters is giving improved results.