flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

High Error rate after training #675

Closed: davidbelle closed this issue 4 years ago

davidbelle commented 4 years ago

Hello. I have a requirement to perform voice-to-text on 8kHz audio and am not getting good enough error rates for it to be useful. I just want to post here what I did, step by step, and hopefully I will get some answers.

I have come to understand that I need to create my own models with 8kHz datasets. I tried out the pre-trained SOTA models and was pretty happy with the results (of course I could only test these using 16kHz audio), so I have gone down the path of creating an 8kHz AM and LM using the SOTA LibriSpeech ResNet CTC recipe.

I inspected prepare_librispeech_wp_and_official_lexicon.py and noticed it was calling data/librispeech/prepare.py, which downloads the datasets. So I ran prepare.py by itself, ran a script to resample the downloaded FLACs to 8kHz in place, modified prepare_librispeech_wp_and_official_lexicon.py to not call prepare.py, and finally ran prepare_librispeech_wp_and_official_lexicon.py.
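(To make that resampling step concrete, here is a minimal sketch of one way to do it, not necessarily the exact script used: it assumes sox with FLAC support is installed, and the audio root path is an assumption based on the config below.)

import subprocess
from pathlib import Path

# Resample every downloaded FLAC to 8 kHz and overwrite it in place.
root = Path("/root/sota8k/librispeech")  # assumed location of the extracted LibriSpeech audio
for flac in root.rglob("*.flac"):
    tmp = flac.with_suffix(".tmp.flac")
    subprocess.run(["sox", str(flac), "-r", "8000", str(tmp)], check=True)
    tmp.replace(flac)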

So now I have the dataset at 8kHz, and all the bits and pieces needed to run training.

/root/wav2letter/build/Train train --flagsfile train.cfg --minloglevel=0 --logtostderr=1

Here's train.cfg:

--runname=am_resnet_ctc_librispeech
--rundir=/root/sota8k/run
--archdir=/root/sota8k
--arch=am_resnet_ctc.arch
--tokensdir=/root/sota8k/models/am
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/root/sota8k/models/am/librispeech-train+dev-unigram-10000-nbest10.lexicon
--train=/root/sota8k/librispeech/lists/train-clean-100.lst,/root/sota8k/librispeech/lists/train-clean-360.lst,/root/sota8k/librispeech/lists/train-other-500.lst
--valid=dev-clean:/root/sota8k/librispeech/lists/dev-clean.lst,dev-other:/root/sota8k/librispeech/lists/dev-other.lst
--criterion=ctc
--mfsc
--labelsmooth=0.05
--wordseparator=_
--usewordpiece=true
--sampletarget=0.01
--lr=0.4
--linseg=0
--momentum=0.6
--maxgradnorm=1
--onorm=target
--sqnorm
--nthread=4
--batchsize=4
--filterbanks=80
--lrcosine
--iter=71683
--minloglevel=0
--mintsz=2
--minisz=200
--reportiters=2000
--logtostderr
--enable_distributed
--samplerate=8000

Initially I had --iter set to 500, as per the original file. When I ran it, the output was:

"Epoch 1 started!
Finished training"

I came across a post mentioning that the meaning of --iter has changed and that it should be set to a much higher number. I don't know how high to go, so I'm guessing this could be the problem. (Here's the mentioned comment.)

Here's the head and tail of my training output once it actually starts the epoch:

I0601 03:15:08.297699  1602 Train.cpp:345] epoch:        1 | nupdates:         2000 | lr: 0.099904 | lrcriterion: 0.000000 | runtime: 00:20:24 | bch(ms): 612.02 | smp(ms): 0.64 | fwd(ms): 157.37 | crit-fwd(ms): 5.64 | bwd(ms): 383.45 | optim(ms): 60.31 | loss:   51.34038 | train-TER: 105.90 | train-WER: 105.25 | dev-clean-loss:   32.97257 | dev-clean-TER: 91.00 | dev-clean-WER: 94.17 | dev-other-loss:   31.06735 | dev-other-TER: 91.01 | dev-other-WER: 95.11 | avg-isz: 1233 | avg-tsz: 044 | max-tsz: 082 | hrs:   27.41 | thrpt(sec/sec): 80.63
Memory Manager Stats
MemoryManager type: CachingMemoryManager
Number of allocated blocks:333
Size of free block pool (small):87
Size of free block pool (large):103
Total native mallocs:219
Total native frees:0
I0601 03:38:00.720160  1602 Train.cpp:345] epoch:        1 | nupdates:         4000 | lr: 0.199232 | lrcriterion: 0.000000 | runtime: 00:20:08 | bch(ms): 604.33 | smp(ms): 0.56 | fwd(ms): 154.70 | crit-fwd(ms): 5.67 | bwd(ms): 379.96 | optim(ms): 59.94 | loss:   43.10687 | train-TER: 95.59 | train-WER: 97.64 | dev-clean-loss:   32.62396 | dev-clean-TER: 93.38 | dev-clean-WER: 97.31 | dev-other-loss:   30.62956 | dev-other-TER: 92.84 | dev-other-WER: 97.19 | avg-isz: 1214 | avg-tsz: 044 | max-tsz: 082 | hrs:   26.98 | thrpt(sec/sec): 80.36
Memory Manager Stats
MemoryManager type: CachingMemoryManager
Number of allocated blocks:333
Size of free block pool (small):87
Size of free block pool (large):103
Total native mallocs:220
Total native frees:0
I0601 04:00:26.815016  1602 Train.cpp:345] epoch:        1 | nupdates:         6000 | lr: 0.297411 | lrcriterion: 0.000000 | runtime: 00:19:52 | bch(ms): 596.25 | smp(ms): 0.56 | fwd(ms): 152.12 | crit-fwd(ms): 5.53 | bwd(ms): 374.41 | optim(ms): 59.95 | loss:   42.30737 | train-TER: 95.74 | train-WER: 98.16 | dev-clean-loss:   32.86901 | dev-clean-TER: 90.65 | dev-clean-WER: 95.02 | dev-other-loss:   30.93067 | dev-other-TER: 90.73 | dev-other-WER: 95.68 | avg-isz: 1196 | avg-tsz: 043 | max-tsz: 076 | hrs:   26.58 | thrpt(sec/sec): 80.25
Memory Manager Stats
MemoryManager type: CachingMemoryManager
Number of allocated blocks:333
Size of free block pool (small):82
Size of free block pool (large):105
Total native mallocs:220
Total native frees:0
I0601 04:22:58.060353  1602 Train.cpp:345] epoch:        1 | nupdates:         8000 | lr: 0.393869 | lrcriterion: 0.000000 | runtime: 00:19:57 | bch(ms): 598.55 | smp(ms): 0.56 | fwd(ms): 153.31 | crit-fwd(ms): 5.61 | bwd(ms): 375.38 | optim(ms): 59.92 | loss:   43.87696 | train-TER: 89.99 | train-WER: 98.19 | dev-clean-loss:   34.46011 | dev-clean-TER: 67.60 | dev-clean-WER: 97.93 | dev-other-loss:   32.54059 | dev-other-TER: 68.51 | dev-other-WER: 99.93 | avg-isz: 1202 | avg-tsz: 043 | max-tsz: 082 | hrs:   26.72 | thrpt(sec/sec): 80.35
Memory Manager Stats
MemoryManager type: CachingMemoryManager
Number of allocated blocks:333
Size of free block pool (small):88
Size of free block pool (large):105
Total native mallocs:220
Total native frees:0

26 more reports like this, and then:

I0601 14:40:08.950882  1602 Train.cpp:345] epoch:        1 | nupdates:        62000 | lr: 0.084238 | lrcriterion: 0.000000 | runtime: 00:20:49 | bch(ms): 624.90 | smp(ms): 0.58 | fwd(ms): 162.22 | crit-fwd(ms): 5.86 | bwd(ms): 392.29 | optim(ms): 59.77 | loss:   34.71605 | train-TER: 69.47 | train-WER: 86.83 | dev-clean-loss:   23.21709 | dev-clean-TER: 67.36 | dev-clean-WER: 81.63 | dev-other-loss:   22.94107 | dev-other-TER: 70.64 | dev-other-WER: 85.31 | avg-isz: 1264 | avg-tsz: 045 | max-tsz: 092 | hrs:   28.10 | thrpt(sec/sec): 80.94
Memory Manager Stats
MemoryManager type: CachingMemoryManager
Number of allocated blocks:333
Size of free block pool (small):86
Size of free block pool (large):120
Total native mallocs:232
Total native frees:0
I0601 15:03:50.234331  1602 Train.cpp:345] epoch:        1 | nupdates:        64000 | lr: 0.067026 | lrcriterion: 0.000000 | runtime: 00:20:45 | bch(ms): 622.83 | smp(ms): 0.58 | fwd(ms): 162.30 | crit-fwd(ms): 5.84 | bwd(ms): 390.15 | optim(ms): 59.74 | loss:   33.62747 | train-TER: 67.31 | train-WER: 85.01 | dev-clean-loss:   22.21125 | dev-clean-TER: 62.01 | dev-clean-WER: 78.36 | dev-other-loss:   22.16429 | dev-other-TER: 65.60 | dev-other-WER: 82.57 | avg-isz: 1266 | avg-tsz: 045 | max-tsz: 082 | hrs:   28.15 | thrpt(sec/sec): 81.35
Memory Manager Stats
MemoryManager type: CachingMemoryManager
Number of allocated blocks:333
Size of free block pool (small):80
Size of free block pool (large):120
Total native mallocs:232
Total native frees:0
I0601 15:27:33.647449  1602 Train.cpp:345] epoch:        1 | nupdates:        66000 | lr: 0.049684 | lrcriterion: 0.000000 | runtime: 00:20:47 | bch(ms): 623.90 | smp(ms): 0.58 | fwd(ms): 162.44 | crit-fwd(ms): 5.89 | bwd(ms): 391.01 | optim(ms): 59.75 | loss:   33.00060 | train-TER: 66.03 | train-WER: 83.84 | dev-clean-loss:   21.35524 | dev-clean-TER: 61.68 | dev-clean-WER: 77.15 | dev-other-loss:   21.62479 | dev-other-TER: 66.07 | dev-other-WER: 82.25 | avg-isz: 1269 | avg-tsz: 046 | max-tsz: 081 | hrs:   28.22 | thrpt(sec/sec): 81.41
Memory Manager Stats
MemoryManager type: CachingMemoryManager
Number of allocated blocks:333
Size of free block pool (small):84
Size of free block pool (large):125
Total native mallocs:232
Total native frees:0
I0601 15:51:28.527168  1602 Train.cpp:345] epoch:        1 | nupdates:        68000 | lr: 0.032247 | lrcriterion: 0.000000 | runtime: 00:21:00 | bch(ms): 630.12 | smp(ms): 0.59 | fwd(ms): 164.87 | crit-fwd(ms): 5.96 | bwd(ms): 394.72 | optim(ms): 59.75 | loss:   32.45069 | train-TER: 64.68 | train-WER: 82.59 | dev-clean-loss:   20.72777 | dev-clean-TER: 59.30 | dev-clean-WER: 74.96 | dev-other-loss:   21.07410 | dev-other-TER: 63.83 | dev-other-WER: 80.40 | avg-isz: 1283 | avg-tsz: 046 | max-tsz: 079 | hrs:   28.52 | thrpt(sec/sec): 81.46
Memory Manager Stats
MemoryManager type: CachingMemoryManager
Number of allocated blocks:333
Size of free block pool (small):86
Size of free block pool (large):120
Total native mallocs:232
Total native frees:0
I0601 16:15:36.191566  1602 Train.cpp:345] epoch:        1 | nupdates:        70000 | lr: 0.014749 | lrcriterion: 0.000000 | runtime: 00:21:12 | bch(ms): 636.16 | smp(ms): 0.59 | fwd(ms): 166.78 | crit-fwd(ms): 5.98 | bwd(ms): 398.85 | optim(ms): 59.74 | loss:   32.35625 | train-TER: 64.17 | train-WER: 82.00 | dev-clean-loss:   20.49159 | dev-clean-TER: 58.99 | dev-clean-WER: 74.17 | dev-other-loss:   20.93436 | dev-other-TER: 63.63 | dev-other-WER: 79.95 | avg-isz: 1292 | avg-tsz: 046 | max-tsz: 078 | hrs:   28.73 | thrpt(sec/sec): 81.28
Memory Manager Stats
MemoryManager type: CachingMemoryManager
Number of allocated blocks:333
Size of free block pool (small):78
Size of free block pool (large):122
Total native mallocs:232
Total native frees:0
I0601 16:19:05.691792  1602 Train.cpp:566] Shuffling trainset
I0601 16:19:05.697589  1602 Train.cpp:573] Epoch 2 started!
I0601 16:33:10.791930  1602 Train.cpp:748] Finished training

I feel like it shouldn't say epoch 1 the whole time, but I'm not sure. Training ran overnight, about 12 hours total I think; I would have thought it would take longer. From what I've seen of someone else's post, the WER dropped significantly even after the 4th iteration.

Running Test shows the error rate is still very high.

Here's running the test

~/wav2letter/build/Test --flagsfile /root/sota8k/test.cfg --minloglevel=0 --logtostderr=1
--am=/root/sota8k/run/am_resnet_ctc_librispeech/001_model_dev-clean.bin
--tokensdir=/root/sota8k/models/am
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/root/sota8k/models/decoder/decoder-unigram-10000-nbest10.lexicon
--uselexicon=false
--datadir=/root/sota8k/librispeech/lists
--test=test-other.lst
I0602 00:28:06.392462  1869 Test.cpp:318] [Test test-other.lst (2937 samples) in 120.839s (actual decoding time 0.0411s/sample) -- WER: 79.4884, LER: 63.2948]

Here's running decode

~/wav2letter/build/Decoder --flagsfile /root/sota8k/decode.cfg --minloglevel=0 --logtostderr=1
--am=/root/sota8k/run/am_resnet_ctc_librispeech/001_model_dev-clean.bin
--tokensdir=/root/sota8k/models/am
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/root/sota8k/models/decoder/decoder-unigram-10000-nbest10.lexicon
--lm=/root/sota8k/models/decoder/4-gram.arpa
--datadir=/root/sota8k/librispeech/lists
--test=test-clean.lst
--uselexicon=true
--sclite=/root/sota8k/decoder-run
--decodertype=wrd
--lmtype=kenlm
--silscore=0
--beamsize=500
--beamsizetoken=100
--beamthreshold=100
--nthread_decoder=4
--smearing=max
--show
--showletters
--lmweight=0.86994439339913
--wordscore=0.58878028376141

[Decode test-clean.lst (2618 samples) in 941.085s (actual decoding time 1.43s/sample) -- WER: 72.9714, LER: 57.3536]

Running it against my own 8kHz audio gives unfavourable results. So is it the --iter flag? What should I set it to? Or could it be something else?

Thanks

David

tlikhomanenko commented 4 years ago

Hi @davidbelle

One thing we did in recent commits, to support the transformer training pipeline, is adding warmup, whose default value is 8k updates. For non-transformer models this should be set to --warmup=0 (sorry for the misleading behavior here, we will fix it). You can see in your training log that the lr value is increasing, rather than staying at the value you set in the config. The second thing is that we trained with a total batchsize of 128 and all params are optimized for it, so if your total batch is <=32 (possibly even 64) the behavior could be very different.
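(Concretely, the warmup fix is just one extra line in train.cfg:)

--warmup=0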

I feel like it shouldn't say epoch 1 the whole time, but I'm not sure. Training ran overnight, about 12 hours total I think; I would have thought it would take longer. From what I've seen of someone else's post, the WER dropped significantly even after the 4th iteration.

Because you set --reportiters=2000, it really is printing every 2000 updates, and if you train on 1 GPU (I suspect this) with batchsize=4 (the whole train set has ~280k samples), you really will have ~70k updates before finishing 1 epoch.

For --iter you can set value = 500 * 280k / total_batchsize, where total_batchsize = --batchsize * ngpus. So the main problem behind your very high WER is that you need to train longer.
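(To make that concrete, an illustrative calculation: with the reference total_batchsize of 128, --iter = 500 * 280,000 / 128 ≈ 1.1M updates, while 1 GPU with --batchsize=4 gives 280,000 / 4 = 70,000 updates per epoch, which is why the run above hit the --iter=71683 cap shortly after epoch 2 started.)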

From what I've seen of someone else's post, the WER dropped significantly even after the 4th iteration.

Here please check what the total batch size was (and again, params tuned for a total batch of 128 are not appropriate for a batch of 4).

davidbelle commented 4 years ago

@tlikhomanenko Thank you so much!!! Will try that out. It is indeed one GPU! :)

davidbelle commented 4 years ago

@tlikhomanenko Just want to say thanks again. I'm not a data scientist and wav2letter is my first experience with DL/ML. You could probably tell I have no idea haha. I spent a decent amount of time just now reading up on batch sizes. I'm running training now on a box in AWS with 1 Tesla T4 GPU, which has 15 GB of memory, with a batch size of 32; it's not running out of memory, so hopefully it'll stick. I suspect this will take about 3 days to run. Hopefully that sounds about right to you as well?
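(For reference, plugging that setup into the formula above, assuming a single GPU so total_batchsize = 32: --iter = 500 * 280,000 / 32 ≈ 4.4M updates for the full 500-epoch schedule; a shorter run would target proportionally fewer updates.)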

tlikhomanenko commented 4 years ago

We trained the ResNet CTC model to full convergence over about a week with a total batch size of 128. But possibly with batch 32 and 3-4 days you will get not a SOTA result, but a good starting point.