flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Decoding during the training of AM. #703

Open gopesh97 opened 4 years ago

gopesh97 commented 4 years ago

I am training my acoustic model. Here is my configuration file.

--datadir=/home/english_data/
--runname=english_train
--rundir=/home/training/
--tokensdir=/home/am/
--listdata=true
--train=lists/train.lst
--valid=lists/dev.lst
--input=wav
--arch=network.arch
--archdir=/home/
--lexicon=/home/am/librispeech-train+dev-unigram-10000-nbest10.lexicon
--tokens=librispeech-train-all-unigram-10000.tokens
--criterion=seq2seq
--lr=0.05
--lrcrit=0.05
--momentum=0.0
--stepsize=40
--gamma=0.5
--maxgradnorm=15
--mfsc=true
--use_saug=true
--dataorder=output_spiral
--inputbinsize=25
--filterbanks=80
--attention=keyvalue
--encoderdim=512
--attnWindow=softPretrain
--softwstd=4
--trainWithWindow=true
--pretrainWindow=3
--maxdecoderoutputlen=120
--usewordpiece=true
--wordseparator=_
--sampletarget=0.01
--target=ltr
--batchsize=4
--labelsmooth=0.05
--nthread=4
--memstepsize=4194304
--eostoken=true
--pcttraineval=1
--pctteacherforcing=99
--iter=200
--enable_distributed=true

Currently, I am getting this result:

epoch: 58 | lr: 0.025000 | lrcriterion: 0.025000 | runtime: 06:44:50 | bch(ms): 237.77 | smp(ms): 1.02 | fwd(ms): 14.86 | crit-fwd(ms): 1.07 | bwd(ms): 213.77 | optim(ms): 7.74 | loss: 32.41068 | train-LER: 31.00 | train-WER: 47.11 | lists/dev.lst-loss: 16.45153 | lists/dev.lst-LER: 22.34 | lists/dev.lst-WER: 35.14 | avg-isz: 1003 | avg-tsz: 018 | max-tsz: 130 | hrs: 4556.04 | thrpt(sec/sec): 675.23

I wanted to know how dev.lst is being decoded internally during this training. Are you using the greedy path or the beam-search decoder? Also, which parameters are used for that decoding, and among those parameters, which ones are chosen randomly?

tlikhomanenko commented 4 years ago

During training, the Viterbi WER (greedy path) is what is reported in the logs. Decoding with an LM is done separately: the usual practice is to pick the snapshot with the best Viterbi WER and then decode it with some LM. For that decoding step we just randomly sample hyper-parameters (for example, the LM weight and word score) and keep the parameter values that give the best dev-set WER.
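
As an illustration only, such a random search over decoder hyper-parameters might look like the Python sketch below. The decoder path, flagsfile, flag names, the "WER: ..." output format, the sampling ranges, and the number of trials are all placeholder assumptions, not the exact wav2letter interface.

```python
import random
import re
import subprocess

# Placeholder paths -- substitute your actual decoder binary and decode flagsfile.
DECODE_CMD = "/path/to/Decoder"
FLAGSFILE = "/path/to/decode.cfg"

best = {"wer": float("inf"), "lmweight": None, "wordscore": None}

for trial in range(50):
    # Randomly sample decoder hyper-parameters (ranges are arbitrary examples).
    lmweight = random.uniform(0.0, 4.0)
    wordscore = random.uniform(-3.0, 3.0)

    # Run one decoding pass with the sampled values (flag names assumed for illustration).
    out = subprocess.run(
        [DECODE_CMD, f"--flagsfile={FLAGSFILE}",
         f"--lmweight={lmweight}", f"--wordscore={wordscore}"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Assume the decoder prints a line such as "WER: 12.34" that we can parse back.
    match = re.search(r"WER:\s*([\d.]+)", out)
    if match is None:
        continue
    wer = float(match.group(1))

    # Keep the hyper-parameters that give the lowest dev-set WER so far.
    if wer < best["wer"]:
        best = {"wer": wer, "lmweight": lmweight, "wordscore": wordscore}

print("best dev WER {wer:.2f} with lmweight={lmweight:.2f}, "
      "wordscore={wordscore:.2f}".format(**best))
```

The point is only to show the shape of the procedure: sample the decoder hyper-parameters at random, decode the dev set once per sample, and keep the combination with the lowest dev-set WER.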