gopesh97 opened this issue 4 years ago
During training, the Viterbi WER (greedy path) is reported in the logs. Decoding with an LM is done separately: the practice is to pick the snapshot with the best Viterbi WER and then decode it with some LM. For decoding we just randomly sample hyper-parameters (like the LM weight and word score, for example) and choose the values that give the best dev-set WER.
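For what it's worth, that random search can be scripted along the following lines. This is a minimal sketch only: `decode_dev_wer` is a hypothetical stand-in for an actual beam-search decoding run over the dev set, and the sampling ranges are assumptions, not values anyone here has recommended.

```python
import random

random.seed(0)

def decode_dev_wer(lmweight, wordscore):
    """Hypothetical stand-in for a real decoding run. In practice this would
    invoke the beam-search decoder on dev.lst with the given hyper-parameters
    and parse the reported WER; here it returns a fake smooth surface so the
    search loop is runnable as-is."""
    return 35.0 + abs(lmweight - 2.0) + abs(wordscore - 1.0)

best_wer, best_params = float("inf"), None
for _ in range(64):  # the number of trials is arbitrary
    # Sampling ranges are illustrative assumptions.
    lmweight = random.uniform(0.0, 4.0)
    wordscore = random.uniform(-3.0, 3.0)
    wer = decode_dev_wer(lmweight, wordscore)
    if wer < best_wer:
        best_wer, best_params = wer, (lmweight, wordscore)

print("best dev WER %.2f at lmweight=%.2f, wordscore=%.2f"
      % (best_wer, *best_params))
```

Since every trial is a full decoding pass over the dev set, the number of trials is in practice bounded by decoding cost rather than by the search strategy.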
I am training my acoustic model. Here is my configuration file.
```
--datadir=/home/english_data/
--runname=english_train
--rundir=/home/training/
--tokensdir=/home/am/
--listdata=true
--train=lists/train.lst
--valid=lists/dev.lst
--input=wav
--arch=network.arch
--archdir=/home/
--lexicon=/home/am/librispeech-train+dev-unigram-10000-nbest10.lexicon
--tokens=librispeech-train-all-unigram-10000.tokens
--criterion=seq2seq
--lr=0.05
--lrcrit=0.05
--momentum=0.0
--stepsize=40
--gamma=0.5
--maxgradnorm=15
--mfsc=true
--use_saug=true
--dataorder=output_spiral
--inputbinsize=25
--filterbanks=80
--attention=keyvalue
--encoderdim=512
--attnWindow=softPretrain
--softwstd=4
--trainWithWindow=true
--pretrainWindow=3
--maxdecoderoutputlen=120
--usewordpiece=true
--wordseparator=_
--sampletarget=0.01
--target=ltr
--batchsize=4
--labelsmooth=0.05
--nthread=4
--memstepsize=4194304
--eostoken=true
--pcttraineval=1
--pctteacherforcing=99
--iter=200
--enable_distributed=true
```
Currently, I am getting this result:

```
epoch: 58 | lr: 0.025000 | lrcriterion: 0.025000 | runtime: 06:44:50 | bch(ms): 237.77 | smp(ms): 1.02 | fwd(ms): 14.86 | crit-fwd(ms): 1.07 | bwd(ms): 213.77 | optim(ms): 7.74 | loss: 32.41068 | train-LER: 31.00 | train-WER: 47.11 | lists/dev.lst-loss: 16.45153 | lists/dev.lst-LER: 22.34 | lists/dev.lst-WER: 35.14 | avg-isz: 1003 | avg-tsz: 018 | max-tsz: 130 | hrs: 4556.04 | thrpt(sec/sec): 675.23
```
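As an aside, picking the best snapshot out of such logs is easy to script. A minimal sketch, assuming the log sits at the hypothetical path below and each epoch emits a line in the format shown above:

```python
import re

# Hypothetical log path; the filename convention is an assumption.
LOG = "/home/training/english_train/001_log"

# Pull the epoch number and the dev-set Viterbi WER out of each log line.
pattern = re.compile(r"epoch:\s*(\d+).*?lists/dev\.lst-WER:\s*([\d.]+)")

best_epoch, best_wer = None, float("inf")
with open(LOG) as f:
    for line in f:
        m = pattern.search(line)
        if m and float(m.group(2)) < best_wer:
            best_epoch, best_wer = int(m.group(1)), float(m.group(2))

print(f"best dev WER {best_wer:.2f} at epoch {best_epoch}")
```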
I wanted to know how you are internally decoding dev.lst during this training. That is, are you using the greedy path or the beam-search decoder? Also, what parameters are you using for this, and among those parameters, which ones are being randomly chosen?
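To make the distinction concrete: "greedy path" decoding just keeps the single top-scoring token at every decoder step, with no beam and no LM. Below is a minimal sketch of that loop for a seq2seq model; the decoder step is a dummy stand-in for the trained model, and none of this is the actual wav2letter implementation.

```python
import numpy as np

np.random.seed(0)

EOS = 0          # --eostoken=true adds an end-of-sentence token
MAX_LEN = 120    # mirrors --maxdecoderoutputlen=120
VOCAB = 10000    # matches the 10k word-piece tokens file

def decoder_step(prev_token, state):
    """Dummy stand-in for one step of the attention decoder. It ignores its
    inputs and returns random scores; a real run would evaluate the trained
    seq2seq criterion here."""
    logits = np.random.randn(VOCAB)
    return logits, state

def greedy_decode():
    tokens, state, prev = [], None, EOS
    for _ in range(MAX_LEN):
        logits, state = decoder_step(prev, state)
        prev = int(np.argmax(logits))  # greedy: keep only the top-1 token
        if prev == EOS:                # stop at end-of-sentence
            break
        tokens.append(prev)
    return tokens

print(greedy_decode()[:10])
```

The beam-search decoder, by contrast, keeps several partial hypotheses per step and rescores them with hyper-parameters such as the LM weight and word score mentioned above, which is why it is run separately from training.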