flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Can the recently released seq2seq decoder work? #319

Closed jindyliu closed 5 years ago

jindyliu commented 5 years ago

I completed training with the example training configuration, but when I use the trained model to decode, no candidates are output.

[root@dd88309d-124c-4caf-b665-f87d8b1b66d1 /dockerdata/s2s_cfg]# /root/wav2letter_new/wav2letter/build/Decoder --flagsfile=/dockerdata/s2s_cfg/decode_back_em.cfg
I0603 18:15:30.205224 48896 Decode.cpp:57] Parsing command line flags
I0603 18:15:30.205379 48896 Decode.cpp:61] Reading flags from file /dockerdata/s2s_cfg/decode_back_em.cfg
I0603 18:15:30.205730 48896 Decode.cpp:80] [Serialization] Loading file: /dockerdata/s2s_cfg/test-other.lst,test-clean.lst.bin
I0603 18:15:31.628330 48896 Decode.cpp:86] [Network] Reading acoustic model from /root/volume/e2e_data_speech/seq2seq_tds_distributed_500/004_model_dev-clean.bin
I0603 18:15:33.049324 48896 Decode.cpp:90] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> output]
    (0): View (-1 80 1 0)
    (1): Conv2D (1->10, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
    (2): ReLU
    (3): Dropout (0.200000)
    (4): LayerNorm ( axes : { 3 } )
    (5): Time-Depth Separable Block (21, 80, 10)
    (6): Time-Depth Separable Block (21, 80, 10)
    (7): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
    (8): ReLU
    (9): Dropout (0.200000)
    (10): LayerNorm ( axes : { 3 } )
    (11): Time-Depth Separable Block (21, 80, 14)
    (12): Time-Depth Separable Block (21, 80, 14)
    (13): Time-Depth Separable Block (21, 80, 14)
    (14): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
    (15): ReLU
    (16): Dropout (0.200000)
    (17): LayerNorm ( axes : { 3 } )
    (18): Time-Depth Separable Block (21, 80, 18)
    (19): Time-Depth Separable Block (21, 80, 18)
    (20): Time-Depth Separable Block (21, 80, 18)
    (21): Time-Depth Separable Block (21, 80, 18)
    (22): Time-Depth Separable Block (21, 80, 18)
    (23): Time-Depth Separable Block (21, 80, 18)
    (24): View (0 1440 1 0)
    (25): Reorder (1,0,3,2)
    (26): Linear (1440->1024) (with bias)
I0603 18:15:33.049844 48896 Decode.cpp:93] [Criterion] Seq2SeqCriterion
I0603 18:15:33.049880 48896 Decode.cpp:95] [Network] Number of params: 36538460
I0603 18:15:33.049896 48896 Decode.cpp:101] [Network] Updating flags from config file: /root/volume/e2e_data_speech/seq2seq_tds_distributed_500/004_model_dev-clean.bin
I0603 18:15:33.050736 48896 Decode.cpp:111] Gflags after parsing --flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=; --arch=network.arch; --archdir=/root/volume/w2l_seq2seq_cfg; --attention=keyvalue; --attentionthreshold=0; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=16; --beamsize=1000; --beamthreshold=25; --channels=1; --criterion=seq2seq; --critoptim=sgd; --datadir=/dockerdata/s2s_data_speech; --dataorder=output_spiral; --decodertype=tkn; --devwin=0; --emission_dir=/dockerdata/s2s_cfg; --enable_distributed=true; --encoderdim=512; --eostoken=true; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/dockerdata/s2s_cfg/decode_back_em.cfg; --gamma=0.5; --garbage=false; --hardselection=1; --input=flac; --inputbinsize=25; --inputfeeding=false; --iter=200; --itersave=false; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=/dockerdata/s2s_data_speech/seq2seq/librispeech-train+dev-unigram-10000-nbest10.dict; --linlr=-1; --linlrcrit=-1; --linseg=0; --listdata=true; --lm=/dockerdata/s2s_data_speech/lm/4-gram.arpa.bin; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=2.5; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.050000000000000003; --lrcrit=0.050000000000000003; --maxdecoderoutputlen=120; --maxgradnorm=15; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=4194304; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=1; --nthread_decoder=1; --onorm=none; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=99; --pcttraineval=1; --pow=false; --pretrainWindow=3; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=/root/volume/e2e_data_speech; --runname=seq2seq_tds_distributed_500; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --sclite=/dockerdata/s2s_data_speech/decode_log/; --seed=0; --show=true; --showletters=true; --silweight=-0.5; --smearing=max; --smoothingtemperature=1; --softselection=inf; --softwoffset=10; --softwrate=5; --softwstd=4; --sqnorm=false; --stepsize=40; --surround=; --tag=; --target=ltr; --test=test-other.lst,test-clean.lst; --tokens=librispeech-train-all-unigram-10000.vocab-filtered; --tokensdir=/dockerdata/s2s_data_speech/seq2seq; --train=/root/volume/e2e_data_speech/train-clean-100.lst,/root/volume/e2e_data_speech/train-clean-360.lst,/root/volume/e2e_data_speech/train-other-500.lst; --trainWithWindow=true; --transdiag=0; --unkweight=-inf; --usewordpiece=true; --valid=dev-clean:/root/volume/e2e_data_speech/dev-clean.lst,dev-other:/root/volume/e2e_dataspeech/dev-other.lst; --weightdecay=0; --wordscore=1; --wordseparator=; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolizestacktrace=true; --v=0; --vmodule=;
I0603 18:15:33.056612 48896 Decode.cpp:117] Number of classes (network): 9998
I0603 18:15:34.233965 48896 Decode.cpp:124] Number of words: 89612
I0603 18:15:34.307266 48896 Decode.cpp:185] [Dataset] Number of samples per thread: 5559
I0603 18:15:34.611796 48896 Decode.cpp:268] [Decoder] LM constructed.
I0603 18:15:34.614811 48896 Decode.cpp:368] [Decoder] Seq2Seq decoder with token-LM loaded in thread: 0
[WARNING] No completed candidates.
|T|: he was such a big boy that he wore high boots and carried a jack knife
|P|:
|t|: h e w a s s u c h a b i g b o y t h a t h e w o r e h i g h b o o t s a n d c a r r i e d a j a c k k n i f e
|p|:
[sample: test-clean-7021-85628-0002, WER: 100%, LER: 100%, slice WER: 100%, slice LER: 100%, progress: 0.0179888%]
[WARNING] No completed candidates.
|T|: the case of lord mountnorris of all those which were collected with so much industry is the most flagrant and the least excusable
|P|:
|t|: t h e c a s e o f l o r d m o u n t n o r r i s o f a l l t h o s e w h i c h w e r e c o l l e c t e d w i t h s o m u c h i n d u s t r y i s t h e m o s t f l a g r a n t a n d t h e l e a s t e x c u s a b l e
|p|:
[sample: test-other-8188-274364-0002, WER: 100%, LER: 100%, slice WER: 100%, slice LER: 100%, progress: 0.0359777%]
[WARNING] No completed candidates.

Is there a problem with my decoding configuration?

xuqiantong commented 5 years ago

A comprehensive doc for the new decoder series is coming out soon. The seq2seq decoder is quite sensitive to the parameters you provide. You may try these parameters for now:

<path_to_your_binary>/decode_cpp \
-decodertype tkn \
-lm <path_to_your_token_LM_not_word_LM> \
-am <path_to_your_token_AM> \
-emission_dir <path_to_your_emission_dir> \
-test <path_to_your_test_set> \
-silweight 0 -maxdecoderoutputlen 120 -maxload -1 -nthread 1 -nthread_decoder 1 -smearing max -show -showletters \
-beamsize 80 -beamthreshold 7 -lmweight 1.2 -wordscore 2.0 -smoothingtemperature 1.0 \
-hardselection 1.5 -softselection 10.0 -attentionthreshold 30
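
Since the question invokes the decoder through --flagsfile, here is the same set of suggested options written as a gflags flags file, a minimal sketch only: every flag name below appears in the gflags dump above, but the file name decode_s2s.cfg is hypothetical and the bracketed paths are placeholders to fill in.

# decode_s2s.cfg (hypothetical file name); one --flag=value per line
--decodertype=tkn
--lm=<path_to_your_token_LM_not_word_LM>
--am=<path_to_your_token_AM>
--emission_dir=<path_to_your_emission_dir>
--test=<path_to_your_test_set>
--silweight=0
--maxdecoderoutputlen=120
--maxload=-1
--nthread=1
--nthread_decoder=1
--smearing=max
--show=true
--showletters=true
--beamsize=80
--beamthreshold=7
--lmweight=1.2
--wordscore=2.0
--smoothingtemperature=1.0
--hardselection=1.5
--softselection=10.0
--attentionthreshold=30

and then invoke the decoder with it:

<path_to_your_binary>/decode_cpp --flagsfile=decode_s2s.cfg

Note that the gflags dump in the question shows --lm pointing at 4-gram.arpa.bin, a word-level ARPA model, while the suggested command explicitly asks for a token LM rather than a word LM with --decodertype=tkn; that mismatch alone could plausibly explain the empty candidate lists above.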