mironnn opened 4 years ago
Hey! In your log you can see
I0911 11:03:37.456050 5675 Utils.cpp:102] Filtered 1/1 samples
which means that all samples were filtered. Could you run with --nthread_decoder=1 --maxisz=1000000 so that your sample is not filtered? Also, could you say what the target transcription length is?
I started the decoder with the --nthread_decoder=1 --maxisz=1000000 options and got the same result:
I0914 08:24:05.279361 112 W2lListFilesDataset.cpp:141] 1 files found.
I0914 08:24:05.279836 112 Utils.cpp:102] Filtered 1/1 samples
I0914 08:24:05.280282 112 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0914 08:24:05.281785 122 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
Also could you say what is the target transcription length? I have no ground-truth transcription for this file.
But I also tried with this line from LibriSpeech:
dev-clean-1272-128104-0000 ./w2l/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac 5.855 mister quilter is the apostle of the middle classes and we are glad to welcome his gospel
both as FLAC and as WAV converted with sox. These samples are also filtered.
Could you please advise?
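For reference, a conversion along these lines should produce matching audio. This is only a sketch, since the exact sox command wasn't given in the thread; 16 kHz mono 16-bit matches the --samplerate=16000 and --channels=1 settings shown in the logs below, and the file names are placeholders:
# Hypothetical sox invocation: FLAC to 16 kHz mono 16-bit WAV
sox input.flac -r 16000 -c 1 -b 16 output.wav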
Could you try with --maxisz=1000000 --maxtsz=1000000 --minisz=0 --mintsz=0?
Yes, thank you, --minisz=0 helped and it worked 👍
But I noticed that if I use audio without a ground-truth transcription, the file is also filtered. For example, a row from the test.lst file like
flac /root/host/flac.wav 5.855 mister quilter is the apostle of the middle classes and we are glad to welcome his gospe
-> will work, and
flac /root/host/flac.wav 5.855
-> the file will be filtered.
How can I avoid this?
Could you try --minisz=-1? Can you confirm that in this case you still see
I0914 08:24:05.279836 112 Utils.cpp:102] Filtered 1/1 samples
Yes, Filtered 1/1 samples.
I've tried --minisz=-1 and my lst file contains:
flac /root/host/flac.wav 5.855
(it is audio from the LibriSpeech dataset converted to WAV)
Full log:
root@5ad90d8a5ec1:~/wav2letter/build# export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64:$LD_LIBRARY_PATH
root@5ad90d8a5ec1:~/wav2letter/build# ./Decoder \
> --flagsfile /root/host/model/decode_500ms_right_future_ngram_other.cfg \
> --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt \
> --lmweight=0.5515838301157 \
> --wordscore=0.52526055643809 \
> --minloglevel=0 \
> --logtostderr=1 \
> --nthread_decoder=1 \
> --maxisz=1000000 \
> --minisz=1
I0921 07:25:27.469930 23 Decode.cpp:58] Reading flags from file /root/host/model/decode_500ms_right_future_ngram_other.cfg
I0921 07:25:27.473843 23 Decode.cpp:75] [Network] Reading acoustic model from /root/host/model/am_500ms_future_context_dev_other.bin
I0921 07:25:30.220180 23 Decode.cpp:79] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> output]
(0): View (-1 80 1 0)
(1): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
(2): Padding (0, { (5, 3), (0, 0), (0, 0), (0, 0), })
(3): Conv2D (1->15, 10x1, 2,1, 0,0, 1, 1) (with bias)
(4): ReLU
(5): Dropout (0.100000)
(6): LayerNorm ( axis : { 1 2 } , size : -1)
(7): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
(8): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
(9): Padding (0, { (7, 1), (0, 0), (0, 0), (0, 0), })
(10): Conv2D (15->19, 10x1, 2,1, 0,0, 1, 1) (with bias)
(11): ReLU
(12): Dropout (0.100000)
(13): LayerNorm ( axis : { 1 2 } , size : -1)
(14): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(15): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(16): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(17): Padding (0, { (9, 1), (0, 0), (0, 0), (0, 0), })
(18): Conv2D (19->23, 12x1, 2,1, 0,0, 1, 1) (with bias)
(19): ReLU
(20): Dropout (0.100000)
(21): LayerNorm ( axis : { 1 2 } , size : -1)
(22): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(23): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(24): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(25): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(26): Padding (0, { (10, 0), (0, 0), (0, 0), (0, 0), })
(27): Conv2D (23->27, 11x1, 1,1, 0,0, 1, 1) (with bias)
(28): ReLU
(29): Dropout (0.100000)
(30): LayerNorm ( axis : { 1 2 } , size : -1)
(31): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(32): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(33): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(34): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(35): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(36): Reorder (2,1,0,3)
(37): View (2160 -1 1 0)
(38): Linear (2160->9998) (with bias)
(39): View (9998 0 -1 1)
I0921 07:25:30.220360 23 Decode.cpp:82] [Criterion] ConnectionistTemporalClassificationCriterion
I0921 07:25:30.220366 23 Decode.cpp:84] [Network] Number of params: 115111823
I0921 07:25:30.220383 23 Decode.cpp:90] [Network] Updating flags from config file: /root/host/model/am_500ms_future_context_dev_other.bin
I0921 07:25:30.225986 23 Decode.cpp:106] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/host/model/am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=500; --beamsizetoken=100; --beamthreshold=100; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/root/host/lists; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/host/model/decode_500ms_right_future_ngram_other.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/root/host/model/decoder/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0.5515838301157; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=1000000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=1; --minrate=3; --minsil=0; --mintsz=2; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/root/host/sclitedir; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=3test.lst; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/root/host/model/am; 
--train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0.52526055643809; --wordseparator=_; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0921 07:25:30.236332 23 Decode.cpp:127] Number of classes (network): 9998
I0921 07:25:32.854022 23 Decode.cpp:134] Number of words: 200001
I0921 07:25:33.202980 23 Decode.cpp:247] [Decoder] LM constructed.
I0921 07:25:37.651134 23 Decode.cpp:274] [Decoder] Trie planted.
I0921 07:25:38.087067 23 Decode.cpp:286] [Decoder] Trie smeared.
I0921 07:25:38.992707 23 W2lListFilesDataset.cpp:141] 1 files found.
I0921 07:25:38.993213 23 Utils.cpp:102] Filtered 1/1 samples
I0921 07:25:38.993324 23 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0921 07:25:38.995147 33 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
cc @vineelpratap is there any option to accept empty transcriptions?
Could you use --mintsz=-1 and see if it helps?
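(mintsz is the minimum target, i.e. transcription, size: a row with no transcription has a target size of 0, which is below the mintsz=2 visible in the log above, so the sample gets filtered; -1 disables the lower bound.)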
I see in the log that mintsz=2, not -1. Could you try to fix it?
@tlikhomanenko Thank you for your comment. Yes, you're right.
I set mintsz=-1 and the file is not filtered, but the decoder still hangs.
root@6ca99de58c9c:~/wav2letter/build# ./Decoder \
> --flagsfile /root/host/model/decode_500ms_right_future_ngram_other.cfg \
> --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt \
> --lmweight=0.5515838301157 \
> --wordscore=0.52526055643809 \
> --minloglevel=0 \
> --logtostderr=1 \
> --nthread_decoder=1 \
> --maxisz=100000 \
> --minisz=1 \
> --mintsz=-1
I0929 10:45:00.953776 133 Decode.cpp:58] Reading flags from file /root/host/model/decode_500ms_right_future_ngram_other.cfg
I0929 10:45:00.957911 133 Decode.cpp:75] [Network] Reading acoustic model from /root/host/model/am_500ms_future_context_dev_other.bin
I0929 10:45:02.714236 133 Decode.cpp:79] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> output]
(0): View (-1 80 1 0)
(1): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
(2): Padding (0, { (5, 3), (0, 0), (0, 0), (0, 0), })
(3): Conv2D (1->15, 10x1, 2,1, 0,0, 1, 1) (with bias)
(4): ReLU
(5): Dropout (0.100000)
(6): LayerNorm ( axis : { 1 2 } , size : -1)
(7): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
(8): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
(9): Padding (0, { (7, 1), (0, 0), (0, 0), (0, 0), })
(10): Conv2D (15->19, 10x1, 2,1, 0,0, 1, 1) (with bias)
(11): ReLU
(12): Dropout (0.100000)
(13): LayerNorm ( axis : { 1 2 } , size : -1)
(14): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(15): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(16): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
(17): Padding (0, { (9, 1), (0, 0), (0, 0), (0, 0), })
(18): Conv2D (19->23, 12x1, 2,1, 0,0, 1, 1) (with bias)
(19): ReLU
(20): Dropout (0.100000)
(21): LayerNorm ( axis : { 1 2 } , size : -1)
(22): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(23): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(24): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(25): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
(26): Padding (0, { (10, 0), (0, 0), (0, 0), (0, 0), })
(27): Conv2D (23->27, 11x1, 1,1, 0,0, 1, 1) (with bias)
(28): ReLU
(29): Dropout (0.100000)
(30): LayerNorm ( axis : { 1 2 } , size : -1)
(31): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(32): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(33): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(34): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(35): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
(36): Reorder (2,1,0,3)
(37): View (2160 -1 1 0)
(38): Linear (2160->9998) (with bias)
(39): View (9998 0 -1 1)
I0929 10:45:02.714473 133 Decode.cpp:82] [Criterion] ConnectionistTemporalClassificationCriterion
I0929 10:45:02.714479 133 Decode.cpp:84] [Network] Number of params: 115111823
I0929 10:45:02.714498 133 Decode.cpp:90] [Network] Updating flags from config file: /root/host/model/am_500ms_future_context_dev_other.bin
I0929 10:45:02.718485 133 Decode.cpp:106] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/host/model/am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=500; --beamsizetoken=100; --beamthreshold=100; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/root/host/lists; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/host/model/decode_500ms_right_future_ngram_other.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/root/host/model/decoder/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0.5515838301157; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=100000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=1; --minrate=3; --minsil=0; --mintsz=-1; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/root/host/sclitedir; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=1test.lst; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/root/host/model/am; 
--train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0.52526055643809; --wordseparator=_; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0929 10:45:02.730916 133 Decode.cpp:127] Number of classes (network): 9998
I0929 10:45:04.877595 133 Decode.cpp:134] Number of words: 200001
I0929 10:45:05.131882 133 Decode.cpp:247] [Decoder] LM constructed.
I0929 10:45:08.631345 133 Decode.cpp:274] [Decoder] Trie planted.
I0929 10:45:09.117630 133 Decode.cpp:286] [Decoder] Trie smeared.
I0929 10:45:10.147207 133 W2lListFilesDataset.cpp:141] 1 files found.
I0929 10:45:10.148475 133 Utils.cpp:102] Filtered 0/1 samples
I0929 10:45:10.149003 133 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 1
I0929 10:45:10.154240 143 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
@mironnn Can you please help me use wav2letter for running inference on a sample audio file? I would like to know where you downloaded the pre-trained models and other files such as the lexicon and token files. Basically, I need information on the items below.
|-- model
| |-- 3-gram.pruned.3e-7.bin.qt
| |-- am
| | `-- librispeech-train-all-unigram-10000.tokens
| |-- am_500ms_future_context.arch
| |-- am_500ms_future_context_dev_other.bin
| |-- decode_500ms_right_future_ngram_other.cfg
| `-- decoder
| `-- decoder-unigram-10000-nbest10.lexicon
Please help me regarding this. Thank you
Regards, Manoj
You can find all of it here: https://github.com/facebookresearch/wav2letter/tree/master/recipes/streaming_convnets/librispeech
@mironnn Could you confirm that the decoder works for you when the transcription is not empty? If so, we will fix the issue with empty transcriptions later (the decoder hangs because an exception is not propagated correctly to terminate the program; we are already fixing this).
Please use a non-empty transcription for now, sorry for the inconvenience!
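As a stopgap, a list file can be given placeholder transcriptions so its samples pass the mintsz filter. This is only a sketch, assuming the three-field "id path duration" layout shown above; the dummy word is arbitrary (though it should exist in the lexicon), test_nonempty.lst is a hypothetical output name, and WER measured against the dummy reference is of course meaningless:
# Append a dummy word to rows that have only "id path duration" (3 fields)
awk 'NF == 3 { print $0, "the"; next } { print }' test.lst > test_nonempty.lst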
Hi,
Please make sure you are using the --show flag.
Another possible scenario where this can happen is when an error occurs while decoding inside the thread pool. You can try making the following change in Decode.cpp and Test.cpp to make sure the error is shown in the logs.
// before
threadPool.enqueue(...);

// after
auto fut = threadPool.enqueue(...);
fut.get();
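Calling get() on the returned std::future blocks until the enqueued task finishes and rethrows any exception raised inside the worker thread, so a failure during decoding shows up in the logs instead of the program hanging silently.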
Hi, could you please advise? I've tried to run the decoder with the streaming_convnets pre-built model and it hangs.
Reproduction Steps:
docker run -v ~/ML/recipes/streaming_convnets:/root/host --rm -itd --ipc=host --name w2l_streaming_convnets wav2letter/wav2letter:cpu-latest
All other commands are run inside the docker container:
export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64:$LD_LIBRARY_PATH
cd /root/wav2letter/build
My file structure in /root/host:
test.lst.hyp, test.lst.log and test.lst.ref are empty and not updated.
Config: what I have in decode_500ms_right_future_ngram_other.cfg
List: what I have in test.lst (no ground truth, because the wiki says it is not obligatory)
Audio: the 5.wav file is 10 seconds long, 16 kHz, 1 channel (mono), 16-bit.
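To double-check that the audio matches what the model expects (--samplerate=16000 and --channels=1 in the flags above), the soxi utility from the SoX package prints the header info; the path below is assumed from the mount and file name mentioned in this report:
# Expect: Channels: 1, Sample Rate: 16000, Precision: 16-bit
soxi /root/host/5.wav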
Used: MacBook 15". CPU docker image with 8 GB RAM and 3 cores (no GPU).
Also: inside wav2letter:cpu-latest I've tried rebuilding flashlight and wav2letter from the v0.2 branches, as described in the dependencies section of the model README. The result is the same: the decoder hangs.
Full log:
Last lines of strace output (in case it is useful):
No CPU usage after this:
Git log: