flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Decoder hangs up with pre-trained model #826

Open · mironnn opened this issue 4 years ago

mironnn commented 4 years ago

Hi, could you please advise? I've tried to run the decoder with the streaming_convnets pre-trained model, and it hangs at this step:

I0911 09:35:51.761121  5618 W2lListFilesDataset.cpp:141] 1 files found.
I0911 09:35:51.761857  5618 Utils.cpp:102] Filtered 1/1 samples
I0911 09:35:51.762053  5618 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0911 09:35:51.762890  5628 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 2
I0911 09:35:51.763069  5629 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
I0911 09:35:51.763093  5630 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 1

Reproduction Steps:

  1. docker run -v ~/ML/recipes/streaming_convnets:/root/host --rm -itd --ipc=host --name w2l_streaming_convnets wav2letter/wav2letter:cpu-latest (all subsequent commands are run inside the Docker container)
  2. export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64:$LD_LIBRARY_PATH
  3. cd /root/wav2letter/build
  4. ./Decoder \
    --flagsfile /root/host/model/decode_500ms_right_future_ngram_other.cfg \
    --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt \
    --lmweight=0.5515838301157 \
    --wordscore=0.52526055643809 \
    --minloglevel=0 \
    --logtostderr=1 \
    --nthread_decoder=3

My file structure in /root/host

root@9c696da2978d:~/host# tree
.
|-- 5.wav
|-- model
|   |-- 3-gram.pruned.3e-7.bin.qt
|   |-- am
|   |   `-- librispeech-train-all-unigram-10000.tokens
|   |-- am_500ms_future_context.arch
|   |-- am_500ms_future_context_dev_other.bin
|   |-- decode_500ms_right_future_ngram_other.cfg
|   `-- decoder
|       `-- decoder-unigram-10000-nbest10.lexicon
|-- test.lst
|-- test.lst.hyp
|-- test.lst.log
`-- test.lst.ref

test.lst.hyp, test.lst.log and test.lst.ref are empty and not updated.

Config: what I have in decode_500ms_right_future_ngram_other.cfg:

# Decoding config for Librispeech
# Replace `[...]`, `[DATA_DST]`, `[MODEL_DST]` with appropriate paths
# for test-other (best params for dev-other)
--am=/root/host/model/am_500ms_future_context_dev_other.bin
--tokensdir=/root/host/model/am
--tokens=librispeech-train-all-unigram-10000.tokens
--lexicon=/root/host/model/decoder/decoder-unigram-10000-nbest10.lexicon
--datadir=/root/host
--test=test.lst
--uselexicon=true
--sclite=/root/host
--decodertype=wrd
--lmtype=kenlm
--silscore=0
--beamsize=500
--beamsizetoken=100
--beamthreshold=100
--nthread_decoder=8
--smearing=max
--show
--showletters

List: what I have in test.lst (no ground-truth transcription, because the wiki says it is not mandatory):

0: /root/host/5.wav 10000
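
(For reference: each line in a wav2letter list file is expected to contain a sample ID, the audio path, the audio duration, and, optionally, the reference transcription. A hypothetical line with a transcription would look like the example below; as the discussion further down shows, lines without a transcription can be filtered out by the default --mintsz setting.)

0: /root/host/5.wav 10000 some reference transcription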

Audio: 5.wav is 10 seconds long, 16 kHz, 1 channel (mono), 16-bit

Used: MacBook 15"; CPU Docker image with 8 GB RAM and 3 cores (no GPU)

Also: I've tried rebuilding flashlight and wav2letter from the v0.2 branches inside wav2letter:cpu-latest, as described in the dependencies section of the model README. The result is the same: the decoder hangs.

Full Log:

root@9c696da2978d:~/wav2letter/build# ./Decoder \
>   --flagsfile /root/host/model/decode_500ms_right_future_ngram_other.cfg \
>   --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt \
>   --lmweight=0.5515838301157 \
>   --wordscore=0.52526055643809 \
>   --minloglevel=0 \
>   --logtostderr=1 \
>   --nthread_decoder=3
I0911 11:03:28.951413  5675 Decode.cpp:58] Reading flags from file /root/host/model/decode_500ms_right_future_ngram_other.cfg
I0911 11:03:28.955456  5675 Decode.cpp:75] [Network] Reading acoustic model from /root/host/model/am_500ms_future_context_dev_other.bin
I0911 11:03:30.589258  5675 Decode.cpp:79] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> output]
    (0): View (-1 80 1 0)
    (1): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
    (2): Padding (0, { (5, 3), (0, 0), (0, 0), (0, 0), })
    (3): Conv2D (1->15, 10x1, 2,1, 0,0, 1, 1) (with bias)
    (4): ReLU
    (5): Dropout (0.100000)
    (6): LayerNorm ( axis : { 1 2 } , size : -1)
    (7): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
    (8): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
    (9): Padding (0, { (7, 1), (0, 0), (0, 0), (0, 0), })
    (10): Conv2D (15->19, 10x1, 2,1, 0,0, 1, 1) (with bias)
    (11): ReLU
    (12): Dropout (0.100000)
    (13): LayerNorm ( axis : { 1 2 } , size : -1)
    (14): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
    (15): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
    (16): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
    (17): Padding (0, { (9, 1), (0, 0), (0, 0), (0, 0), })
    (18): Conv2D (19->23, 12x1, 2,1, 0,0, 1, 1) (with bias)
    (19): ReLU
    (20): Dropout (0.100000)
    (21): LayerNorm ( axis : { 1 2 } , size : -1)
    (22): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (23): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (24): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (25): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (26): Padding (0, { (10, 0), (0, 0), (0, 0), (0, 0), })
    (27): Conv2D (23->27, 11x1, 1,1, 0,0, 1, 1) (with bias)
    (28): ReLU
    (29): Dropout (0.100000)
    (30): LayerNorm ( axis : { 1 2 } , size : -1)
    (31): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (32): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (33): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (34): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (35): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (36): Reorder (2,1,0,3)
    (37): View (2160 -1 1 0)
    (38): Linear (2160->9998) (with bias)
    (39): View (9998 0 -1 1)
I0911 11:03:30.589411  5675 Decode.cpp:82] [Criterion] ConnectionistTemporalClassificationCriterion
I0911 11:03:30.589418  5675 Decode.cpp:84] [Network] Number of params: 115111823
I0911 11:03:30.589581  5675 Decode.cpp:90] [Network] Updating flags from config file: /root/host/model/am_500ms_future_context_dev_other.bin
I0911 11:03:30.593014  5675 Decode.cpp:106] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/host/model/am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=500; --beamsizetoken=100; --beamthreshold=100; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/root/host; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/host/model/decode_500ms_right_future_ngram_other.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/root/host/model/decoder/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0.5515838301157; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=33000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=200; --minrate=3; --minsil=0; --mintsz=2; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=3; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/root/host; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=test.lst; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/root/host/model/am; 
--train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0.52526055643809; --wordseparator=_; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0911 11:03:30.603224  5675 Decode.cpp:127] Number of classes (network): 9998
I0911 11:03:32.673702  5675 Decode.cpp:134] Number of words: 200001
I0911 11:03:32.904979  5675 Decode.cpp:247] [Decoder] LM constructed.
I0911 11:03:36.306375  5675 Decode.cpp:274] [Decoder] Trie planted.
I0911 11:03:36.761252  5675 Decode.cpp:286] [Decoder] Trie smeared.
I0911 11:03:37.455034  5675 W2lListFilesDataset.cpp:141] 1 files found.
I0911 11:03:37.456050  5675 Utils.cpp:102] Filtered 1/1 samples
I0911 11:03:37.456212  5675 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0911 11:03:37.456913  5685 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
I0911 11:03:37.457432  5687 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 1
I0911 11:03:37.457517  5686 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 2

Last lines of strace output (in case it is useful):

brk(0x5621214aa000)                     = 0x5621214aa000
brk(0x5621214cb000)                     = 0x5621214cb000
brk(0x5621214ec000)                     = 0x5621214ec000
brk(0x56212150d000)                     = 0x56212150d000
openat(AT_FDCWD, "/root/host/test.lst", O_RDONLY) = 8
read(8, "0: /root/host/5.wav 10000\n", 8191) = 26
read(8, "", 8191)                       = 0
gettid()                                = 5692
write(2, "I0911 11:30:35.061928  5692 W2lL"..., 73I0911 11:30:35.061928  5692 W2lListFilesDataset.cpp:141] 1 files found.
) = 73
close(8)                                = 0
gettid()                                = 5692
write(2, "I0911 11:30:35.064453  5692 Util"..., 64I0911 11:30:35.064453  5692 Utils.cpp:102] Filtered 1/1 samples
) = 64
gettid()                                = 5692
write(2, "I0911 11:30:35.066020  5692 W2lL"..., 86I0911 11:30:35.066020  5692 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
) = 86
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc9205ec000
mprotect(0x7fc9205ed000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7fc920debef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fc920dec9d0, tls=0x7fc920dec700, child_tidptr=0x7fc920dec9d0) = 5702
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc91fdeb000
mprotect(0x7fc91fdec000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7fc9205eaef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fc9205eb9d0, tls=0x7fc9205eb700, child_tidptr=0x7fc9205eb9d0) = 5703
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc91f5ea000
mprotect(0x7fc91f5eb000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7fc91fde9ef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fc91fdea9d0, tls=0x7fc91fdea700, child_tidptr=0x7fc91fdea9d0) = 5704
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc91ede9000
mprotect(0x7fc91edea000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7fc91f5e8ef0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7fc91f5e99d0, tls=0x7fc91f5e9700, child_tidptr=0x7fc91f5e99d0) = 5705
futex(0x7fff63119ff8, FUTEX_WAKE_PRIVATE, 1) = 1
I0911 11:30:35.073495  5702 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
futex(0x7fff63119ff8, FUTEX_WAKE_PRIVATE, 1) = 1
futex(0x7fff63119ff8, FUTEX_WAKE_PRIVATE, 1) = 1
I0911 11:30:35.074316  5704 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 1
futex(0x7fff63119ff8, FUTEX_WAKE_PRIVATE, 1) = 1
I0911 11:30:35.074676  5705 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 2
futex(0x7fff63119ffc, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fc920dec9d0, FUTEX_WAIT, 5702, NULL

No CPU usage after this point (screenshot omitted).

Git log:

root@9c696da2978d:~/wav2letter/build# git log
commit 55e9ebc233a001a21c4033aa2a8a60dbf1fe62ec (grafted, HEAD -> master, origin/master)
Author: Tatiana Likhomanenko <antares@fb.com>
Date:   Wed Aug 12 13:08:08 2020 -0700

    fix lexicon-free https://github.com/facebookresearch/wav2letter/issues/777; add mosesdecoder version for sota/2019

    Summary: title

    Reviewed By: vineelpratap

    Differential Revision: D23063624

    fbshipit-source-id: ffb59e483c5ccbb0c0d8145d7c8afc610c15287a
tlikhomanenko commented 4 years ago

Hey! In your log you can see I0911 11:03:37.456050 5675 Utils.cpp:102] Filtered 1/1 samples, which means all samples were filtered out. Could you run with --nthread_decoder=1 --maxisz=1000000 so that your sample is not filtered? Also, could you say what the target transcription length is?

mironnn commented 4 years ago

I started the decoder with the --nthread_decoder=1 --maxisz=1000000 options and got the same result:

I0914 08:24:05.279361   112 W2lListFilesDataset.cpp:141] 1 files found.
I0914 08:24:05.279836   112 Utils.cpp:102] Filtered 1/1 samples
I0914 08:24:05.280282   112 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0914 08:24:05.281785   122 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0

"Also could you say what is the target transcription length?" I have no ground-truth transcription for this file.

But I also tried with dev-clean-1272-128104-0000 ./w2l/LibriSpeech/dev-clean/1272/128104/1272-128104-0000.flac 5.855 mister quilter is the apostle of the middle classes and we are glad to welcome his gospel from LibriSpeech, both as FLAC and as WAV converted with sox. Those samples are also filtered. Could you please advise?

tlikhomanenko commented 4 years ago

Could you try with --maxisz=1000000 --maxtsz=1000000 --minisz=0 --mintsz=0?

mironnn commented 4 years ago

Yes, thank you, --minisz=0 helped and it worked 👍

But I noticed that if I use audio without ground truth, the file is still filtered. For example:

this row from the test.lst file: flac /root/host/flac.wav 5.855 mister quilter is the apostle of the middle classes and we are glad to welcome his gospe -> will work

whereas

flac /root/host/flac.wav 5.855 --> the file will be filtered.

How can I avoid this?

tlikhomanenko commented 4 years ago

Could you try --minisz=-1? Can you confirm that you still see I0914 08:24:05.279836 112 Utils.cpp:102] Filtered 1/1 samples in that case?

mironnn commented 4 years ago

Yes, I still see Filtered 1/1 samples.

I've tried --minisz=-1 and my lst file contains: flac /root/host/flac.wav 5.855 (it is audio from the LibriSpeech dataset converted to wav)

Full log

root@5ad90d8a5ec1:~/wav2letter/build# export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64:$LD_IBRARY_PATH
root@5ad90d8a5ec1:~/wav2letter/build# ./Decoder \
>   --flagsfile /root/host/model/decode_500ms_right_future_ngram_other.cfg \
>   --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt \
>   --lmweight=0.5515838301157 \
>   --wordscore=0.52526055643809 \
>   --minloglevel=0 \
>   --logtostderr=1 \
>   --nthread_decoder=1 \
>   --maxisz=1000000 \
>   --minisz=1
I0921 07:25:27.469930    23 Decode.cpp:58] Reading flags from file /root/host/model/decode_500ms_right_future_ngram_other.cfg
I0921 07:25:27.473843    23 Decode.cpp:75] [Network] Reading acoustic model from /root/host/model/am_500ms_future_context_dev_other.bin
I0921 07:25:30.220180    23 Decode.cpp:79] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> output]
    (0): View (-1 80 1 0)
    (1): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
    (2): Padding (0, { (5, 3), (0, 0), (0, 0), (0, 0), })
    (3): Conv2D (1->15, 10x1, 2,1, 0,0, 1, 1) (with bias)
    (4): ReLU
    (5): Dropout (0.100000)
    (6): LayerNorm ( axis : { 1 2 } , size : -1)
    (7): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
    (8): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
    (9): Padding (0, { (7, 1), (0, 0), (0, 0), (0, 0), })
    (10): Conv2D (15->19, 10x1, 2,1, 0,0, 1, 1) (with bias)
    (11): ReLU
    (12): Dropout (0.100000)
    (13): LayerNorm ( axis : { 1 2 } , size : -1)
    (14): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
    (15): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
    (16): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
    (17): Padding (0, { (9, 1), (0, 0), (0, 0), (0, 0), })
    (18): Conv2D (19->23, 12x1, 2,1, 0,0, 1, 1) (with bias)
    (19): ReLU
    (20): Dropout (0.100000)
    (21): LayerNorm ( axis : { 1 2 } , size : -1)
    (22): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (23): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (24): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (25): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (26): Padding (0, { (10, 0), (0, 0), (0, 0), (0, 0), })
    (27): Conv2D (23->27, 11x1, 1,1, 0,0, 1, 1) (with bias)
    (28): ReLU
    (29): Dropout (0.100000)
    (30): LayerNorm ( axis : { 1 2 } , size : -1)
    (31): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (32): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (33): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (34): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (35): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (36): Reorder (2,1,0,3)
    (37): View (2160 -1 1 0)
    (38): Linear (2160->9998) (with bias)
    (39): View (9998 0 -1 1)
I0921 07:25:30.220360    23 Decode.cpp:82] [Criterion] ConnectionistTemporalClassificationCriterion
I0921 07:25:30.220366    23 Decode.cpp:84] [Network] Number of params: 115111823
I0921 07:25:30.220383    23 Decode.cpp:90] [Network] Updating flags from config file: /root/host/model/am_500ms_future_context_dev_other.bin
I0921 07:25:30.225986    23 Decode.cpp:106] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/host/model/am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=500; --beamsizetoken=100; --beamthreshold=100; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/root/host/lists; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/host/model/decode_500ms_right_future_ngram_other.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/root/host/model/decoder/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0.5515838301157; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=1000000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=1; --minrate=3; --minsil=0; --mintsz=2; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/root/host/sclitedir; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=3test.lst; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/root/host/model/am; 
--train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0.52526055643809; --wordseparator=_; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0921 07:25:30.236332    23 Decode.cpp:127] Number of classes (network): 9998
I0921 07:25:32.854022    23 Decode.cpp:134] Number of words: 200001
I0921 07:25:33.202980    23 Decode.cpp:247] [Decoder] LM constructed.
I0921 07:25:37.651134    23 Decode.cpp:274] [Decoder] Trie planted.
I0921 07:25:38.087067    23 Decode.cpp:286] [Decoder] Trie smeared.
I0921 07:25:38.992707    23 W2lListFilesDataset.cpp:141] 1 files found.
I0921 07:25:38.993213    23 Utils.cpp:102] Filtered 1/1 samples
I0921 07:25:38.993324    23 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 0
I0921 07:25:38.995147    33 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
tlikhomanenko commented 4 years ago

cc @vineelpratap is there any option to accept empty transcriptions?

vineelpratap commented 4 years ago

Could you use --mintsz=-1 and see if it helps?
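
(For context, here is a minimal sketch, not the actual wav2letter source, of the sample filtering these flags control: a sample is kept only if both its input size and its target, i.e. transcription, size fall inside the configured bounds. With the default mintsz=2 visible in the gflags dumps above, a sample with an empty transcription has target size 0 and is dropped.)

#include <cstdint>

// Hedged sketch of the filter implied by --minisz/--maxisz/--mintsz/--maxtsz;
// names and signature are illustrative, not the real wav2letter implementation.
bool keepSample(int64_t inputSz, int64_t targetSz,
                int64_t minisz, int64_t maxisz,
                int64_t mintsz, int64_t maxtsz) {
  if (inputSz < minisz || inputSz > maxisz) {
    return false; // audio too short or too long
  }
  if (targetSz < mintsz || targetSz > maxtsz) {
    return false; // transcription too short (e.g. empty) or too long
  }
  return true;
}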

tlikhomanenko commented 4 years ago

(Quoting mironnn's previous comment and its full log.)

I see in the log that mintsz=2, not -1. Could you try to fix that?

mironnn commented 4 years ago

@tlikhomanenko Thank you for your comment. Yes, you're right.

I set --mintsz=-1 and the file is no longer filtered, but the decoder still hangs.

root@6ca99de58c9c:~/wav2letter/build# ./Decoder \
>    --flagsfile /root/host/model/decode_500ms_right_future_ngram_other.cfg \
>    --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt \
>    --lmweight=0.5515838301157 \
>    --wordscore=0.52526055643809 \
>    --minloglevel=0 \
>    --logtostderr=1 \
>    --nthread_decoder=1 \
>    --maxisz=100000 \
>    --minisz=1 \
>    --mintsz=-1
I0929 10:45:00.953776   133 Decode.cpp:58] Reading flags from file /root/host/model/decode_500ms_right_future_ngram_other.cfg
I0929 10:45:00.957911   133 Decode.cpp:75] [Network] Reading acoustic model from /root/host/model/am_500ms_future_context_dev_other.bin
I0929 10:45:02.714236   133 Decode.cpp:79] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> (38) -> (39) -> output]
    (0): View (-1 80 1 0)
    (1): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
    (2): Padding (0, { (5, 3), (0, 0), (0, 0), (0, 0), })
    (3): Conv2D (1->15, 10x1, 2,1, 0,0, 1, 1) (with bias)
    (4): ReLU
    (5): Dropout (0.100000)
    (6): LayerNorm ( axis : { 1 2 } , size : -1)
    (7): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
    (8): Time-Depth Separable Block (9, 80, 15) [1200 -> 1200 -> 1200]
    (9): Padding (0, { (7, 1), (0, 0), (0, 0), (0, 0), })
    (10): Conv2D (15->19, 10x1, 2,1, 0,0, 1, 1) (with bias)
    (11): ReLU
    (12): Dropout (0.100000)
    (13): LayerNorm ( axis : { 1 2 } , size : -1)
    (14): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
    (15): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
    (16): Time-Depth Separable Block (9, 80, 19) [1520 -> 1520 -> 1520]
    (17): Padding (0, { (9, 1), (0, 0), (0, 0), (0, 0), })
    (18): Conv2D (19->23, 12x1, 2,1, 0,0, 1, 1) (with bias)
    (19): ReLU
    (20): Dropout (0.100000)
    (21): LayerNorm ( axis : { 1 2 } , size : -1)
    (22): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (23): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (24): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (25): Time-Depth Separable Block (11, 80, 23) [1840 -> 1840 -> 1840]
    (26): Padding (0, { (10, 0), (0, 0), (0, 0), (0, 0), })
    (27): Conv2D (23->27, 11x1, 1,1, 0,0, 1, 1) (with bias)
    (28): ReLU
    (29): Dropout (0.100000)
    (30): LayerNorm ( axis : { 1 2 } , size : -1)
    (31): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (32): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (33): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (34): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (35): Time-Depth Separable Block (11, 80, 27) [2160 -> 2160 -> 2160]
    (36): Reorder (2,1,0,3)
    (37): View (2160 -1 1 0)
    (38): Linear (2160->9998) (with bias)
    (39): View (9998 0 -1 1)
I0929 10:45:02.714473   133 Decode.cpp:82] [Criterion] ConnectionistTemporalClassificationCriterion
I0929 10:45:02.714479   133 Decode.cpp:84] [Network] Number of params: 115111823
I0929 10:45:02.714498   133 Decode.cpp:90] [Network] Updating flags from config file: /root/host/model/am_500ms_future_context_dev_other.bin
I0929 10:45:02.718485   133 Decode.cpp:106] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/host/model/am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=500; --beamsizetoken=100; --beamthreshold=100; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=/root/host/lists; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/host/model/decode_500ms_right_future_ngram_other.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/root/host/model/decoder/decoder-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/root/host/model/3-gram.pruned.3e-7.bin.qt; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0.5515838301157; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=100000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=1; --minrate=3; --minsil=0; --mintsz=-1; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=/root/host/sclitedir; --seed=0; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=1test.lst; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/root/host/model/am; 
--train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0.52526055643809; --wordseparator=_; --world_rank=0; --world_size=32; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0929 10:45:02.730916   133 Decode.cpp:127] Number of classes (network): 9998
I0929 10:45:04.877595   133 Decode.cpp:134] Number of words: 200001
I0929 10:45:05.131882   133 Decode.cpp:247] [Decoder] LM constructed.
I0929 10:45:08.631345   133 Decode.cpp:274] [Decoder] Trie planted.
I0929 10:45:09.117630   133 Decode.cpp:286] [Decoder] Trie smeared.
I0929 10:45:10.147207   133 W2lListFilesDataset.cpp:141] 1 files found.
I0929 10:45:10.148475   133 Utils.cpp:102] Filtered 0/1 samples
I0929 10:45:10.149003   133 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 1
I0929 10:45:10.154240   143 Decode.cpp:511] [Decoder] Lexicon decoder with word-LM loaded in thread: 0
manojmsrit commented 4 years ago

@mironnn Can you please help me use wav2letter for inference on a sample audio file? I would like to know where you downloaded the pre-trained model and the other files, such as the lexicon and tokens files. Basically, I need information on the items below.

|-- model
|   |-- 3-gram.pruned.3e-7.bin.qt
|   |-- am
|   |   `-- librispeech-train-all-unigram-10000.tokens
|   |-- am_500ms_future_context.arch
|   |-- am_500ms_future_context_dev_other.bin
|   |-- decode_500ms_right_future_ngram_other.cfg
|   `-- decoder
|       `-- decoder-unigram-10000-nbest10.lexicon

Please help me regarding this. Thank you

Regards, Manoj

mironnn commented 4 years ago

(Quoting manojmsrit's question above.)

You can find everything here: https://github.com/facebookresearch/wav2letter/tree/master/recipes/streaming_convnets/librispeech

tlikhomanenko commented 4 years ago

@mironnn Could you confirm that the decoder works for you when the transcription is not empty? If so, we will fix the issue with empty transcriptions later (the decoder hangs because an exception is not propagated correctly to terminate the program; we are already fixing this).

For now, please use non-empty transcriptions. Sorry for the inconvenience!

vineelpratap commented 4 years ago

Hi, please make sure you are using the --show flag. Another scenario where this can happen is when there is an error while decoding inside the thread pool. You can try making the following change in Decode.cpp and Test.cpp so that the error shows up in the logs:

// before
threadPool.enqueue(...);

// after
auto fut = threadPool.enqueue(...);
fut.get();
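
(For context, here is a minimal, self-contained illustration, using std::async as a stand-in for the thread pool and not taken from wav2letter, of why calling get() on the returned future matters: an exception thrown inside a task is stored in its future and is only rethrown when get() is called, so without get() the task fails silently and the program can appear to hang with no error message.)

#include <future>
#include <iostream>
#include <stdexcept>

int main() {
  // The worker throws, as a failing decode task would.
  auto fut = std::async(std::launch::async, [] {
    throw std::runtime_error("decoding failed inside worker");
  });
  try {
    fut.get(); // rethrows the worker's exception on the calling thread
  } catch (const std::exception& e) {
    std::cerr << "worker error: " << e.what() << "\n"; // now visible in the logs
  }
  return 0;
}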