@BoneGoat If you run the Viterbi path (Test.cpp), does it work for the whole audio?
@tlikhomanenko Thanks for your reply. It looks like it works better with Test.cpp, but the output still cuts off.
root@c457cb2ad017:~/wav2letter/build# ./Test --am /root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin --maxload 10 --test /root/data/w2l-aligned-experiment/decode.csv --tokensdir /root/data/w2l-aligned-experiment/ --tokens train-4773-unigram-10000.tokens --lexicon /root/data/w2l-aligned-experiment/train-4773-unigram-10000-nbest10.lexicon --emission_dir /root/data/w2l-aligned-experiment/emission/ --show true
I0606 08:35:40.293009 282 Test.cpp:83] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=network-seq2seq_tds.arch; --archdir=/root/data/w2l-sb-ibm-aligned-experiment; --attention=keyvalue; --attentionthreshold=2147483647; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=16; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=seq2seq; --critoptim=sgd; --datadir=; --dataorder=output_spiral; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=/root/data/w2l-aligned-experiment/emission/; --emission_queue_size=3000; --enable_distributed=false; --encoderdim=512; --eosscore=0; --eostoken=true; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/data/w2l-sb-ibm-aligned-experiment/train-4773-seq2seq-tds.cfg; --framesizems=25; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=wav; --inputbinsize=25; --inputfeeding=false; --isbeamdump=false; --iter=200; --itersave=false; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=/root/data/w2l-aligned-experiment/train-4773-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.050000000000000003; --lrcosine=false; --lrcrit=0.050000000000000003; --maxdecoderoutputlen=120; --maxgradnorm=15; --maxisz=9223372036854775807; --maxload=10; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=4194304; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=none; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=99; --pcttraineval=1; --pow=false; --pretrainWindow=3; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=/root/data/w2l-sb-ibm-aligned-experiment; --runname=4773-seq2seq_tds; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --sclite=; --seed=0; --show=true; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=4; --sqnorm=false; --stepsize=40; --surround=; --tag=; --target=ltr; --test=/root/data/w2l-aligned-experiment/decode.csv; --tokens=train-4773-unigram-10000.tokens; --tokensdir=/root/data/w2l-aligned-experiment/; --train=/root/data/w2l-sb-ibm-aligned-experiment/train-4773-w2l-noquotes.csv; --trainWithWindow=true; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/root/data/w2l-sb-ibm-aligned-experiment/dev-4773-w2l-noquotes.csv; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; 
--logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0606 08:35:40.297216 282 Test.cpp:104] Number of classes (network): 9998
I0606 08:35:41.395877 282 Test.cpp:111] Number of words: 109339
I0606 08:35:41.739491 282 W2lListFilesDataset.cpp:141] 1 files found.
I0606 08:35:41.739533 282 Utils.cpp:102] Filtered 0/1 samples
I0606 08:35:41.739547 282 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 1
I0606 08:35:41.739553 282 Test.cpp:131] [Dataset] Dataset loaded.
|T|: i _ s t o r t _ s e t t _ a l l a _ g r ä n s k o n t r o l l e r _ i n o m _ e u _ k o m m e r _ a t t _ h ä v a s _ s e n a s t _ d e n _ s i s t a _ j u n i _ d e t _ b e s k e d e t _ g a v _ s v e r i g e s _ e u _ k o m m i s s i o n ä r _ y l v a _ j o h a n s s o n
|P|: i _ s t o r t _ s e t t _ a l l a _ g r ä n s k o n t r o l l e r _ i n o m _ e u _ k o m m e r _ a t t _ h ä v v a _ s e n a s t _ d e n _ s i s t a _ j u n i _ d e t _ b e s k e d e t _ g a v _ s v e r i g e s _ e u k o m m i s s i o n ä r _ y l v a _ j o h a n _ s o m _ s ä g e r _ a t t _ h o n _ f å t t _ e t t _ s t a r k t _ s t ö d _ f r å n _ i _ s t o r t _ s e t t _ s a m t l i g a _ i _ l ä n d e r _ f ö r _ a t t _ g r ä n s k a l e r n a _ s k a _ u p p h ö r a _ s e n a s t _ e n _ t r e t t i o t a b e l l _ f ö r _ n ä r _ r e s e r e r e s t r u k t i o n e r n a _ s k a _ u p p h ö r _ s e n a s t s _ e n _ t r e t t i o t a b e l l _ f ö r _ n ä r _ r e s e r e r e s t r u k t i o n e r n a _ s k a _ u p p h ö r _ s e n a s t s _ e n _ t r e t t i o t a b e l l _ f ö r _ n ä r _ r e s e r e r e s t r u k t i o n e r n a _ s k a _ u p p h ö r _ s e n a s t s _ e n _ t r e t t i o t a b e l l _ f ö r _ n ä r _ r e s e r e r e s t r u k t i o n e r n a _ s k a _ u p p h ö r _ s e n a s t s _ e n _ t r e t t i o t a b e l l _ f ö r _ n ä r _ r e s e r e r e
[sample: 1, WER: 281.818%, LER: 300.73%, total WER: 281.818%, total LER: 300.73%, progress (thread 0): 100%]
I0606 08:35:44.827239 282 Test.cpp:317] ------
I0606 08:35:44.827262 282 Test.cpp:318] [Test /root/data/w2l-aligned-experiment/decode.csv (1 samples) in 3.08763s (actual decoding time 3.09s/sample) -- WER: 281.818, LER: 300.73]
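A WER above 100%, as reported here, is possible because WER counts insertions: it is the word-level edit distance divided by the number of reference words, so a looped hypothesis with many extra words pushes it past 100%. A minimal standalone sketch of the computation (illustrative Python, not wav2letter's own code):

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / len(r)

print(wer("a b c", "a b c d e f g"))  # 4 insertions / 3 reference words ~ 133%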
Your output is strange. As I can see, |T| is much shorter and your prediction is much longer. If you listen to this audio, is your target transcription correct?
I see. The problem is the looping: in this test output you can see there are loops (repetitions of n-grams), which is why it is longer. This is a known problem in seq2seq. In the decoder, eosScore controls/prevents this, and at the same time it can make your sentences shorter. Try to tune eosScore; we often search in [-10, 0].
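A minimal sketch of such a sweep over eosscore (a hypothetical helper script; it assumes the Decode binary and a decode flags file like the one used later in this thread, so paths will need adjusting):

import subprocess

EOS_SCORES = [-10, -8, -5, -3, -1, 0]

for eos in EOS_SCORES:
    # Decode is the wav2letter decoder binary built alongside Test.
    cmd = [
        "./Decode",
        "--flagsfile=/root/data/w2l-aligned-experiment/decode-4773-seq2seq-tds.cfg",
        f"--eosscore={eos}",
        "--show=true",
    ]
    print(f"=== eosscore={eos} ===")
    subprocess.run(cmd, check=True)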
The transcription is correct. I've tested a few audio clips now and the problem exists for every clip I run through the decoder. I've tested some variations of eosScore; some of them are written down here. The point is that no matter which eosScore I give it, the result is the same.
I did not know that Seq2Seq does sentence prediction. The LM is trained on text without punctuation. Would that be a problem?
eosScore: -10
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni
eosScore: -8
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni
eosScore: -5
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni
eosScore: -1
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni
Having looked at the previous results, manipulating eosScore actually makes the decoder stop even sooner.
I've managed to get the decoder to stop later with this config:
--am=/root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin
--tokensdir=/root/data/w2l-aligned-experiment
--tokens=train-4773-unigram-10000.tokens
--lexicon=/root/data/w2l-aligned-experiment/train-4773-unigram-10000-nbest10.lexicon
--lm=/root/data/w2l-aligned-experiment/isidor-article-dump-191022-utf8-clean-nospecialchars_6gram_pruning_000012.bin
--datadir=/root/data/w2l-aligned-experiment
--test=decode.csv
--uselexicon=true
--sclite=/root/data/w2l-aligned-experiment/logs
--decodertype=tkn
--lmweight=1.0
--wordscore=2.0
--beamsize=100
--beamsizetoken=10
--beamthreshold=15
--hardselection=1.5
--softselection=10.0
--attentionthreshold=30
--eosscore=0.0
--silscore=0.0
--nthread_decoder=1
--smearing=max
--noresample=true
--show=true
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni d beskedet be skydda gav sveriges eu kommissionär ylva johan sång efter ett video möte med eus inrikes ministrar i dag p slutet av här h intresserade so har
Compared to the worse config:
--am=/root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin
--tokensdir=/root/data/w2l-aligned-experiment
--tokens=train-4773-unigram-10000.tokens
--lexicon=/root/data/w2l-aligned-experiment/train-4773-unigram-10000-nbest10.lexicon
--lm=/root/data/w2l-aligned-experiment/isidor-article-dump-191022-utf8-clean-nospecialchars_6gram_pruning_000012.bin
--datadir=/root/data/w2l-aligned-experiment
--test=decode.csv
--uselexicon=true
--sclite=/root/data/w2l-aligned-experiment/logs
--decodertype=tkn
--lmweight=1.0
--wordscore=2.0
--beamsize=80
--beamsizetoken=250000
--beamthreshold=7
--hardselection=1.5
--softselection=10.0
--attentionthreshold=30
--eosscore=0.0
--silscore=0.0
--nthread_decoder=1
--smearing=max
--noresample=true
--show=true
Increasing beamthreshold to 20 makes it even better:
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni di beskedet gav sveriges eu kommission skärv beskedet golf sveriges eu kommissionär u johnson efter ett video möte med eu inrikesminister i dag till sluter av här höra det s ko ha här vi bara suverän travar eller ha hann ledsen gula rimliga versaler hyllar chef vild joe som säger att hon fått ett starkt stöd från i stort sett samtliga elände för att gränskontrollerna ska upphöra senast en trettonde juni så att europeiska medborgare kan begära resa europa igen pådrivande
I continued training the AM and it's now at about 7% WER on the validation set. Using the best config for decoding I get the following results:
I0607 08:50:59.881659 1003 Decode.cpp:58] Reading flags from file /root/data/w2l-aligned-experiment/decode-4773-seq2seq-tds.cfg
I0607 08:50:59.881906 1003 Decode.cpp:75] [Network] Reading acoustic model from /root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin
I0607 08:51:00.703341 1003 Decode.cpp:79] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> output]
(0): View (-1 80 1 0)
(1): Conv2D (1->10, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(2): ReLU
(3): Dropout (0.200000)
(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
(5): Time-Depth Separable Block (21, 80, 10) [800 -> 800 -> 800]
(6): Time-Depth Separable Block (21, 80, 10) [800 -> 800 -> 800]
(7): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(8): ReLU
(9): Dropout (0.200000)
(10): LayerNorm ( axis : { 0 1 2 } , size : -1)
(11): Time-Depth Separable Block (21, 80, 14) [1120 -> 1120 -> 1120]
(12): Time-Depth Separable Block (21, 80, 14) [1120 -> 1120 -> 1120]
(13): Time-Depth Separable Block (21, 80, 14) [1120 -> 1120 -> 1120]
(14): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(15): ReLU
(16): Dropout (0.200000)
(17): LayerNorm ( axis : { 0 1 2 } , size : -1)
(18): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(19): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(20): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(21): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(22): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(23): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(24): View (0 1440 1 0)
(25): Reorder (1,0,3,2)
(26): Linear (1440->1024) (with bias)
I0607 08:51:00.703388 1003 Decode.cpp:82] [Criterion] Seq2SeqCriterion
I0607 08:51:00.703419 1003 Decode.cpp:84] [Network] Number of params: 36538460
I0607 08:51:00.703438 1003 Decode.cpp:90] [Network] Updating flags from config file: /root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin
I0607 08:51:00.704461 1003 Decode.cpp:106] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=network-seq2seq_tds.arch; --archdir=/root/data/w2l-sb-ibm-aligned-experiment; --attention=keyvalue; --attentionthreshold=30; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=16; --beamsize=100; --beamsizetoken=10; --beamthreshold=30; --blobdata=false; --channels=1; --criterion=seq2seq; --critoptim=sgd; --datadir=/root/data/w2l-aligned-experiment; --dataorder=output_spiral; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=tkn; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=false; --encoderdim=512; --eosscore=0; --eostoken=true; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/data/w2l-aligned-experiment/decode-4773-seq2seq-tds.cfg; --framesizems=25; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=wav; --inputbinsize=25; --inputfeeding=false; --isbeamdump=false; --iter=200; --itersave=false; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=/root/data/w2l-aligned-experiment/train-4773-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/root/data/w2l-aligned-experiment/isidor-article-dump-191022-utf8-clean-nospecialchars_6gram_pruning_000012.bin; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=1; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.050000000000000003; --lrcosine=false; --lrcrit=0.050000000000000003; --maxdecoderoutputlen=120; --maxgradnorm=15; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=4194304; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0; --netoptim=sgd; --noresample=true; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=none; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=99; --pcttraineval=1; --pow=false; --pretrainWindow=3; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=/root/data/w2l-sb-ibm-aligned-experiment; --runname=4773-seq2seq_tds; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --sclite=/root/data/w2l-aligned-experiment/logs; --seed=0; --show=true; --showletters=false; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=4; --sqnorm=false; --stepsize=40; --surround=; --tag=; --target=ltr; --test=decode.csv; --tokens=train-4773-unigram-10000.tokens; --tokensdir=/root/data/w2l-aligned-experiment; --train=/root/data/w2l-sb-ibm-aligned-experiment/train-4773-w2l-noquotes.csv; --trainWithWindow=true; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/root/data/w2l-sb-ibm-aligned-experiment/dev-4773-w2l-noquotes.csv; --weightdecay=0; --wordscore=2; --wordseparator=_; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; 
--drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0607 08:51:00.708750 1003 Decode.cpp:127] Number of classes (network): 9998
I0607 08:51:01.906195 1003 Decode.cpp:134] Number of words: 109339
I0607 08:51:02.080243 1003 Decode.cpp:247] [Decoder] LM constructed.
I0607 08:51:03.626993 1003 Decode.cpp:271] [Decoder] Trie planted.
I0607 08:51:03.846594 1003 Decode.cpp:283] [Decoder] Trie smeared.
I0607 08:51:04.156188 1003 W2lListFilesDataset.cpp:141] 1 files found.
I0607 08:51:04.156222 1003 Utils.cpp:102] Filtered 0/1 samples
I0607 08:51:04.156236 1003 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 1
I0607 08:51:04.156512 1014 Decode.cpp:480] [Decoder] LexiconSeq2Seq decoder with token-LM loaded in thread: 0
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gavs sveriges eu kommissionär ylva john sång efter ett video möte med eu inrikes ministrar i dag p beslutade eu john sång efter ett video möte med eu inrikes ministrar i dag p beslutade eu john sång efter ett video möte med eu inrikes ministrar i dag p beslutade eu john sång efter ett video möte med eu inrikes ministrar i dag p beslutade eu john sång efter ett video möte med eus inrikes ministrar i dag p beslutade eu john sång efter ett video möte med eus inrikes ministrar i dag p beslut
[sample: 1, WER: 409.091%, LER: 327.737%, slice WER: 409.091%, slice LER: 327.737%, decoded samples (thread 0): 1]
I0607 08:51:09.596994 1003 Decode.cpp:718] ------
[Decode decode.csv (1 samples) in 5.4407s (actual decoding time 2.91s/sample) -- WER: 409.091, LER: 327.737]
The transcription starts repeating until it eventually gives up. Worth noting is that it starts repeating exactly when the audio quality gets worse: the clip starts with studio-recorded audio and then goes to an interview with worse audio quality. It's not terrible, though; I have no problem hearing what is said. The decoder seems to get very confused by this transition.
Having noted this, I tested another audio clip with very good quality and the transcription still starts repeating after a while...
|P|: sverige demokraternas partiledare jim åke som riktar hård kritik mot sveriges corona strategi och kräver att stads kompromiss låg anders kriminella avgår det skriver åkesson på dagens nyheters debattera han skriver att regeringen och folkhälsomyndigheten fått flera chanser att rätta till sina misstag men att inget har hänt och att ansvariga därför bör lämna sina positioner med omedelbar verkan efter åker
|P|: s v e r i g e d e m o k r a t e r n a s _ p a r t i l e d a r e _ j m i _ å k e _ s o m _ r i k t a r _ h å r d _ k r i t i k _ m o t _ s v e r i g e s _ k o r o n a _ s t r a t e g i _ o c h _ k r ä v e r _ a t t _ s t a d s e p i d e m i _ l å g _ a n d e r s _ g n ä l l _ a v g å r _ d e t _ s k r i v e r _ å k e s s o n _ p å _ d a g e n s _ n y h e t e r s _ d e b a t t s i d a _ h a n _ s k r i v e r _ a t t _ r e g e r i n g e n _ a l l t i d _ f a t t a t _ s i n a _ b e s l u t _ u t i f r å n _ d e t _ k u n s k a p s l ä g e _ s o m _ f u n n i t s _ o c h _ a t t _ m a n _ k o m m e r _ a t t _ k o r r i g e r a _ s t r a t e g i n _ v i d _ b e h o v _ e k o t _ c h a r l o t t a _ f o l k e r g i _ o c h _ k r ä v e r _ a t t _ s t a d s e p i d e m i _ l å g _ a n d e r s _ g n ä l l _ a v g å r _ d e t _ s k r i v e r _ å k e s s o n _ p å _ d a g e n s _ n y h e t e r s _ d e b a t t s i d a _ h a n _ s k r i v e r _ a t t _ r e g e r i n g e n _ a l l t i d _ f a t t a t _ s i n a _ b e s l u t _ u t i f r å n _ d e t _ k u n s k a p s l ä g e _ s o m _ f u n n i t s _ o c h _ a t t _ m a n _ k o m m e r
I guess this is what you meant by looping n-grams. So if this is a known problem with Seq2Seq, is there a way to solve it?
About your decoder parameters: you can remove these entirely (we removed these flags from the decoder):
--hardselection=1.5
--softselection=10.0
And for seq2seq this one is not used:
--silscore=0.0
For making sentences shorter (to prevent looping) we have eosscore, as I said before. beamthreshold is a heuristic for filtering out current hypotheses whose scores fall far below the current best hypothesis, so increasing it weakens the filtering; that is why longer hypotheses can survive. For sure try to play with it. Another parameter with influence is --attentionthreshold=30: it defines the window for the attention (if it is infinity, the model looks at the whole sentence; if it is small, the window is smaller). Possibly you need to play with it too; I would guess the attention itself has a problem when it sees a part of the audio with a different quality than in training. About beamsizetoken: this is for speedup and for restricting the search; only the top tokens by emission score are considered when building hypotheses (it doesn't influence looping or early stopping, it is only for speed).
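A minimal sketch of the score-gap filtering that beamthreshold performs, as described above (illustrative Python only, not wav2letter's actual decoder code):

def prune_beam(hyps, beam_threshold, beam_size):
    # Each hypothesis is a (score, tokens) pair.
    best = max(score for score, _ in hyps)
    # Drop hypotheses whose score falls too far below the current best;
    # a larger beam_threshold keeps more (e.g. longer, lower-scoring) hyps.
    survivors = [(s, t) for s, t in hyps if s >= best - beam_threshold]
    # Then keep at most beam_size of the survivors, best first.
    survivors.sort(key=lambda h: h[0], reverse=True)
    return survivors[:beam_size]

hyps = [(-1.2, ["i", "stort"]), (-9.8, ["i", "sta"]), (-25.0, ["e"])]
print(prune_beam(hyps, beam_threshold=15, beam_size=2))
# With beam_threshold=15 the -25.0 hypothesis is filtered out; raising the
# threshold to 30 would let it survive the gap check.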
I guess this is what you meant by looping n-grams. So if this is a known problem with Seq2Seq, is there a way to solve it?
Yep, this is not a totally solved problem; in the decoder people add penalties to prevent it, like we do with eosScore. That cannot fix every sentence. Try to decode the full set you have and check the final WER and how many sentences come out much longer or much shorter.
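One rough way to flag looped or overlong decodes when checking the full set, as suggested above (an illustrative sketch; the n-gram size and thresholds are arbitrary choices):

from collections import Counter

def looks_looped(hyp_words, ref_words, ngram=4, max_repeats=2, max_ratio=1.5):
    # Flag if any word n-gram repeats too often in the hypothesis...
    ngrams = Counter(
        tuple(hyp_words[i:i + ngram]) for i in range(len(hyp_words) - ngram + 1)
    )
    repeated = any(c > max_repeats for c in ngrams.values())
    # ...or if the hypothesis is much longer than the reference.
    too_long = len(hyp_words) > max_ratio * max(len(ref_words), 1)
    return repeated or too_long

ref = "i stort sett alla gränskontroller inom eu kommer att hävas".split()
hyp = ("i stort sett alla " + "för när reserestriktionerna ska upphöra " * 5).split()
print(looks_looped(hyp, ref))  # True: the repeated 5-word loop is caught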
I would also try to look at the attention values for the audios where the quality switches to worse. As I can see, the overall quality you have (Viterbi) is not bad, so you should have a small number of samples with this effect (but this is normal in current life =) ). Some analysis of the badly recognized sentences could be done here; possibly you can add more augmentation, add more samples of this type, or add heuristics on the attention to improve particular sentences. I don't have a ready recipe. You can google seq2seq papers; people will surely mention this problem and their penalties/heuristics/etc. (this all depends on the loss function, model type, and decoder they use).
Thank you very much for your help.
I'm trying to transcribe an audio clip about 2 minutes long with my Seq2Seq model. The transcription is almost perfect, but it stops after a few seconds. How can I decode the entire clip?
I've tested a few clips and it seems that the decoding stops when the audio quality changes. W2L seems to stop decoding when going from studio-recorded audio to a phone interview. Is there a way to keep decoding?