@BoneGoat If you run the Viterbi path (Test.cpp), does it work for the whole audio?
@tlikhomanenko Thanks for your reply. It looks like it works better with Test.cpp, but the output still cuts off.
root@c457cb2ad017:~/wav2letter/build# ./Test --am /root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin --maxload 10 --test /root/data/w2l-aligned-experiment/decode.csv --tokensdir /root/data/w2l-aligned-experiment/ --tokens train-4773-unigram-10000.tokens --lexicon /root/data/w2l-aligned-experiment/train-4773-unigram-10000-nbest10.lexicon --emission_dir /root/data/w2l-aligned-experiment/emission/ --show true
I0606 08:35:40.293009 282 Test.cpp:83] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=network-seq2seq_tds.arch; --archdir=/root/data/w2l-sb-ibm-aligned-experiment; --attention=keyvalue; --attentionthreshold=2147483647; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=16; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=seq2seq; --critoptim=sgd; --datadir=; --dataorder=output_spiral; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=/root/data/w2l-aligned-experiment/emission/; --emission_queue_size=3000; --enable_distributed=false; --encoderdim=512; --eosscore=0; --eostoken=true; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/data/w2l-sb-ibm-aligned-experiment/train-4773-seq2seq-tds.cfg; --framesizems=25; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=wav; --inputbinsize=25; --inputfeeding=false; --isbeamdump=false; --iter=200; --itersave=false; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=/root/data/w2l-aligned-experiment/train-4773-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.050000000000000003; --lrcosine=false; --lrcrit=0.050000000000000003; --maxdecoderoutputlen=120; --maxgradnorm=15; --maxisz=9223372036854775807; --maxload=10; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=4194304; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=none; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=99; --pcttraineval=1; --pow=false; --pretrainWindow=3; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=/root/data/w2l-sb-ibm-aligned-experiment; --runname=4773-seq2seq_tds; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --sclite=; --seed=0; --show=true; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=4; --sqnorm=false; --stepsize=40; --surround=; --tag=; --target=ltr; --test=/root/data/w2l-aligned-experiment/decode.csv; --tokens=train-4773-unigram-10000.tokens; --tokensdir=/root/data/w2l-aligned-experiment/; --train=/root/data/w2l-sb-ibm-aligned-experiment/train-4773-w2l-noquotes.csv; --trainWithWindow=true; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/root/data/w2l-sb-ibm-aligned-experiment/dev-4773-w2l-noquotes.csv; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; 
--logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0606 08:35:40.297216 282 Test.cpp:104] Number of classes (network): 9998
I0606 08:35:41.395877 282 Test.cpp:111] Number of words: 109339
I0606 08:35:41.739491 282 W2lListFilesDataset.cpp:141] 1 files found.
I0606 08:35:41.739533 282 Utils.cpp:102] Filtered 0/1 samples
I0606 08:35:41.739547 282 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 1
I0606 08:35:41.739553 282 Test.cpp:131] [Dataset] Dataset loaded.
|T|: i _ s t o r t _ s e t t _ a l l a _ g r ä n s k o n t r o l l e r _ i n o m _ e u _ k o m m e r _ a t t _ h ä v a s _ s e n a s t _ d e n _ s i s t a _ j u n i _ d e t _ b e s k e d e t _ g a v _ s v e r i g e s _ e u _ k o m m i s s i o n ä r _ y l v a _ j o h a n s s o n
|P|: i _ s t o r t _ s e t t _ a l l a _ g r ä n s k o n t r o l l e r _ i n o m _ e u _ k o m m e r _ a t t _ h ä v v a _ s e n a s t _ d e n _ s i s t a _ j u n i _ d e t _ b e s k e d e t _ g a v _ s v e r i g e s _ e u k o m m i s s i o n ä r _ y l v a _ j o h a n _ s o m _ s ä g e r _ a t t _ h o n _ f å t t _ e t t _ s t a r k t _ s t ö d _ f r å n _ i _ s t o r t _ s e t t _ s a m t l i g a _ i _ l ä n d e r _ f ö r _ a t t _ g r ä n s k a l e r n a _ s k a _ u p p h ö r a _ s e n a s t _ e n _ t r e t t i o t a b e l l _ f ö r _ n ä r _ r e s e r e r e s t r u k t i o n e r n a _ s k a _ u p p h ö r _ s e n a s t s _ e n _ t r e t t i o t a b e l l _ f ö r _ n ä r _ r e s e r e r e s t r u k t i o n e r n a _ s k a _ u p p h ö r _ s e n a s t s _ e n _ t r e t t i o t a b e l l _ f ö r _ n ä r _ r e s e r e r e s t r u k t i o n e r n a _ s k a _ u p p h ö r _ s e n a s t s _ e n _ t r e t t i o t a b e l l _ f ö r _ n ä r _ r e s e r e r e s t r u k t i o n e r n a _ s k a _ u p p h ö r _ s e n a s t s _ e n _ t r e t t i o t a b e l l _ f ö r _ n ä r _ r e s e r e r e
[sample: 1, WER: 281.818%, LER: 300.73%, total WER: 281.818%, total LER: 300.73%, progress (thread 0): 100%]
I0606 08:35:44.827239 282 Test.cpp:317] ------
I0606 08:35:44.827262 282 Test.cpp:318] [Test /root/data/w2l-aligned-experiment/decode.csv (1 samples) in 3.08763s (actual decoding time 3.09s/sample) -- WER: 281.818, LER: 300.73]
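A WER above 100%, as reported here, is possible because WER counts insertions: it is the word-level edit distance divided by the number of reference words, so a looped hypothesis with many extra words pushes it past 100%. A minimal standalone sketch of the computation (illustrative Python, not wav2letter's own code):

def wer(ref, hyp):
    r, h = ref.split(), hyp.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[len(r)][len(h)] / len(r)

print(wer("a b c", "a b c d e f g"))  # 4 insertions / 3 reference words ~ 133%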
Your output is strange. As I can see, |T| is much shorter and your prediction is much longer. If you listen to this audio, is your target transcription correct?
I see. The problem is the looping: in this test output you can see there are loops (repetitions of n-grams), which is why it is longer. This is a known problem in seq2seq. In the decoder, eosScore controls/prevents this, and at the same time it can make your sentences shorter. Try to tune eosScore; we often search in [-10, 0].
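A minimal sketch of such a sweep over eosscore (a hypothetical helper script; it assumes the Decode binary and a decode flags file like the one used later in this thread, so paths will need adjusting):

import subprocess

EOS_SCORES = [-10, -8, -5, -3, -1, 0]

for eos in EOS_SCORES:
    # Decode is the wav2letter decoder binary built alongside Test.
    cmd = [
        "./Decode",
        "--flagsfile=/root/data/w2l-aligned-experiment/decode-4773-seq2seq-tds.cfg",
        f"--eosscore={eos}",
        "--show=true",
    ]
    print(f"=== eosscore={eos} ===")
    subprocess.run(cmd, check=True)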
The transcription is correct. I've tested a few audio clips now and the problem exists for every clip I run through the decoder. I've tested some variations of eosScore; some of them are written down here. The point is that no matter which eosScore I give it, the result is the same.
I did not know that Seq2Seq does sentence prediction. The LM is trained on text without punctuation. Would that be a problem?
eosScore: -10
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni
eosScore: -8
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni
eosScore: -5
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni
eosScore: -1
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni
Having looked at the previous results, manipulating eosScore actually makes the decoder stop even sooner.
I've managed to get the decoder to stop later with this config:
--am=/root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin
--tokensdir=/root/data/w2l-aligned-experiment
--tokens=train-4773-unigram-10000.tokens
--lexicon=/root/data/w2l-aligned-experiment/train-4773-unigram-10000-nbest10.lexicon
--lm=/root/data/w2l-aligned-experiment/isidor-article-dump-191022-utf8-clean-nospecialchars_6gram_pruning_000012.bin
--datadir=/root/data/w2l-aligned-experiment
--test=decode.csv
--uselexicon=true
--sclite=/root/data/w2l-aligned-experiment/logs
--decodertype=tkn
--lmweight=1.0
--wordscore=2.0
--beamsize=100
--beamsizetoken=10
--beamthreshold=15
--hardselection=1.5
--softselection=10.0
--attentionthreshold=30
--eosscore=0.0
--silscore=0.0
--nthread_decoder=1
--smearing=max
--noresample=true
--show=true
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni d beskedet be skydda gav sveriges eu kommissionär ylva johan sång efter ett video möte med eus inrikes ministrar i dag p slutet av här h intresserade so har
Compared to the worse config:
--am=/root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin
--tokensdir=/root/data/w2l-aligned-experiment
--tokens=train-4773-unigram-10000.tokens
--lexicon=/root/data/w2l-aligned-experiment/train-4773-unigram-10000-nbest10.lexicon
--lm=/root/data/w2l-aligned-experiment/isidor-article-dump-191022-utf8-clean-nospecialchars_6gram_pruning_000012.bin
--datadir=/root/data/w2l-aligned-experiment
--test=decode.csv
--uselexicon=true
--sclite=/root/data/w2l-aligned-experiment/logs
--decodertype=tkn
--lmweight=1.0
--wordscore=2.0
--beamsize=80
--beamsizetoken=250000
--beamthreshold=7
--hardselection=1.5
--softselection=10.0
--attentionthreshold=30
--eosscore=0.0
--silscore=0.0
--nthread_decoder=1
--smearing=max
--noresample=true
--show=true
Increasing beamthreshold to 20 makes it even better:
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni di beskedet gav sveriges eu kommission skärv beskedet golf sveriges eu kommissionär u johnson efter ett video möte med eu inrikesminister i dag till sluter av här höra det s ko ha här vi bara suverän travar eller ha hann ledsen gula rimliga versaler hyllar chef vild joe som säger att hon fått ett starkt stöd från i stort sett samtliga elände för att gränskontrollerna ska upphöra senast en trettonde juni så att europeiska medborgare kan begära resa europa igen pådrivande
I continued training the AM and it's now at about 7% WER on the validation set. Using the best config for decoding I get the following results:
I0607 08:50:59.881659 1003 Decode.cpp:58] Reading flags from file /root/data/w2l-aligned-experiment/decode-4773-seq2seq-tds.cfg
I0607 08:50:59.881906 1003 Decode.cpp:75] [Network] Reading acoustic model from /root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin
I0607 08:51:00.703341 1003 Decode.cpp:79] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> output]
(0): View (-1 80 1 0)
(1): Conv2D (1->10, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(2): ReLU
(3): Dropout (0.200000)
(4): LayerNorm ( axis : { 0 1 2 } , size : -1)
(5): Time-Depth Separable Block (21, 80, 10) [800 -> 800 -> 800]
(6): Time-Depth Separable Block (21, 80, 10) [800 -> 800 -> 800]
(7): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(8): ReLU
(9): Dropout (0.200000)
(10): LayerNorm ( axis : { 0 1 2 } , size : -1)
(11): Time-Depth Separable Block (21, 80, 14) [1120 -> 1120 -> 1120]
(12): Time-Depth Separable Block (21, 80, 14) [1120 -> 1120 -> 1120]
(13): Time-Depth Separable Block (21, 80, 14) [1120 -> 1120 -> 1120]
(14): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(15): ReLU
(16): Dropout (0.200000)
(17): LayerNorm ( axis : { 0 1 2 } , size : -1)
(18): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(19): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(20): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(21): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(22): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(23): Time-Depth Separable Block (21, 80, 18) [1440 -> 1440 -> 1440]
(24): View (0 1440 1 0)
(25): Reorder (1,0,3,2)
(26): Linear (1440->1024) (with bias)
I0607 08:51:00.703388 1003 Decode.cpp:82] [Criterion] Seq2SeqCriterion
I0607 08:51:00.703419 1003 Decode.cpp:84] [Network] Number of params: 36538460
I0607 08:51:00.703438 1003 Decode.cpp:90] [Network] Updating flags from config file: /root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin
I0607 08:51:00.704461 1003 Decode.cpp:106] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/data/w2l-aligned-experiment/001_model_#root#data#w2l-sb-ibm-aligned-experiment#dev-4773-w2l-noquotes.csv.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=network-seq2seq_tds.arch; --archdir=/root/data/w2l-sb-ibm-aligned-experiment; --attention=keyvalue; --attentionthreshold=30; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=16; --beamsize=100; --beamsizetoken=10; --beamthreshold=30; --blobdata=false; --channels=1; --criterion=seq2seq; --critoptim=sgd; --datadir=/root/data/w2l-aligned-experiment; --dataorder=output_spiral; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=tkn; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=false; --encoderdim=512; --eosscore=0; --eostoken=true; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/data/w2l-aligned-experiment/decode-4773-seq2seq-tds.cfg; --framesizems=25; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=wav; --inputbinsize=25; --inputfeeding=false; --isbeamdump=false; --iter=200; --itersave=false; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=/root/data/w2l-aligned-experiment/train-4773-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=/root/data/w2l-aligned-experiment/isidor-article-dump-191022-utf8-clean-nospecialchars_6gram_pruning_000012.bin; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=1; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.050000000000000003; --lrcosine=false; --lrcrit=0.050000000000000003; --maxdecoderoutputlen=120; --maxgradnorm=15; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=4194304; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0; --netoptim=sgd; --noresample=true; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=none; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=99; --pcttraineval=1; --pow=false; --pretrainWindow=3; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=/root/data/w2l-sb-ibm-aligned-experiment; --runname=4773-seq2seq_tds; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --sclite=/root/data/w2l-aligned-experiment/logs; --seed=0; --show=true; --showletters=false; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=4; --sqnorm=false; --stepsize=40; --surround=; --tag=; --target=ltr; --test=decode.csv; --tokens=train-4773-unigram-10000.tokens; --tokensdir=/root/data/w2l-aligned-experiment; --train=/root/data/w2l-sb-ibm-aligned-experiment/train-4773-w2l-noquotes.csv; --trainWithWindow=true; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/root/data/w2l-sb-ibm-aligned-experiment/dev-4773-w2l-noquotes.csv; --weightdecay=0; --wordscore=2; --wordseparator=_; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; 
--drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I0607 08:51:00.708750 1003 Decode.cpp:127] Number of classes (network): 9998
I0607 08:51:01.906195 1003 Decode.cpp:134] Number of words: 109339
I0607 08:51:02.080243 1003 Decode.cpp:247] [Decoder] LM constructed.
I0607 08:51:03.626993 1003 Decode.cpp:271] [Decoder] Trie planted.
I0607 08:51:03.846594 1003 Decode.cpp:283] [Decoder] Trie smeared.
I0607 08:51:04.156188 1003 W2lListFilesDataset.cpp:141] 1 files found.
I0607 08:51:04.156222 1003 Utils.cpp:102] Filtered 0/1 samples
I0607 08:51:04.156236 1003 W2lListFilesDataset.cpp:62] Total batches (i.e. iters): 1
I0607 08:51:04.156512 1014 Decode.cpp:480] [Decoder] LexiconSeq2Seq decoder with token-LM loaded in thread: 0
|T|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gav sveriges eu kommissionär ylva johansson
|P|: i stort sett alla gränskontroller inom eu kommer att hävas senast den sista juni det beskedet gavs sveriges eu kommissionär ylva john sång efter ett video möte med eu inrikes ministrar i dag p beslutade eu john sång efter ett video möte med eu inrikes ministrar i dag p beslutade eu john sång efter ett video möte med eu inrikes ministrar i dag p beslutade eu john sång efter ett video möte med eu inrikes ministrar i dag p beslutade eu john sång efter ett video möte med eus inrikes ministrar i dag p beslutade eu john sång efter ett video möte med eus inrikes ministrar i dag p beslut
[sample: 1, WER: 409.091%, LER: 327.737%, slice WER: 409.091%, slice LER: 327.737%, decoded samples (thread 0): 1]
I0607 08:51:09.596994 1003 Decode.cpp:718] ------
[Decode decode.csv (1 samples) in 5.4407s (actual decoding time 2.91s/sample) -- WER: 409.091, LER: 327.737]
The transcription starts repeating until it eventually gives up. Worth noting is that it starts repeating exactly when the audio quality gets worse: the clip starts with studio-recorded audio and then goes to an interview with worse audio quality. It's not terrible, though; I have no problem hearing what is said. The decoder seems to get very confused by this transition.
Having noted this, I tested another audio clip with very good quality and the transcription still starts repeating after a while...
|P|: sverige demokraternas partiledare jim åke som riktar hård kritik mot sveriges corona strategi och kräver att stads kompromiss låg anders kriminella avgår det skriver åkesson på dagens nyheters debattera han skriver att regeringen och folkhälsomyndigheten fått flera chanser att rätta till sina misstag men att inget har hänt och att ansvariga därför bör lämna sina positioner med omedelbar verkan efter åker
|P|: s v e r i g e d e m o k r a t e r n a s _ p a r t i l e d a r e _ j m i _ å k e _ s o m _ r i k t a r _ h å r d _ k r i t i k _ m o t _ s v e r i g e s _ k o r o n a _ s t r a t e g i _ o c h _ k r ä v e r _ a t t _ s t a d s e p i d e m i _ l å g _ a n d e r s _ g n ä l l _ a v g å r _ d e t _ s k r i v e r _ å k e s s o n _ p å _ d a g e n s _ n y h e t e r s _ d e b a t t s i d a _ h a n _ s k r i v e r _ a t t _ r e g e r i n g e n _ a l l t i d _ f a t t a t _ s i n a _ b e s l u t _ u t i f r å n _ d e t _ k u n s k a p s l ä g e _ s o m _ f u n n i t s _ o c h _ a t t _ m a n _ k o m m e r _ a t t _ k o r r i g e r a _ s t r a t e g i n _ v i d _ b e h o v _ e k o t _ c h a r l o t t a _ f o l k e r g i _ o c h _ k r ä v e r _ a t t _ s t a d s e p i d e m i _ l å g _ a n d e r s _ g n ä l l _ a v g å r _ d e t _ s k r i v e r _ å k e s s o n _ p å _ d a g e n s _ n y h e t e r s _ d e b a t t s i d a _ h a n _ s k r i v e r _ a t t _ r e g e r i n g e n _ a l l t i d _ f a t t a t _ s i n a _ b e s l u t _ u t i f r å n _ d e t _ k u n s k a p s l ä g e _ s o m _ f u n n i t s _ o c h _ a t t _ m a n _ k o m m e r
I guess this is what you meant by looping n-grams. So if this is a known problem with Seq2Seq, is there a way to solve it?
About your decoder parameters: you can remove these entirely (we removed these flags from the decoder):
--hardselection=1.5
--softselection=10.0
And for seq2seq this one is not used:
--silscore=0.0
For making sentences shorter (to prevent looping) we have eosscore, as I said before. beamthreshold is a heuristic for filtering out current hypotheses whose scores fall far below the current best hypothesis, so increasing it weakens the filtering; that is why longer hypotheses can survive. For sure try to play with it. Another parameter with influence is --attentionthreshold=30: it defines the window for the attention (if it is infinity, the model looks at the whole sentence; if it is small, the window is smaller). Possibly you need to play with it too; I would guess the attention itself has a problem when it sees a part of the audio with a different quality than in training. About beamsizetoken: this is for speedup and for restricting the search; only the top tokens by emission score are considered when building hypotheses (it doesn't influence looping or early stopping, it is only for speed).
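A minimal sketch of the score-gap filtering that beamthreshold performs, as described above (illustrative Python only, not wav2letter's actual decoder code):

def prune_beam(hyps, beam_threshold, beam_size):
    # Each hypothesis is a (score, tokens) pair.
    best = max(score for score, _ in hyps)
    # Drop hypotheses whose score falls too far below the current best;
    # a larger beam_threshold keeps more (e.g. longer, lower-scoring) hyps.
    survivors = [(s, t) for s, t in hyps if s >= best - beam_threshold]
    # Then keep at most beam_size of the survivors, best first.
    survivors.sort(key=lambda h: h[0], reverse=True)
    return survivors[:beam_size]

hyps = [(-1.2, ["i", "stort"]), (-9.8, ["i", "sta"]), (-25.0, ["e"])]
print(prune_beam(hyps, beam_threshold=15, beam_size=2))
# With beam_threshold=15 the -25.0 hypothesis is filtered out; raising the
# threshold to 30 would let it survive the gap check.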
I guess this is what you meant by looping n-grams. So if this is a known problem with Seq2Seq, is there a way to solve it?
Yep, this is not a totally solved problem; in the decoder people add penalties to prevent it, like we do with eosScore. That cannot fix every sentence. Try to decode the full set you have and check the final WER and how many sentences come out much longer or much shorter.
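One rough way to flag looped or overlong decodes when checking the full set, as suggested above (an illustrative sketch; the n-gram size and thresholds are arbitrary choices):

from collections import Counter

def looks_looped(hyp_words, ref_words, ngram=4, max_repeats=2, max_ratio=1.5):
    # Flag if any word n-gram repeats too often in the hypothesis...
    ngrams = Counter(
        tuple(hyp_words[i:i + ngram]) for i in range(len(hyp_words) - ngram + 1)
    )
    repeated = any(c > max_repeats for c in ngrams.values())
    # ...or if the hypothesis is much longer than the reference.
    too_long = len(hyp_words) > max_ratio * max(len(ref_words), 1)
    return repeated or too_long

ref = "i stort sett alla gränskontroller inom eu kommer att hävas".split()
hyp = ("i stort sett alla " + "för när reserestriktionerna ska upphöra " * 5).split()
print(looks_looped(hyp, ref))  # True: the repeated 5-word loop is caught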
I would also try to look at the attention values for the audios where the quality switches to worse. As I can see, the overall quality you have (Viterbi) is not bad, so you should have a small number of samples with this effect (but this is normal in current life =) ). Some analysis of the badly recognized sentences could be done here; possibly you can add more augmentation, add more samples of this type, or add heuristics on the attention to improve particular sentences. I don't have a ready recipe. You can google seq2seq papers; people will surely mention this problem and their penalties/heuristics/etc. (this all depends on the loss function, model type, and decoder they use).
Thank you very much for your help.
I'm trying to transcribe an audio clip about 2 minutes long with my Seq2Seq model. The transcription is almost perfect, but it stops after a few seconds. How can I decode the entire clip?
I've tested a few clips and it seems that the decoding stops when the audio quality changes. W2L seems to stop decoding when going from studio-recorded audio to a phone interview. Is there a way to keep decoding?