flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Getting "call" or "can e can al" as an output for an empty audio file with the trained Acoustic Model #966

Closed vchagari closed 3 years ago

vchagari commented 3 years ago

Issue: I used the 70M-parameter AM as the base model and trained it on a set of synthetic data. When I run the decoder with the trained AM, I get "call" or "can e can al" as output even though the test audio file is completely empty (no words in the audio, just silence).

Note: When I run the decoder with the 70M-parameter AM (the base model) on the test audio file (a silent WAV file), I get the correct, empty output. When I use the trained AM .bin, I get the wrong, non-empty output. See below for more details.
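For anyone trying to reproduce this, a silent test file can be generated with Python's standard `wave` module. This is only a minimal sketch: it assumes digital silence (all-zero samples), 16-bit PCM, mono, and 16 kHz to match `--samplerate=16000` and `--channels=1` in the flags below. The helper name `write_silence_wav` is made up for illustration, and the original `silence_2s.wav` may well contain low-level noise rather than exact zeros.

```python
import wave


def write_silence_wav(path, seconds=2.0, sample_rate=16000):
    """Write a mono, 16-bit PCM WAV file that contains only zero samples.

    Returns the number of frames written.
    """
    n_frames = int(seconds * sample_rate)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono, matching --channels=1
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit PCM (assumption)
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * n_frames)
    return n_frames


# Example call (the file name mirrors the sample in the decoder log):
# write_silence_wav("silence_2s.wav", seconds=2.0)
```

Testing with both exact zeros and a recording of real background noise may help tell whether the trained model reacts to silence in general or only to one specific file.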

Details:

  1. Base AM: am_transformer_ctc_stride3_letters_70Mparams.bin Location: https://github.com/facebookresearch/wav2letter/tree/master/recipes/rasr
  2. Arch: https://dl.fbaipublicfiles.com/wav2letter/rasr/tutorial/am_transformer_ctc_stride3_letters_70Mparams.arch

Training set: Data contains some random names and commands, like the following: "call justin", "add contact michael", "call vincent", "add nancy to the group", "call daddy", "say hello to willam", "call 123456789", "greet justin".

More Details:

Training: Cmd: /data/flashlight_master/flashlight/build/bin/asr/fl_asr_tutorial_finetune_ctc /data/synth_data/am_transformer_ctc_stride3_letters_70Mparams.bin --flagsfile /data/synth_data/train.cfg Config: { "config": [ { "key": "runIdx", "value": "1" }, { "key": "timestamp", "value": "2021-04-13, 2021-04-13" }, { "key": "hostname", "value": "" }, { "key": "runPath", "value": "/data/synth_data/rundir" }, { "key": "username", "value": "test" }, { "key": "gflags", "value": "--flagfile=\n--fromenv=\n--tryfromenv=\n--undefok=\n--tab_completion_columns=80\n--tab_completion_word=\n--help=false\n--helpfull=false\n--helpmatch=\n--helpon=\n--helppackage=false\n--helpshort=false\n--helpxml=false\n--version=false\n--adambeta1=0.94999999999999996\n--adambeta2=0.98999999999999999\n--am=\n--am_decoder_tr_dropout=0.20000000000000001\n--am_decoder_tr_layerdrop=0.20000000000000001\n--am_decoder_tr_layers=6\n--arch=/data/synth_data/arch.txt\n--attention=keyvalue\n--attentionthreshold=2147483647\n--attnWindow=softPretrain\n--attnconvchannel=0\n--attnconvkernel=0\n--attndim=0\n--batching_max_duration=0\n--batching_strategy=none\n--batchsize=4\n--beamsize=2500\n--beamsizetoken=250000\n--beamthreshold=25\n--channels=1\n--criterion=ctc\n--critoptim=adagrad\n--datadir=/data/synth_data/lists\n--decoderattnround=1\n--decoderdropout=0\n--decoderrnnlayer=1\n--decodertype=wrd\n--devwin=0\n--emission_dir=\n--emission_queue_size=3000\n--enable_distributed=true\n--encoderdim=256\n--eosscore=0\n--everstoredb=false\n--features_type=mfsc\n--fftcachesize=1\n--filterbanks=80\n--fl_amp_max_scale_factor=32000\n--fl_amp_scale_factor=4096\n--fl_amp_scale_factor_update_interval=2000\n--fl_amp_use_mixed_precision=false\n--fl_benchmark_mode=true\n--fl_log_level=\n--fl_log_mem_ops_interval=0\n--fl_optim_mode=\n--fl_vlog_level=0\n--flagsfile=/data/synth_data/train.cfg\n--framesizems=25\n--framestridems=10\n--gamma=1\n--gumbeltemperature=1\n--highfreqfilterbank=-1\n--inputfeeding=false\n--isbeamdump=false\n--ite
r=100000000\n--itersave=true\n--labelsmooth=0.050000000000000003\n--leftWindowSize=50\n--lexicon=/data/synth_data/lexicon_complete_data.txt\n--linlr=-1\n--linlrcrit=-1\n--linseg=0\n--lm=\n--lm_memory=5000\n--lm_vocab=\n--lmtype=kenlm\n--lmweight=0\n--lmweight_high=4\n--lmweight_low=0\n--lmweight_step=0.20000000000000001\n--localnrmlleftctx=0\n--localnrmlrightctx=0\n--logadd=false\n--lowfreqfilterbank=0\n--lr=0.025000000000000001\n--lr_decay=100\n--lr_decay_step=50\n--lrcosine=false\n--lrcrit=0.02\n--max_devices_per_node=8\n--maxdecoderoutputlen=400\n--maxgradnorm=0.10000000000000001\n--maxload=-1\n--maxrate=10\n--maxsil=50\n--maxword=-1\n--melfloor=1\n--mfcccoeffs=13\n--minrate=3\n--minsil=0\n--momentum=0.80000000000000004\n--netoptim=sgd\n--nthread=8\n--nthread_decoder=1\n--nthread_decoder_am_forward=1\n--numattnhead=8\n--onorm=target\n--optimepsilon=1e-08\n--optimrho=0.90000000000000002\n--pctteacherforcing=99\n--pcttraineval=1\n--pretrainWindow=0\n--replabel=0\n--reportiters=1000\n--rightWindowSize=50\n--rndv_filepath=\n--rundir=/data/synth_data/rundir\n--samplerate=16000\n--sampletarget=0.01\n--samplingstrategy=rand\n--saug_fmaskf=30\n--saug_fmaskn=2\n--saug_start_update=24000\n--saug_tmaskn=10\n--saug_tmaskp=0.050000000000000003\n--saug_tmaskt=30\n--sclite=\n--seed=0\n--sfx_config=\n--sfx_start_update=2147483647\n--show=false\n--showletters=false\n--silscore=0\n--smearing=none\n--smoothingtemperature=1\n--softwoffset=10\n--softwrate=5\n--softwstd=4\n--sqnorm=true\n--stepsize=9223372036854775807\n--surround=\n--test=\n--tokens=/data/synth_data/tokens.txt\n--train=180k_new_us1us2us3_newus1us2_nocall_train.lst\n--trainWithWindow=true\n--transdiag=0\n--unkscore=-inf\n--use_memcache=false\n--uselexicon=true\n--usewordpiece=false\n--valid=180k_new_us1us2us3_newus1us2_nocall_dev.lst\n--validbatchsize=-1\n--warmup=48000\n--weightdecay=0\n--wordscore=0\n--wordseparator=|\n--world_rank=0\n--world_size=1\n--alsologtoemail=\n--alsologtostderr=false\n--colorlogtostderr=fals
e\n--drop_log_memory=true\n--log_backtrace_at=\n--log_dir=\n--log_link=\n--log_prefix=true\n--logbuflevel=0\n--logbufsecs=30\n--logemaillevel=999\n--logfile_mode=436\n--logmailer=/bin/mail\n--logtostderr=true\n--max_log_size=1800\n--minloglevel=0\n--stderrthreshold=2\n--stop_logging_if_full_disk=false\n--symbolize_stacktrace=true\n--v=0\n--vmodule=\n" }, { "key": "commandline", "value": "/data/flashlight_master/flashlight/build/bin/asr/fl_asr_tutorial_finetune_ctc /data/synth_data/am_transformer_ctc_stride3_letters_70Mparams.bin --flagsfile /data/synth_data/train.cfg" }, { "key": "programname", "value": "/data/flashlight_master/flashlight/build/bin/asr/fl_asr_tutorial_finetune_ctc" } ]
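The flag dumps in this report are hard to compare by eye. The sketch below shows one way to turn a `--key=value` dump into a dictionary and list the flags that differ between two runs. The helper names `parse_gflags` and `diff_flags` are hypothetical, the input strings are short excerpts of values that appear in this report, and the separator handling (newline in the training config, `;` in the decoder log) is an assumption based on how the dumps are printed here.

```python
def parse_gflags(blob, sep="\n"):
    """Turn a '--key=value' dump into a dict; entries without '=' are skipped."""
    flags = {}
    for item in blob.split(sep):
        item = item.strip()
        if not item.startswith("--") or "=" not in item:
            continue
        key, value = item[2:].split("=", 1)
        flags[key] = value
    return flags


def diff_flags(a, b):
    """Return {key: (value_in_a, value_in_b)} for every key whose values differ."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Short excerpts only; the full dumps would be pasted in the same way.
train = parse_gflags("--beamsize=2500\n--lmweight=0\n--smearing=none\n--criterion=ctc\n")
decode = parse_gflags(
    "--beamsize=10000; --lmweight=2; --smearing=max; --criterion=ctc; ", sep=";"
)
print(diff_flags(train, decode))
```

On these excerpts the diff reports `beamsize`, `lmweight`, and `smearing`, while `criterion` is identical in both runs and is left out.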

Decoder: Cmd: /data/flashlight_master/flashlight/build/bin/asr/fl_asr_decode --am /data/synth_data/rundir/001_model_iter_003.bin --test test.lst --lexicon /data/synth_data/lexicon.txt --tokens /data/synth_data/tokens.txt --lm /data/synth_data/lm_03.arpa --beamsize 10000 --beamthreshold 30 --lmweight 2 --beamsizetoken 25 --nthread_decoder 8 --lmtype kenlm --wordscore 0 --eosscore 0 --silscore 0 --unkscore -Infinity --smearing max --uselexicon true --datadir /data/synth_data --show --showletters

I0414 17:46:51.280470 28352 CachingMemoryManager.cpp:114 CachingMemoryManager recyclingSizeLimit=18446744073709551615 (16777216.00 TiB) splitSizeLimit=18446744073709551615 (16777216.00 TiB) I0414 17:46:51.538349 19329 Decode.cpp:136] Gflags after parsing --flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.94999999999999996; --adambeta2=0.98999999999999999; --am=/data/synth_data/rundir/001_model_iter_003.bin; --am_decoder_tr_dropout=0.20000000000000001; --am_decoder_tr_layerdrop=0.20000000000000001; --am_decoder_tr_layers=6; --arch=/data/synth_data/arch.txt; --attention=keyvalue; --attentionthreshold=2147483647; --attnWindow=softPretrain; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batching_max_duration=0; --batching_strategy=none; --batchsize=4; --beamsize=10000; --beamsizetoken=25; --beamthreshold=30; --channels=1; --criterion=ctc; --critoptim=adagrad; --datadir=/data/synth_data03_23_2021/; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=256; --eosscore=0; --everstoredb=false; --features_type=mfsc; --fftcachesize=1; --filterbanks=80; --fl_amp_max_scale_factor=32000; --fl_amp_scale_factor=4096; --fl_amp_scale_factor_update_interval=2000; --fl_amp_use_mixed_precision=false; --fl_benchmark_mode=true; --fl_log_level=; --fl_log_mem_ops_interval=0; --fl_optim_mode=; --fl_vlog_level=0; --flagsfile=/data/synth_data/train.cfg; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --highfreqfilterbank=-1; --inputfeeding=false; --isbeamdump=false; --iter=100000000; --itersave=true; --labelsmooth=0.050000000000000003; --leftWindowSize=50; --lexicon=/data/synth_data/lexicon.txt; --linlr=-1; --linlrcrit=-1; --linseg=0; 
--lm=/data/synth_data/lm_03.arpa; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=2; --lmweight_high=4; --lmweight_low=0; --lmweight_step=0.20000000000000001; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lowfreqfilterbank=0; --lr=0.025000000000000001; --lr_decay=100; --lr_decay_step=50; --lrcosine=false; --lrcrit=0.02; --max_devices_per_node=8; --maxdecoderoutputlen=400; --maxgradnorm=0.10000000000000001; --maxload=-1; --maxrate=10; --maxsil=50; --maxword=-1; --melfloor=1; --mfcccoeffs=13; --minrate=3; --minsil=0; --momentum=0.80000000000000004; --netoptim=sgd; --nthread=8; --nthread_decoder=8; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --pctteacherforcing=99; --pcttraineval=1; --pretrainWindow=0; --replabel=0; --reportiters=1000; --rightWindowSize=50; --rndv_filepath=; --rundir=/data/synth_data/rundir; --samplerate=16000; --sampletarget=0.01; --samplingstrategy=rand; --saug_fmaskf=30; --saug_fmaskn=2; --saug_start_update=24000; --saug_tmaskn=10; --saug_tmaskp=0.050000000000000003; --saug_tmaskt=30; --sclite=; --seed=0; --sfx_config=; --sfx_start_update=2147483647; --show=true; --showletters=true; --silscore=0; --smearing=max; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=4; --sqnorm=true; --stepsize=9223372036854775807; --surround=; --test=test.lst; --tokens=/data/synth_data/tokens.txt; --train=train.lst; --trainWithWindow=true; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=false; --valid=dev.lst; --validbatchsize=-1; --warmup=48000; --weightdecay=0; --wordscore=0; --wordseparator=|; --world_rank=0; --world_size=1; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; 
--max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=; I0414 17:46:51.538650 19329 Decode.cpp:160] Number of classes (network): 29 I0414 17:46:51.539192 19329 Decode.cpp:167] Number of words: 910 Loading the LM will be faster if you build a binary file. Reading /data/synth_data/lm_03.arpa ----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100


I0414 17:46:51.540093 19329 Decode.cpp:276] [Decoder] LM constructed.
I0414 17:46:51.540910 19329 Decode.cpp:295] [Decoder] Trie smeared.
I0414 17:46:51.541504 19329 Decode.cpp:353] [Dataset] Dataset loaded, with 1 samples.
I0414 17:46:51.542213 19341 Decode.cpp:597] [Decoder] Lexicon decoder with wrd-LM loaded in thread: 0
I0414 17:46:51.542218 19343 Decode.cpp:597] [Decoder] Lexicon decoder with wrd-LM loaded in thread: 5
I0414 17:46:51.542253 19344 Decode.cpp:597] [Decoder] Lexicon decoder with wrd-LM loaded in thread: 1
I0414 17:46:51.542253 19339 Decode.cpp:597] [Decoder] Lexicon decoder with wrd-LM loaded in thread: 2
I0414 17:46:51.542275 19346 Decode.cpp:597] [Decoder] Lexicon decoder with wrd-LM loaded in thread: 7
I0414 17:46:51.542275 19345 Decode.cpp:597] [Decoder] Lexicon decoder with wrd-LM loaded in thread: 6
I0414 17:46:51.542253 19342 Decode.cpp:597] [Decoder] Lexicon decoder with wrd-LM loaded in thread: 4
I0414 17:46:51.542280 19340 Decode.cpp:597] [Decoder] Lexicon decoder with wrd-LM loaded in thread: 3
|T|:
|P|: call
|t|:
|p|: | | | c a l l | | | |
[sample: silence_2s.wav, WER: inf%, TER: inf%, slice WER: inf%, slice TER: inf%, decoded samples (thread 3): 1]

[Decode test.lst (1 samples) in 4.13012s (actual decoding time 1.28s/sample) -- WER: inf%, TER: inf%]
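The `inf%` values are consistent with how an error rate behaves when the reference is empty: any predicted word or token counts as an insertion, and dividing by a reference length of zero has no finite result. The sketch below illustrates that arithmetic with a plain Levenshtein distance. It is not the decoder's actual implementation, and returning 0 when both the reference and the prediction are empty is a convention chosen here for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


def error_rate(ref, hyp):
    """Error rate in percent; inf for an empty reference with a non-empty prediction."""
    edits = edit_distance(ref, hyp)
    if not ref:
        # Assumed convention: no reference and no prediction counts as 0% error.
        return float("inf") if edits else 0.0
    return 100.0 * edits / len(ref)


print(error_rate([], ["call"]))       # word level: empty |T|, prediction "call"
print(error_rate([], list("call")))   # token level: empty |t|, letters c a l l
```

Because of this, a test list made only of silent files cannot give a meaningful aggregate WER. Counting how many silent samples produce a non-empty prediction is a more useful number for this kind of check.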

Note:

  1. AM: Trained AM
     i) lmweight = 2 or 3. Decoder output: |T|: |P|: call |t|: |p|: | | | c a l l | | | |
     ii) lmweight = 0. Decoder output: |T|: |P|: can e can al |t|: |p|: c a n | e | | | c a n | a l | | | |

  2. AM: 70M-parameter AM .bin (base model)
     lmweight = 0, 1, 2, and so on. Decoder output: |T|: |P|: |t|: |p|:
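To make the relation between the `|p|` token line and the `|P|` word line explicit, here is a small sketch that groups letter tokens into words using the word separator `|` (as set by `--wordseparator=|`). The function name `tokens_to_words` is made up for illustration, and the code only mirrors what the printed lines suggest, not the decoder's internal logic.

```python
def tokens_to_words(token_line, separator="|"):
    """Group space-separated letter tokens into words, splitting on the separator token."""
    words, current = [], []
    for tok in token_line.split():
        if tok == separator:
            if current:                  # close the word collected so far
                words.append("".join(current))
                current = []
        else:
            current.append(tok)
    if current:                          # trailing word without a closing separator
        words.append("".join(current))
    return words


print(tokens_to_words("| | | c a l l | | | |"))
print(tokens_to_words("c a n | e | | | c a n | a l | | | |"))
print(tokens_to_words(""))
```

The three calls give `['call']`, `['can', 'e', 'can', 'al']`, and `[]`. So with lmweight = 0 the trained AM itself emits letters on silence, and the LM and lexicon only reshape them into "call".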

Platform and Hardware

  1. OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu version - 18.04 LTS
  2. Python version: Python 3.6.9
  3. Bazel version (if compiling from source): N/A
  4. GCC/Compiler version (if compiling from source): N/A
  5. CUDA/cuDNN version: 10.1/7.6.4.38
  6. GPU model and memory: NVIDIA-SMI 460.27.04 Driver Version: 460.27.04

Additional Context

  1. Flashlight Commit: c1bdf0b0735d191e5734c4389f360d316908b665 (HEAD -> master, origin/master, origin/HEAD) Author: Your Name you@example.com Date: Fri Feb 26 14:19:46 2021 -0800

    Set CachingMemoryManager anti-fragmentation values via environment va… (#434)

  2. ArrayFire Commit: d9d9b6584029e0b480875cdcf35f3238d43ac0e0 (HEAD, tag: v3.7.1) Author: Umar Arshad umar@arrayfire.com Date: Fri Mar 27 19:19:16 2020 -0400

    Update package version to 3.7.1

tlikhomanenko commented 3 years ago

Closing as a duplicate of https://github.com/facebookresearch/flashlight/issues/539. Please report all errors directly in the flashlight repo from now on, since you are using the newest version, from after w2l was migrated into fl.