flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Unable to use models exported with streaming_tds_model_converter #573

Closed lunixbochs closed 4 years ago

lunixbochs commented 4 years ago
  1. I modified simple_streaming_asr_example to use very simple greedy CTC decoding instead of a language model (as I don't have a language model that works with my acoustic model); a minimal sketch of that decoding is included after this list.

  2. I tested this with your published Streaming ConvNet inference model. It works perfectly.

  3. I trained a new Streaming Convnets model, with a new 10k token set.

  4. It works great with the wav2letter Test binary.

  5. I exported it with streaming_tds_model_converter --am 021_model_last.bin using the latest master: 8c56179b0c03c30412779529670a4036c7aae2b9

  6. I took the exported model directory and ran it against a wav file:

    cat file.wav | ./simple_streaming_asr_example -input_files_base_path path/to/export/
  7. The greedy CTC decoder reports that the "best class" for each frame is 9997, which is the CTC blank token. So there's no output. If I run the same modified (Greedy CTC) simple_streaming_asr_example command, with the same input wave file, with your published model, I get a reasonable transcription.

  8. If I run the same wave file with Test against both models, it works fine.

  9. Now! If I re-export the Facebook Streaming ConvNet example model from here:

  10. Then use it with simple_streaming_asr_example, I have the same problem: all frames report the blank token as most probable.

What am I missing? Is the exporter broken, or are my flags wrong?
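For reference, the greedy CTC decoding I swapped in is nothing more complicated than a per-frame argmax, collapsing repeats, and dropping the blank index. A minimal sketch (not the exact code I patched into simple_streaming_asr_example; the frame-major score layout and blank-last convention are assumptions here):

    #include <vector>

    // Greedy CTC decode: argmax per frame, collapse repeats, drop blanks.
    // Assumes `scores` is frame-major with `nTokens` scores per frame and
    // the blank token at the last index (9997 for a 9998-class model).
    std::vector<int> greedyCtcDecode(const std::vector<float>& scores, int nTokens) {
      const int blank = nTokens - 1;
      std::vector<int> bestPath;
      int prev = blank;
      for (size_t off = 0; off + nTokens <= scores.size(); off += nTokens) {
        int best = 0;
        for (int i = 1; i < nTokens; ++i) {
          if (scores[off + i] > scores[off + best]) {
            best = i;
          }
        }
        if (best != blank && best != prev) {
          bestPath.push_back(best);
        }
        prev = best;
      }
      return bestPath;
    }

With 9998 classes the blank index is 9997, which is why "best class 9997 for every frame" means an empty transcription.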

Here's the output log from running streaming_tds_model_converter on your pre-trained model:

tmp$ ~/build/wav2letter/build/tools/streaming_tds_model_converter --am am_500ms_future_context_dev_other.bin 
I0313 05:30:10.906062 13288 StreamingTDSModelConverter.cpp:164] Gflags after parsing 
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/checkpoint/antares/wav2letter/recipes/models/seq2seq_tds/librispeech/am/librispeech-train+dev-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lrcosine=false; --lrcrit=0; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=33000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=200; --minrate=3; --minsil=0; --mintsz=2; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --sclite=; --seed=0; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=; --tokens=tokens.txt; --tokensdir=; --train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=32; 
--outdir=; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=; 
I0313 05:30:10.909544 13288 StreamingTDSModelConverter.cpp:179] Number of classes (network): 9998
Skipping View module: V -1 NFEAT 1 0
Skipping SpecAugment module: SAUG 80 27 2 100 1.0 2
Skipping Dropout module: DO 0.1
Skipping Dropout module: DO 0.1
Skipping Dropout module: DO 0.1
Skipping Dropout module: DO 0.1
Skipping Reorder module: RO 2 1 0 3
Skipping View module: V 2160 -1 1 0
Skipping View module: V NLABEL 0 -1 1
I0313 05:30:13.972755 13288 StreamingTDSModelConverter.cpp:272] Serializing acoustic model to 'acoustic_model.bin'
I0313 05:30:15.041009 13288 StreamingTDSModelConverter.cpp:284] Writing tokens file to 'tokens.txt'
I0313 05:30:15.042003 13288 StreamingTDSModelConverter.cpp:295] Serializing feature extraction model to 'feature_extractor.bin'
I0313 05:30:15.075234 13288 StreamingTDSModelConverter.cpp:311] verifying serialization ...
I0313 05:30:18.253557 13288 StreamingTDSModelConverter.cpp:339] Done !

joazoa commented 4 years ago

@lunixbochs I see no tokens file in the flags for the converter, and as far as I know the tokens file is not inside the original model. It is, however, writing a tokens file to tokens.txt. Does it regenerate it from the lexicon? Which tokens file are you using: the original you made, the one from the FB model, or the file created by the exporter?

lunixbochs commented 4 years ago

It picked up the tokens from the model's directory using the model's built-in flags. I confirmed the tokens are fine. The tokens aren't a factor anyway: the bad exported models classify each frame as blank, which happens before tokens are considered.

lunixbochs commented 4 years ago

I dumped the resulting inference model layers. The weights seem the same, but the layers' packing/column parameters are different:

https://gist.github.com/lunixbochs/342dc47789be3e33c30ce4ddf7320df2

For example, in the first layer:

- Conv1dFbGemm:{base=Conv1d:{inChannels_=80 outChannels_=1200 kernelSize_=10 stride_=2 rightPadding_=3 leftPadding_=5 groups_=80 } packedWeights_=PackedGemmMatrixFP16:{ num_rows:10 ncol:15 block_row_size:512 last_brlock_ow:10 block_col_size:16 num_block_row:1 num_clock_col:1 mat_size:8192 content=

block_col_size:16 -> block_col_size:32
mat_size:8192 -> mat_size:16384

The linear layers are also slightly different:

LinearFbGemm:{base=Linear:{nInput_=2160 nOutput_=2160} packedWeights_=PackedGemmMatrixFP16:{ num_rows:2160 ncol:2160 block_row_size:512 last_brlock_ow:112 block_col_size:16 num_block_row:5 num_clock_col:135 mat_size:5529600}} bias_=ModuleParameter:{type_=FLOAT buffer_=IOBuffer:{name_= offsetInBytes_=0 buf_.size()=8640 sizeInBytes_=8640}}}

vs

LinearFbGemm:{base=Linear:{nInput_=2160 nOutput_=2160} packedWeights_=PackedGemmMatrixFP16:{ num_rows:2160 ncol:2160 block_row_size:512 last_brlock_ow:112 block_col_size:32 num_block_row:5 num_clock_col:68 mat_size:5570560}} bias_=ModuleParameter:{type_=FLOAT buffer_=IOBuffer:{name_= offsetInBytes_=0 buf_.size()=8640 sizeInBytes_=8640}}}

Is this a problem? Why would this be?
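For what it's worth, the mat_size values in both dumps are consistent with packed size = block_row_size * num_block_row * block_col_size * num_block_col (my reading of the dump fields, not something documented), so the 8192 -> 16384 jump follows directly from block_col_size going from 16 to 32:

    #include <cstdio>

    // Packed-buffer size as implied by the dumps above:
    // block_row_size * num_block_row * block_col_size * num_block_col.
    long long packedSize(long long blockRow, long long numBlockRow,
                         long long blockCol, long long numBlockCol) {
      return blockRow * numBlockRow * blockCol * numBlockCol;
    }

    int main() {
      // First conv layer: published export vs. my export.
      std::printf("%lld %lld\n", packedSize(512, 1, 16, 1), packedSize(512, 1, 32, 1));    // 8192 16384
      // Linear layer: published export vs. my export.
      std::printf("%lld %lld\n", packedSize(512, 5, 16, 135), packedSize(512, 5, 32, 68)); // 5529600 5570560
      return 0;
    }

If the backend picks those block sizes per CPU, two machines could legitimately disagree about them, which would make serializing the packed buffer as-is suspect.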

vineelpratap commented 4 years ago

Hi, when you run StreamingTDSModelConverter.cpp, we have a test to make sure the new serialized model produces the same results as before. Did it pass when you did the conversion for your new serialized model?

lunixbochs commented 4 years ago

I0318 00:28:22.217653 27864 StreamingTDSModelConverter.cpp:311] verifying serialization ...
I0318 00:28:26.654184 27864 StreamingTDSModelConverter.cpp:339] Done !

lunixbochs commented 4 years ago

I think maybe there's a platform-specific serialization bug? My newly exported models actually work on the machine I exported them on, but don't work on the machine I copied them to.

Your pre-exported model works in both places.

lunixbochs commented 4 years ago

This is what the model outputs look like:

Working export (machine A):
maxIdx=9997 maxValue=18.7374
maxIdx=9997 maxValue=19.4477
maxIdx=7 maxValue=14.1724
maxIdx=9997 maxValue=21.3105
maxIdx=9997 maxValue=17.0877
maxIdx=9997 maxValue=16.103
maxIdx=9997 maxValue=21.9728
maxIdx=21 maxValue=15.5784
maxIdx=9997 maxValue=23.2172
maxIdx=133 maxValue=17.5036
maxIdx=9997 maxValue=24.9681
maxIdx=9997 maxValue=13.5481
maxIdx=9997 maxValue=21.9547

Working export (machine B):
maxIdx=9997 maxValue=18.7311
maxIdx=9997 maxValue=19.4231
maxIdx=7 maxValue=14.1835
maxIdx=9997 maxValue=21.3038
maxIdx=9997 maxValue=17.0709
maxIdx=9997 maxValue=16.1126
maxIdx=9997 maxValue=22.0041
maxIdx=21 maxValue=15.5785
maxIdx=9997 maxValue=23.1972
maxIdx=133 maxValue=17.4718
maxIdx=9997 maxValue=25.0089
maxIdx=9997 maxValue=13.5653
maxIdx=9997 maxValue=21.9341

My export (machine A):
maxIdx=9997 maxValue=21.6731
maxIdx=9997 maxValue=19.0553
maxIdx=9997 maxValue=19.285
maxIdx=288 maxValue=16.5611
maxIdx=9997 maxValue=19.5208
maxIdx=1 maxValue=15.4871
maxIdx=9997 maxValue=18.0554
maxIdx=1605 maxValue=15.006
maxIdx=9997 maxValue=19.9644
maxIdx=15 maxValue=15.9195
maxIdx=9997 maxValue=20.535
maxIdx=128 maxValue=15.6588

My export (machine B):
maxIdx=9997 maxValue=7.31259
maxIdx=9997 maxValue=7.3065
maxIdx=9997 maxValue=7.36161
maxIdx=9997 maxValue=7.34436
maxIdx=9997 maxValue=7.38665
maxIdx=9997 maxValue=7.37532
maxIdx=9997 maxValue=7.41259
maxIdx=9997 maxValue=7.40084
maxIdx=9997 maxValue=7.41554
maxIdx=9997 maxValue=7.4043
maxIdx=9997 maxValue=7.42162
maxIdx=9997 maxValue=7.4235

Something is really wrong here. This is what some of the labels look like, with blank at the end:

0.163285 0.0443605 -1.00704 -1.24982 1.09511 0.837985 -0.0770629 1.44846 1.47266 -0.456674 -0.922446 -1.00534 0.0227617 7.35806
0.00368934 -0.0220936 -0.0079521 -0.0204602 0.000875052 0.17069 0.0406591 -1.00935 -1.25585 1.08697 0.832637 -0.0810163 1.44864 1.47242 -0.480764 -0.92135 -1.00545 0.0382127 7.36907

vineelpratap commented 4 years ago

Oh no! I'm debugging this ATM. I have a feeling the save/load function here is not platform-agnostic: https://github.com/facebookresearch/wav2letter/blob/master/inference/inference/module/nn/backend/fbgemm/PackedGemmMatrixFP16.h
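If that's the cause, the fix would roughly be to serialize the plain row-major weights rather than the machine-specific packed buffer, and re-pack with the local block sizes at load time. A sketch with hypothetical types (not the real PackedGemmMatrixFP16 API):

    #include <cstddef>
    #include <istream>
    #include <ostream>
    #include <vector>

    // Hypothetical stand-in for the ISA-dependent packing step the inference
    // backend performs (block sizes may differ between machines).
    std::vector<float> packForThisCpu(const std::vector<float>& rowMajor) {
      return rowMajor; // identity here; real packing reorders into blocks
    }

    struct LinearWeights {
      int rows = 0, cols = 0;
      std::vector<float> rowMajor; // portable representation
      std::vector<float> packed;   // machine-specific representation
    };

    // Portable save: write dimensions and row-major weights only.
    void save(const LinearWeights& w, std::ostream& os) {
      os.write(reinterpret_cast<const char*>(&w.rows), sizeof(w.rows));
      os.write(reinterpret_cast<const char*>(&w.cols), sizeof(w.cols));
      os.write(reinterpret_cast<const char*>(w.rowMajor.data()),
               w.rowMajor.size() * sizeof(float));
    }

    // Portable load: read the row-major weights, then re-pack locally so the
    // packed layout always matches the machine doing the inference.
    LinearWeights load(std::istream& is) {
      LinearWeights w;
      is.read(reinterpret_cast<char*>(&w.rows), sizeof(w.rows));
      is.read(reinterpret_cast<char*>(&w.cols), sizeof(w.cols));
      w.rowMajor.resize(static_cast<std::size_t>(w.rows) * w.cols);
      is.read(reinterpret_cast<char*>(w.rowMajor.data()),
              w.rowMajor.size() * sizeof(float));
      w.packed = packForThisCpu(w.rowMajor); // never read packed bytes from disk
      return w;
    }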

vineelpratap commented 4 years ago

Could you try replacing the above file with https://gist.github.com/vineelpratap/04a50b06074e055001bccf97bf5d3f3a and give it a try?

lunixbochs commented 4 years ago

This works!

vineelpratap commented 4 years ago

Okay, cool. I'll fix it on master in a day or two!

lunixbochs commented 4 years ago

Thanks so much! I'm excited to try my new streaming convnet model for interactive use.

joazoa commented 4 years ago

Thanks @vineelpratap !!

VoThanhDanh95 commented 3 years ago

> Thanks so much! I'm excited to try my new streaming convnet model for interactive use.

Hi @lunixbochs, could you share your results comparing the Decoder and the Interactive Streaming model? I've spent a lot of time trying to find the source of the difference between Decoder and Interactive Streaming. Even though I set lmweight = 0 and use the same beamsize / beamsizetoken, the final results on the same validation dataset are still different. Is this expected behavior?

These are the settings for Decoder:

    --uselexicon=false \
    --wordseparator=_ \
    --beamsize=10 \
    --beamsizetoken=1 \
    --beamthreshold=100 \
    --nthread_decoder=1 \
    --lm='' \
    --lmtype=kenlm \
    --lmweight=0 \
    --wordscore 0 \
    --eosscore 0 \
    --silscore 0 \
    --unkscore 0 \
    --smearing=max \
    --maxload -1 \

And this is the decoder.json file for Interactive Streaming:

    {
      "beamSize": 10,
      "beamSizeToken": 1,
      "beamThreshold": 100,
      "usewordpiece": true,
      "lmWeight": 0,
      "wordScore": 0,
      "unkScore": 0,
      "silScore": 0.0,
      "eosScore": 0.0,
      "smearing": "max",
      "logAdd": false,
      "criterionType": "CTC"
    }

Could you please help me?