lunixbochs closed this issue 4 years ago.
@lunixbochs I see no token file in the flags for the converter, and as far as I know the tokens file is not inside the original model. It is, however, writing a tokens file to tokens.txt. Does it regenerate them from the lexicon? Which tokens file are you using: the original one you made, the one from the fb model, or the file created by the exporter?
It picked up the tokens from the model's directory using the model's built-in flags. I confirmed the tokens are fine. The tokens aren't a factor anyway - the bad exported models classify each frame as blank, which happens before tokens are considered.
I dumped the resulting inference model layers. The weights seem the same, but the layers' column blocking is different:
https://gist.github.com/lunixbochs/342dc47789be3e33c30ce4ddf7320df2
For example, in the first layer:
- Conv1dFbGemm:{base=Conv1d:{inChannels_=80 outChannels_=1200 kernelSize_=10 stride_=2 rightPadding_=3 leftPadding_=5 groups_=80 } packedWeights_=PackedGemmMatrixFP16:{ num_rows:10 ncol:15 block_row_size:512 last_block_row:10 block_col_size:16 num_block_row:1 num_block_col:1 mat_size:8192 content=
block_col_size:16 -> block_col_size:32
mat_size:8192 -> mat_size:16384
The linear layers are also slightly different:
LinearFbGemm:{base=Linear:{nInput_=2160 nOutput_=2160} packedWeights_=PackedGemmMatrixFP16:{ num_rows:2160 ncol:2160 block_row_size:512 last_block_row:112 block_col_size:16 num_block_row:5 num_block_col:135 mat_size:5529600}} bias_=ModuleParameter:{type_=FLOAT buffer_=IOBuffer:{name_= offsetInBytes_=0 buf_.size()=8640 sizeInBytes_=8640}}}
vs
LinearFbGemm:{base=Linear:{nInput_=2160 nOutput_=2160} packedWeights_=PackedGemmMatrixFP16:{ num_rows:2160 ncol:2160 block_row_size:512 last_block_row:112 block_col_size:32 num_block_row:5 num_block_col:68 mat_size:5570560}} bias_=ModuleParameter:{type_=FLOAT buffer_=IOBuffer:{name_= offsetInBytes_=0 buf_.size()=8640 sizeInBytes_=8640}}}
Is this a problem? Why would this be?
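For reference, the two dumps are consistent with a simple blocking identity: in every layer shown, mat_size equals block_row_size * num_block_row * block_col_size * num_block_col, so the two exports differ only in column blocking (16 vs 32 columns per block), not in the amount of underlying weight data. A quick sanity check of that identity, using the numbers from the dumps above (plain arithmetic, not wav2letter/fbgemm code):

#include <cassert>
#include <cstdint>

// Packed-buffer size implied by the blocking parameters in the dumps above.
int64_t packedMatSize(int64_t blockRowSize, int64_t numBlockRow,
                      int64_t blockColSize, int64_t numBlockCol) {
  return blockRowSize * numBlockRow * blockColSize * numBlockCol;
}

int main() {
  // First conv layer from the two dumps: block_col_size 16 vs 32.
  assert(packedMatSize(512, 1, 16, 1) == 8192);
  assert(packedMatSize(512, 1, 32, 1) == 16384);
  // Linear layer: (block_col_size 16, 135 col blocks) vs (32, 68 col blocks).
  assert(packedMatSize(512, 5, 16, 135) == 5529600);
  assert(packedMatSize(512, 5, 32, 68) == 5570560);
  return 0;
}

One plausible reading is that fbgemm chooses its packing block sizes based on the host CPU, which would make the packed buffer layout machine-dependent; the serialization issue discussed further down points in the same direction.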
Hi, when you run StreamingTDSModelConverter.cpp, there is a test to make sure the new serialized model produces the same results as before. Did it pass when you ran the conversion for your new serialized model?
I0318 00:28:22.217653 27864 StreamingTDSModelConverter.cpp:311] verifying serialization ...
I0318 00:28:26.654184 27864 StreamingTDSModelConverter.cpp:339] Done !
I think maybe there's a platform-specific serialization bug? My newly exported models actually work on the machine I exported them on, but don't work on the machine I copied them to.
Your pre-exported model works in both places.
This is what the model outputs look like:
Working export (machine A):
maxIdx=9997 maxValue=18.7374
maxIdx=9997 maxValue=19.4477
maxIdx=7 maxValue=14.1724
maxIdx=9997 maxValue=21.3105
maxIdx=9997 maxValue=17.0877
maxIdx=9997 maxValue=16.103
maxIdx=9997 maxValue=21.9728
maxIdx=21 maxValue=15.5784
maxIdx=9997 maxValue=23.2172
maxIdx=133 maxValue=17.5036
maxIdx=9997 maxValue=24.9681
maxIdx=9997 maxValue=13.5481
maxIdx=9997 maxValue=21.9547
Working export (machine B):
maxIdx=9997 maxValue=18.7311
maxIdx=9997 maxValue=19.4231
maxIdx=7 maxValue=14.1835
maxIdx=9997 maxValue=21.3038
maxIdx=9997 maxValue=17.0709
maxIdx=9997 maxValue=16.1126
maxIdx=9997 maxValue=22.0041
maxIdx=21 maxValue=15.5785
maxIdx=9997 maxValue=23.1972
maxIdx=133 maxValue=17.4718
maxIdx=9997 maxValue=25.0089
maxIdx=9997 maxValue=13.5653
maxIdx=9997 maxValue=21.9341
My export (machine A):
maxIdx=9997 maxValue=21.6731
maxIdx=9997 maxValue=19.0553
maxIdx=9997 maxValue=19.285
maxIdx=288 maxValue=16.5611
maxIdx=9997 maxValue=19.5208
maxIdx=1 maxValue=15.4871
maxIdx=9997 maxValue=18.0554
maxIdx=1605 maxValue=15.006
maxIdx=9997 maxValue=19.9644
maxIdx=15 maxValue=15.9195
maxIdx=9997 maxValue=20.535
maxIdx=128 maxValue=15.6588
My export (machine B):
maxIdx=9997 maxValue=7.31259
maxIdx=9997 maxValue=7.3065
maxIdx=9997 maxValue=7.36161
maxIdx=9997 maxValue=7.34436
maxIdx=9997 maxValue=7.38665
maxIdx=9997 maxValue=7.37532
maxIdx=9997 maxValue=7.41259
maxIdx=9997 maxValue=7.40084
maxIdx=9997 maxValue=7.41554
maxIdx=9997 maxValue=7.4043
maxIdx=9997 maxValue=7.42162
maxIdx=9997 maxValue=7.4235
Something is really wrong here. This is what some of the labels look like, with blank at the end:
0.163285 0.0443605 -1.00704 -1.24982 1.09511 0.837985 -0.0770629 1.44846 1.47266 -0.456674 -0.922446 -1.00534 0.0227617 7.35806
0.00368934 -0.0220936 -0.0079521 -0.0204602 0.000875052 0.17069 0.0406591 -1.00935 -1.25585 1.08697 0.832637 -0.0810163 1.44864 1.47242 -0.480764 -0.92135 -1.00545 0.0382127 7.36907
Oh no! I'm debugging this ATM. Have a feeling the save/load function here is not platform agnostic - https://github.com/facebookresearch/wav2letter/blob/master/inference/inference/module/nn/backend/fbgemm/PackedGemmMatrixFP16.h
Could you replace the above file with https://gist.github.com/vineelpratap/04a50b06074e055001bccf97bf5d3f3a and give it a try?
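For readers following along: the symptom is consistent with the exporter serializing the already-packed fbgemm buffer, whose internal layout depends on the CPU it was packed on, so loading it on a machine that packs differently yields scrambled weights. A minimal sketch of the platform-agnostic idea (illustrative only, not the contents of the gist; repackForHostCpu is a hypothetical stand-in for whatever fbgemm re-packing call the real fix uses):

#include <cstdint>
#include <istream>
#include <ostream>
#include <vector>

// Store the logical (unpacked, row-major) weights on disk and re-pack them
// for the current CPU at load time, instead of dumping a packed buffer whose
// blocking depends on the machine that exported it. Endianness is ignored
// here for brevity.
struct PortableWeights {
  int64_t rows = 0;
  int64_t cols = 0;
  std::vector<float> data; // row-major, rows * cols entries

  void save(std::ostream& os) const {
    os.write(reinterpret_cast<const char*>(&rows), sizeof(rows));
    os.write(reinterpret_cast<const char*>(&cols), sizeof(cols));
    os.write(reinterpret_cast<const char*>(data.data()),
             static_cast<std::streamsize>(data.size() * sizeof(float)));
  }

  void load(std::istream& is) {
    is.read(reinterpret_cast<char*>(&rows), sizeof(rows));
    is.read(reinterpret_cast<char*>(&cols), sizeof(cols));
    data.resize(static_cast<size_t>(rows * cols));
    is.read(reinterpret_cast<char*>(data.data()),
            static_cast<std::streamsize>(data.size() * sizeof(float)));
    // Hand the row-major floats to fbgemm so it can pack them with whatever
    // blocking the *current* CPU expects, e.g.:
    //   auto packed = repackForHostCpu(rows, cols, data); // hypothetical
  }
};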
This works!
Okay cool. I'll fix the master in a day or two!
Thanks so much! I'm excited to try my new streaming convnet model for interactive use.
Thanks @vineelpratap !!
Hi @lunixbochs, could you share your results comparing the Decoder and the Interactive Streaming model? I spent a lot of time trying to find the difference between the Decoder and Interactive Streaming. Even though I set lmweight = 0 and use the same beamsize / beamsizetoken, the final result on the same validation dataset is still different. Is this expected behavior?
This is the setting for the Decoder:
--uselexicon=false \
--wordseparator=_ \
--beamsize=10 \
--beamsizetoken=1 \
--beamthreshold=100 \
--nthread_decoder=1 \
--lm='' \
--lmtype=kenlm \
--lmweight=0 \
--wordscore 0 \
--eosscore 0 \
--silscore 0 \
--unkscore 0 \
--smearing=max \
--maxload -1 \
And this is the decoder.json file for Interactive Streaming:
{
  "beamSize": 10,
  "beamSizeToken": 1,
  "beamThreshold": 100,
  "usewordpiece": true,
  "lmWeight": 0,
  "wordScore": 0,
  "unkScore": 0,
  "silScore": 0.0,
  "eosScore": 0.0,
  "smearing": "max",
  "logAdd": false,
  "criterionType": "CTC"
}
Could you please help me?
I modified simple_streaming_asr_example to use very simple Greedy CTC decoding instead of a language model (as I don't have a language model that works with my acoustic model). I tested this with your published Streaming ConvNet inference model. It works perfectly.
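For context, greedy (best-path) CTC decoding just takes the argmax token per frame, collapses consecutive repeats, and drops blanks; a minimal sketch of that idea (the function and variable names are illustrative, not the code actually used in the modified example):

#include <vector>

// Greedy (best-path) CTC decode: argmax per frame, collapse repeated tokens,
// drop blanks. `frames` holds per-frame scores over the token set.
std::vector<int> greedyCtcDecode(const std::vector<std::vector<float>>& frames,
                                 int blankIdx) {
  std::vector<int> out;
  int prev = -1;
  for (const auto& frame : frames) {
    int best = 0;
    for (int i = 1; i < static_cast<int>(frame.size()); ++i) {
      if (frame[i] > frame[best]) {
        best = i;
      }
    }
    if (best != blankIdx && best != prev) {
      out.push_back(best);
    }
    prev = best;
  }
  return out;
}

With this scheme, an export whose argmax is the blank index on every frame (9997 in the logs above) produces an empty transcription, which matches the symptom below.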
I trained a new Streaming Convnets model, with a new 10k token set.
It works great with the wav2letter Test binary.
I exported it with
streaming_tds_model_converter --am 021_model_last.bin
using the latest master: 8c56179b0c03c30412779529670a4036c7aae2b9. I took the exported model directory and ran it against a wav file:
The greedy CTC decoder reports that the "best class" for each frame is 9997, which is the CTC blank token. So there's no output. If I run the same modified (Greedy CTC) simple_streaming_asr_example command, with the same input wave file, with your published model, I get a reasonable transcription. If I run the same wave file with Test against both models, it works fine.
Now! If I re-export the Facebook Streaming ConvNet example model from here:
Then use it with simple_streaming_asr_example, I have the same problem. All frames report the blank token as most probable. What am I missing? Either the exporter is broken, or my flags are wrong?
Here's the output log from running streaming_tds_model_converter on your pre-trained model: