abhinavkulkarni opened this issue 3 years ago
Hi,
To make the architecture streamable, you would have to make changes to the TDS+CTC architecture. Using the plain TDS+CTC architecture won't work for the streaming use case...
Here are the main changes ...
- LN, TDS - see the changed architecture file in the streaming_convnets recipe
- --localnrmlleftctx=300
Thanks, @vineelpratap.
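The LN part of these changes can be illustrated with a small numpy sketch. This is my own illustration, not code from the recipe; the shapes and the assumption that axis 0 is the time axis are for demonstration only:

```python
# Toy sketch of why the LN change matters for streaming. Normalizing
# over all axes (including time) makes a frame's output depend on
# future frames; normalizing each frame over the non-time axes only
# means a prefix of the input produces a prefix of the output.
import numpy as np

def layernorm(x, axes):
    mean = x.mean(axis=axes, keepdims=True)
    std = x.std(axis=axes, keepdims=True)
    return (x - mean) / (std + 1e-5)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8, 4))  # (time, width, channels), toy sizes

# Per-frame norm: the output on a 5-frame prefix matches the first
# 5 frames of the full output -> computable in a streaming fashion.
full = layernorm(x, (1, 2))
prefix = layernorm(x[:5], (1, 2))
assert np.allclose(full[:5], prefix)

# Norm over the time axis as well: the prefix output differs from the
# full output's prefix -> not streamable.
full_t = layernorm(x, (0, 1, 2))
prefix_t = layernorm(x[:5], (0, 1, 2))
assert not np.allclose(full_t[:5], prefix_t)
```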
Thanks, @vineelpratap.
For LN (LayerNorm), can I simply remove the time dimension and reuse the parameters of the rest of the layers as is?
--localnrmlleftctx=300 is moot for the TDS+CTC architecture since LocalNorm (not to be confused with LayerNorm) isn't used anywhere in the model. Is my understanding correct?
I did the above two (converted LN 0 1 2 to LN 1 2 in the arch file and provided --localnrmlleftctx=300 in the config file) and ran the streaming TDS module conversion script, and was able to obtain an acoustic_model.bin; however, I get the following error. It looks like the outputs of the Flashlight and FBGEMM models don't match.
What additional changes need to be made?
Thanks!
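For reference, the two edits above can be sketched as follows. The arch lines here are toy stand-ins taken from module names in the converter log below, not the full sota/2019 arch file, and the config flag is the one the streaming_convnets recipe uses:

```python
# Sketch of the two edits: the arch lines are illustrative stand-ins,
# not the real recipe file contents.
arch = "\n".join([
    "SAUG 80 27 2 100 1.0 2",
    "V -1 NFEAT 1 0",
    "LN 0 1 2",
    "DO 0.0",
])

# 1) Arch file: drop the time axis from every LayerNorm line.
arch = arch.replace("LN 0 1 2", "LN 1 2")

# 2) Config file: enable local input normalization with 300 frames of
#    left context.
extra_flag = "--localnrmlleftctx=300"

assert "LN 1 2" in arch and "LN 0 1 2" not in arch
```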
/home/w2luser/Projects/wav2letter/cmake-build-debug-fbgemm/tools/streaming_tds_model_converter --am /data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin --outdir /home/w2luser/models --flagsfile /home/w2luser/Projects/wav2letter/recipes/models/sota/2019/librispeech/train_am_tds_ctc.cfg --logtostderr=1
I1117 14:52:10.108525 53902 StreamingTDSModelConverter.cpp:152] [Network] Reading acoustic model from /data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin
I1117 14:52:10.856041 53902 StreamingTDSModelConverter.cpp:157] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> output]
(0): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
(1): View (-1 80 1 0)
(2): Conv2D (1->10, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(3): ReLU
(4): Dropout (0.000000)
(5): LayerNorm ( axis : { 0 1 2 } , size : -1)
(6): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(7): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(8): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(9): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(10): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(11): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(12): ReLU
(13): Dropout (0.000000)
(14): LayerNorm ( axis : { 0 1 2 } , size : -1)
(15): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(16): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(17): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(18): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(19): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(20): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(21): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(22): ReLU
(23): Dropout (0.000000)
(24): LayerNorm ( axis : { 0 1 2 } , size : -1)
(25): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(26): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(27): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(28): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(29): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(30): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(31): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(32): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(33): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(34): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(35): View (0 1440 1 0)
(36): Reorder (1,0,3,2)
(37): Linear (1440->9998) (with bias)
I1117 14:52:10.856158 53902 StreamingTDSModelConverter.cpp:158] [Criterion] ConnectionistTemporalClassificationCriterion
I1117 14:52:10.856165 53902 StreamingTDSModelConverter.cpp:159] [Network] Number of params: 203394122
I1117 14:52:10.856205 53902 StreamingTDSModelConverter.cpp:165] [Network] Updating flags from config file: /data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin
I1117 14:52:10.856637 53902 StreamingTDSModelConverter.cpp:174] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=am_arch/am_tds_ctc.arch; --archdir=/home/w2luser/Projects/wav2letter/recipes/models/sota/2019; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=4; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/home/w2luser/Projects/wav2letter/recipes/models/sota/2019/librispeech/train_am_tds_ctc.cfg; --framesizems=30; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1500; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/home/w2luser/w2l/am/librispeech-train+dev-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.29999999999999999; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; 
--maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=8338608; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.5; --netoptim=sgd; --noresample=false; --nthread=10; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=/checkpoint/qiantong/ls_200M/do0.15_l5.6.10_mid3.0_incDO/100_rndv; --rundir=[...]; --runname=am_tds_ctc_librispeech; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=2; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=200; --surround=; --tag=; --target=ltr; --test=; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/home/w2luser/w2l/am; --train=[DATA_DST]/lists/train-clean-100.lst,[DATA_DST]/lists/train-clean-360.lst,[DATA_DST]/lists/train-other-500.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=dev-clean:[DATA_DST]/lists/dev-clean.lst,dev-other:[DATA_DST]/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=64; --outdir=/home/w2luser/models; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; 
--stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I1117 14:52:10.876313 53902 StreamingTDSModelConverter.cpp:192] Number of classes (network): 9998
Skipping SpecAugment module: SAUG 80 27 2 100 1.0 2
Skipping View module: V -1 NFEAT 1 0
Skipping Dropout module: DO 0.0
Skipping Dropout module: DO 0.0
Skipping Dropout module: DO 0.0
Skipping View module: V 0 1440 1 0
Skipping Reorder module: RO 1 0 3 2
I1117 14:52:26.342659 53902 StreamingTDSModelConverter.cpp:289] Serializing acoustic model to '/home/w2luser/models/acoustic_model.bin'
I1117 14:52:36.974776 53902 StreamingTDSModelConverter.cpp:301] Writing tokens file to '/home/w2luser/models/tokens.txt'
I1117 14:52:36.977149 53902 StreamingTDSModelConverter.cpp:328] Serializing feature extraction model to '/home/w2luser/models/feature_extractor.bin'
I1117 14:52:36.980671 53902 StreamingTDSModelConverter.cpp:344] verifying serialization ...
F1117 14:52:37.219713 53902 StreamingTDSModelConverter.cpp:368] [Serialization Error] Mismatched output w2l:2.72653 vs streaming:12.5302
*** Check failure stack trace: ***
@ 0x7f4f9d8441c3 google::LogMessage::Fail()
@ 0x7f4f9d84925b google::LogMessage::SendToLog()
@ 0x7f4f9d843ebf google::LogMessage::Flush()
@ 0x7f4f9d8446ef google::LogMessageFatal::~LogMessageFatal()
@ 0x55f014b84301 main
@ 0x7f4f9d1eccb2 __libc_start_main
@ 0x55f014b80ade _start
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
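For context, the fatal check at StreamingTDSModelConverter.cpp:368 runs the same input through the original Flashlight model and the serialized streaming model and aborts when the outputs disagree. A rough sketch of that kind of check (my illustration; the function name, tolerance, and error handling are assumptions, not the converter's actual code):

```python
import numpy as np

def verify_outputs(w2l_out, streaming_out, tol=1e-3):
    # Both models are assumed to have been forwarded on the same
    # features upstream; here we only compare the resulting outputs.
    diff = float(np.max(np.abs(np.asarray(w2l_out) - np.asarray(streaming_out))))
    if diff > tol:
        raise RuntimeError(
            "[Serialization Error] Mismatched output "
            f"w2l:{w2l_out} vs streaming:{streaming_out}")
    return diff

# The failure in the log (2.72653 vs 12.5302) is a large disagreement,
# which suggests a structural mismatch (e.g. normalization handled
# differently in the two models) rather than numerical noise.
```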
Hey @vineelpratap,
Do you have any advice regarding the above?
Thanks!
hi @abhinavkulkarni
Question
Hi,
Other than the architecture, what is the difference between the sota/2019/am_tds_ctc and streaming_convnets/librispeech/am_500ms_future_context models?
I am able to convert the latter to an FBGEMM streaming convnet using the conversion tool; however, I got the following error when I tried converting the former:
I was under the impression that any TDS CTC model could be converted to FBGEMM streaming convnets.
Thanks!