abhinavkulkarni opened this issue 3 years ago
Hi,
To make the architecture streamable, you would have to make changes to the TDS+CTC architecture. Using the plain TDS+CTC architecture won't work for the streaming use case...
Here are the main changes ...
- LN, TDS - see the changed architecture file in the streaming_convnets recipe
- --localnrmlleftctx=300
Thanks, @vineelpratap.
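The LN part of these changes can be illustrated with a small numpy sketch. This is my own illustration, not code from the recipe; the shapes and the assumption that axis 0 is the time axis are for demonstration only:

```python
# Toy sketch of why the LN change matters for streaming. Normalizing
# over all axes (including time) makes a frame's output depend on
# future frames; normalizing each frame over the non-time axes only
# means a prefix of the input produces a prefix of the output.
import numpy as np

def layernorm(x, axes):
    mean = x.mean(axis=axes, keepdims=True)
    std = x.std(axis=axes, keepdims=True)
    return (x - mean) / (std + 1e-5)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8, 4))  # (time, width, channels), toy sizes

# Per-frame norm: the output on a 5-frame prefix matches the first
# 5 frames of the full output -> computable in a streaming fashion.
full = layernorm(x, (1, 2))
prefix = layernorm(x[:5], (1, 2))
assert np.allclose(full[:5], prefix)

# Norm over the time axis as well: the prefix output differs from the
# full output's prefix -> not streamable.
full_t = layernorm(x, (0, 1, 2))
prefix_t = layernorm(x[:5], (0, 1, 2))
assert not np.allclose(full_t[:5], prefix_t)
```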
Thanks, @vineelpratap.
For LN (LayerNorm), can I simply remove the time dimension and reuse the parameters of the rest of the layers as is?
--localnrmlleftctx=300 is moot for the TDS+CTC architecture since LocalNorm (not to be confused with LayerNorm) isn't used anywhere in the model. Is my understanding correct?
I did the above two (converted LN 0 1 2 to LN 1 2 in the arch file and provided --localnrmlleftctx=300 in the config file) and ran the streaming TDS module conversion script, and was able to obtain an acoustic_model.bin; however, I get the following error. It looks like the outputs of the Flashlight and FBGEMM models don't match.
What additional changes need to be made?
Thanks!
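For reference, the two edits above can be sketched as follows. The arch lines here are toy stand-ins taken from module names in the converter log below, not the full sota/2019 arch file, and the config flag is the one the streaming_convnets recipe uses:

```python
# Sketch of the two edits: the arch lines are illustrative stand-ins,
# not the real recipe file contents.
arch = "\n".join([
    "SAUG 80 27 2 100 1.0 2",
    "V -1 NFEAT 1 0",
    "LN 0 1 2",
    "DO 0.0",
])

# 1) Arch file: drop the time axis from every LayerNorm line.
arch = arch.replace("LN 0 1 2", "LN 1 2")

# 2) Config file: enable local input normalization with 300 frames of
#    left context.
extra_flag = "--localnrmlleftctx=300"

assert "LN 1 2" in arch and "LN 0 1 2" not in arch
```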
/home/w2luser/Projects/wav2letter/cmake-build-debug-fbgemm/tools/streaming_tds_model_converter --am /data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin --outdir /home/w2luser/models --flagsfile /home/w2luser/Projects/wav2letter/recipes/models/sota/2019/librispeech/train_am_tds_ctc.cfg --logtostderr=1
I1117 14:52:10.108525 53902 StreamingTDSModelConverter.cpp:152] [Network] Reading acoustic model from /data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin
I1117 14:52:10.856041 53902 StreamingTDSModelConverter.cpp:157] [Network] Sequential [input -> (0) -> (1) -> (2) -> (3) -> (4) -> (5) -> (6) -> (7) -> (8) -> (9) -> (10) -> (11) -> (12) -> (13) -> (14) -> (15) -> (16) -> (17) -> (18) -> (19) -> (20) -> (21) -> (22) -> (23) -> (24) -> (25) -> (26) -> (27) -> (28) -> (29) -> (30) -> (31) -> (32) -> (33) -> (34) -> (35) -> (36) -> (37) -> output]
(0): SpecAugment ( W: 80, F: 27, mF: 2, T: 100, p: 1, mT: 2 )
(1): View (-1 80 1 0)
(2): Conv2D (1->10, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(3): ReLU
(4): Dropout (0.000000)
(5): LayerNorm ( axis : { 0 1 2 } , size : -1)
(6): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(7): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(8): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(9): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(10): Time-Depth Separable Block (21, 240, 10) [800 -> 2400 -> 800]
(11): Conv2D (10->14, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(12): ReLU
(13): Dropout (0.000000)
(14): LayerNorm ( axis : { 0 1 2 } , size : -1)
(15): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(16): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(17): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(18): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(19): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(20): Time-Depth Separable Block (21, 240, 14) [1120 -> 3360 -> 1120]
(21): Conv2D (14->18, 21x1, 2,1, SAME,SAME, 1, 1) (with bias)
(22): ReLU
(23): Dropout (0.000000)
(24): LayerNorm ( axis : { 0 1 2 } , size : -1)
(25): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(26): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(27): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(28): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(29): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(30): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(31): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(32): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(33): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(34): Time-Depth Separable Block (21, 240, 18) [1440 -> 4320 -> 1440]
(35): View (0 1440 1 0)
(36): Reorder (1,0,3,2)
(37): Linear (1440->9998) (with bias)
I1117 14:52:10.856158 53902 StreamingTDSModelConverter.cpp:158] [Criterion] ConnectionistTemporalClassificationCriterion
I1117 14:52:10.856165 53902 StreamingTDSModelConverter.cpp:159] [Network] Number of params: 203394122
I1117 14:52:10.856205 53902 StreamingTDSModelConverter.cpp:165] [Network] Updating flags from config file: /data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin
I1117 14:52:10.856637 53902 StreamingTDSModelConverter.cpp:174] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/data/podcaster/model/wav2letter/am_tds_ctc_librispeech_dev_other/am_tds_ctc_librispeech_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=am_arch/am_tds_ctc.arch; --archdir=/home/w2luser/Projects/wav2letter/recipes/models/sota/2019; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=4; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/home/w2luser/Projects/wav2letter/recipes/models/sota/2019/librispeech/train_am_tds_ctc.cfg; --framesizems=30; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1500; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/home/w2luser/w2l/am/librispeech-train+dev-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.29999999999999999; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; 
--maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=8338608; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.5; --netoptim=sgd; --noresample=false; --nthread=10; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=/checkpoint/qiantong/ls_200M/do0.15_l5.6.10_mid3.0_incDO/100_rndv; --rundir=[...]; --runname=am_tds_ctc_librispeech; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=2; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=200; --surround=; --tag=; --target=ltr; --test=; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/home/w2luser/w2l/am; --train=[DATA_DST]/lists/train-clean-100.lst,[DATA_DST]/lists/train-clean-360.lst,[DATA_DST]/lists/train-other-500.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=dev-clean:[DATA_DST]/lists/dev-clean.lst,dev-other:[DATA_DST]/lists/dev-other.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=64; --outdir=/home/w2luser/models; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; 
--stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I1117 14:52:10.876313 53902 StreamingTDSModelConverter.cpp:192] Number of classes (network): 9998
Skipping SpecAugment module: SAUG 80 27 2 100 1.0 2
Skipping View module: V -1 NFEAT 1 0
Skipping Dropout module: DO 0.0
Skipping Dropout module: DO 0.0
Skipping Dropout module: DO 0.0
Skipping View module: V 0 1440 1 0
Skipping Reorder module: RO 1 0 3 2
I1117 14:52:26.342659 53902 StreamingTDSModelConverter.cpp:289] Serializing acoustic model to '/home/w2luser/models/acoustic_model.bin'
I1117 14:52:36.974776 53902 StreamingTDSModelConverter.cpp:301] Writing tokens file to '/home/w2luser/models/tokens.txt'
I1117 14:52:36.977149 53902 StreamingTDSModelConverter.cpp:328] Serializing feature extraction model to '/home/w2luser/models/feature_extractor.bin'
I1117 14:52:36.980671 53902 StreamingTDSModelConverter.cpp:344] verifying serialization ...
F1117 14:52:37.219713 53902 StreamingTDSModelConverter.cpp:368] [Serialization Error] Mismatched output w2l:2.72653 vs streaming:12.5302
*** Check failure stack trace: ***
@ 0x7f4f9d8441c3 google::LogMessage::Fail()
@ 0x7f4f9d84925b google::LogMessage::SendToLog()
@ 0x7f4f9d843ebf google::LogMessage::Flush()
@ 0x7f4f9d8446ef google::LogMessageFatal::~LogMessageFatal()
@ 0x55f014b84301 main
@ 0x7f4f9d1eccb2 __libc_start_main
@ 0x55f014b80ade _start
Process finished with exit code 134 (interrupted by signal 6: SIGABRT)
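For context, the fatal check at StreamingTDSModelConverter.cpp:368 runs the same input through the original Flashlight model and the serialized streaming model and aborts when the outputs disagree. A rough sketch of that kind of check (my illustration; the function name, tolerance, and error handling are assumptions, not the converter's actual code):

```python
import numpy as np

def verify_outputs(w2l_out, streaming_out, tol=1e-3):
    # Both models are assumed to have been forwarded on the same
    # features upstream; here we only compare the resulting outputs.
    diff = float(np.max(np.abs(np.asarray(w2l_out) - np.asarray(streaming_out))))
    if diff > tol:
        raise RuntimeError(
            "[Serialization Error] Mismatched output "
            f"w2l:{w2l_out} vs streaming:{streaming_out}")
    return diff

# The failure in the log (2.72653 vs 12.5302) is a large disagreement,
# which suggests a structural mismatch (e.g. normalization handled
# differently in the two models) rather than numerical noise.
```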
Hey @vineelpratap,
Do you have any advice regarding the above?
Thanks!
hi @abhinavkulkarni
Question
Hi,
Other than the architecture, what is the difference between the sota/2019/am_tds_ctc and streaming_convnets/librispeech/am_500ms_future_context models?
I am able to convert the latter to an FBGEMM streaming convnet using the conversion tool; however, I got the following error when I tried converting the former:
I was under the impression that any TDS CTC model could be converted to FBGEMM streaming convnets.
Thanks!