flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Error when running streaming_tds_model_converter #855

Closed tetiana-myronivska closed 4 years ago

tetiana-myronivska commented 4 years ago

Question

Hi, I am serializing a model that we fine-tuned on our own data on top of the pre-trained TDS CTC model (LibriSpeech), and I am running into the error "F1008 06:12:20.592749 3899 StreamingTDSModelConverter.cpp:246] Unsupported LayerNorm axis: must be {1, 2} for streaming". Do you know what might be the issue here?

Here is the command that I am running for streaming_tds_model_converter

/root/wav2letter/build_tools_test/tools/streaming_tds_model_converter \
-am  /root/w2l/001_model_last.bin \
--outdir /root/w2l/

And here is the error

I1008 06:12:18.792176  3899 StreamingTDSModelConverter.cpp:174] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/w2l/001_model_last.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=am_tds_ctc.arch; --archdir=/root/w2l/librispeech/2020-09-23_tds_ctc; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=32; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/root/w2l/configs/train/train_tds_ctc.cfg; --framesizems=30; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=10000000000; --itersave=true; --labelsmooth=0; --leftWindowSize=50; --lexicon=/root/w2l/librispeech/2020-09-23_tds_ctc/librispeech-train+dev-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=0; --localnrmlrightctx=0; --logadd=false; --lr=0.050000000000000003; --lr_decay=10000; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0.001; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=8338608; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.5; --netoptim=sgd; --noresample=false; --nthread=10; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=400; --rightWindowSize=50; --rndv_filepath=; --rundir=/root/w2l/jobs; --runname=am_tds_ctc_librispeech_epoch_9; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=2; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=7804; --surround=; --tag=; --target=ltr; --test=; --tokens=librispeech-train-all-unigram-10000.tokens; --tokensdir=/root/w2l/librispeech/2020-09-23_tds_ctc; --train=/root/data/w2v/segments/train.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; --usewordpiece=true; --valid=/root/data/w2v/segments/dev.lst; --validbatchsize=-1; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=64; --outdir=/root/w2l/; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; 
--logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I1008 06:12:18.798266  3899 StreamingTDSModelConverter.cpp:192] Number of classes (network): 9998
Skipping SpecAugment module: SAUG 80 27 2 100 1.0 2
Skipping View module: V -1 NFEAT 1 0
Skipping Dropout module: DO 0.0
F1008 06:12:20.592749  3899 StreamingTDSModelConverter.cpp:246] Unsupported LayerNorm axis: must be {1, 2} for streaming
*** Check failure stack trace: ***
    @     0x7f62f45d30cd  google::LogMessage::Fail()
    @     0x7f62f45d4f33  google::LogMessage::SendToLog()
    @     0x7f62f45d2c28  google::LogMessage::Flush()
    @     0x7f62f45d5999  google::LogMessageFatal::~LogMessageFatal()
    @     0x5584f34282ce  main
    @     0x7f62ec420b97  __libc_start_main
    @     0x5584f34978da  _start
Aborted (core dumped)

Additional Context

Arch file

SAUG 80 27 2 100 1.0 2
V -1 NFEAT 1 0
C2 1 10 21 1 2 1 -1 -1
R
DO 0.0
LN 0 1 2
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.05 2400
TDS 10 21 80 0.1 2400
TDS 10 21 80 0.1 2400
C2 10 14 21 1 2 1 -1 -1
R
DO 0.0
LN 0 1 2
TDS 14 21 80 0.15 3360
TDS 14 21 80 0.15 3360
TDS 14 21 80 0.15 3360
TDS 14 21 80 0.15 3360
TDS 14 21 80 0.15 3360
TDS 14 21 80 0.15 3360
C2 14 18 21 1 2 1 -1 -1
R
DO 0.0
LN 0 1 2
TDS 18 21 80 0.15 4320
TDS 18 21 80 0.15 4320
TDS 18 21 80 0.15 4320
TDS 18 21 80 0.15 4320
TDS 18 21 80 0.2 4320
TDS 18 21 80 0.2 4320
TDS 18 21 80 0.25 4320
TDS 18 21 80 0.25 4320
TDS 18 21 80 0.25 4320
TDS 18 21 80 0.25 4320
V 0 1440 1 0
RO 1 0 3 2
L 1440 NLABEL
tlikhomanenko commented 4 years ago

cc @vineelpratap. You are doing layer norm with LN 0 1 2, which normalizes across time as well (the 0-axis is time). This is impossible for streaming: in streaming ASR you don't have the full input, so you cannot do this kind of normalization. Here is the architecture of our streaming model, https://github.com/facebookresearch/wav2letter/blob/master/recipes/streaming_convnets/librispeech/am_500ms_future_context.arch; you can see that there is no LN across the 0-dim (time).
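
To make this concrete, here is a minimal numpy sketch (my own illustration, not wav2letter code, with the tensor simplified to time x feature): when the norm includes the time axis, the statistics computed on a streamed prefix differ from the offline full-utterance statistics, whereas a per-frame norm over the feature axis gives identical results chunk by chunk.

import numpy as np

rng = np.random.default_rng(0)
T, F = 100, 80                       # time frames x feature dim (e.g. 80 filterbanks)
x = rng.normal(size=(T, F))

def normalize(a, axes):
    # zero-mean, unit-variance over the given axes
    mean = a.mean(axis=axes, keepdims=True)
    std = a.std(axis=axes, keepdims=True)
    return (a - mean) / (std + 1e-5)

prefix = x[:10]                      # what a streaming model has seen so far

# norm over time + features ("LN 0 1 2" style, simplified): prefix != full utterance
full_time = normalize(x, axes=(0, 1))[:10]
stream_time = normalize(prefix, axes=(0, 1))
print(np.allclose(full_time, stream_time))   # False: stats depend on the whole input

# norm over features only ("LN 1 2" style, simplified): each frame is independent
full_feat = normalize(x, axes=1)[:10]
stream_feat = normalize(prefix, axes=1)
print(np.allclose(full_feat, stream_feat))   # True: streaming and offline match exactly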

tetiana-myronivska commented 4 years ago

I see it now. Thank you, it makes total sense. @vineelpratap @tlikhomanenko, is there a way to serialize the model with the architecture that I have, or would I need to re-train with a different architecture? Basically, what I am trying to do is use the trained model for inference in a Python environment.

tumusudheer commented 4 years ago

Hi @tlikhomanenko CC @vineelpratap

All the AM architectures in sota/2019 have LN 0 1 2, which, as you said, also normalizes across time (the 0-axis is time). Does that mean we cannot use any of these architectures for streaming ASR? Can we change the LN layer from LN 0 1 2 to LN 1 2, as in the streaming_convnets architecture, before we start training?

I'm aware the simple streaming example provided only works for the streaming_convnets model, but if we manage to write streaming code for the other architectures, such as transformers, tds_s2s, or resnet, can we change the LayerNorm in those architectures and train them for streaming ASR?

Thank you

tetiana-myronivska commented 4 years ago

I ended up re-training with the streaming TDS CTC architecture, and the model serialized successfully. It performs slightly worse than the non-streaming TDS CTC version, which makes sense given the smaller future context. Closing this issue now.

kerolos commented 4 years ago

I used the architecture am_tds_ctc.arch (https://github.com/facebookresearch/wav2letter/blob/master/recipes/sota/2019/am_arch/am_tds_ctc.arch) and changed the LN layer from LN 0 1 2 to LN 1 2, but I ended up with the error below, and the feature_extractor.bin file is empty. Could you please share the .cfg and .arch files that you used to train this TDS CTC model? I really appreciate your help. Thanks in advance @tetiana-myronivska

Error:

I1015 20:59:54.687757 21882 StreamingTDSModelConverter.cpp:192] Number of classes (network): 9998
Skipping SpecAugment module: SAUG 80 27 2 100 1.0 2
Skipping View module: V -1 NFEAT 1 0
Skipping Dropout module: DO 0.1
Skipping Dropout module: DO 0.2
Skipping Dropout module: DO 0.3
Skipping View module: V 0 1440 1 0
Skipping Reorder module: RO 1 0 3 2
I1015 20:59:59.256208 21882 StreamingTDSModelConverter.cpp:289] Serializing acoustic model to '/var/data/training/convert_models/sota_2019/tds_ctc/acoustic_model.bin'
I1015 21:00:01.054636 21882 StreamingTDSModelConverter.cpp:301] Writing tokens file to '/var/data/training/convert_models/sota_2019/tds_ctc/tokens.txt'
I1015 21:00:01.055555 21882 StreamingTDSModelConverter.cpp:328] Serializing feature extraction model to '/var/data/training/convert_models/sota_2019/tds_ctc/feature_extractor.bin'
F1015 21:00:01.135741 21882 StreamingTDSModelConverter.cpp:332] Local Norm should be used for online inference
*** Check failure stack trace: ***
    @     0x7f95b44d20cd  google::LogMessage::Fail()
    @     0x7f95b44d3f33  google::LogMessage::SendToLog()
    @     0x7f95b44d1c28  google::LogMessage::Flush()
    @     0x7f95b44d4999  google::LogMessageFatal::~LogMessageFatal()
    @     0x55680e93f4e0  main
    @     0x7f95affe3b97  __libc_start_main
    @     0x55680e9ad80a  _start
Aborted (core dumped)

tlikhomanenko commented 4 years ago

> All the AM architectures in sota/2019 have LN 0 1 2, which, as you said, also normalizes across time (the 0-axis is time). Does that mean we cannot use any of these architectures for streaming ASR? Can we change the LN layer from LN 0 1 2 to LN 1 2, as in the streaming_convnets architecture, before we start training?
>
> I'm aware the simple streaming example provided only works for the streaming_convnets model, but if we manage to write streaming code for the other architectures, such as transformers, tds_s2s, or resnet, can we change the LayerNorm in those architectures and train them for streaming ASR?

Yep, you can simply change LN 0 1 2 to LN 1 2 and retrain the model. This could hurt performance a bit, but it should train well too.

> I used the architecture am_tds_ctc.arch (https://github.com/facebookresearch/wav2letter/blob/master/recipes/sota/2019/am_arch/am_tds_ctc.arch) and changed the LN layer from LN 0 1 2 to LN 1 2, but I ended up with the error below, and the feature_extractor.bin file is empty. Could you please share the .cfg and .arch files that you used to train this TDS CTC model? I really appreciate your help. Thanks in advance @tetiana-myronivska
>
> [...] F1015 21:00:01.135741 21882 StreamingTDSModelConverter.cpp:332] Local Norm should be used for online inference [...]

You also need the --localnrmlleftctx parameter to be > zero; it is used for normalization, see https://github.com/facebookresearch/wav2letter/blob/v0.2/src/common/Defines.cpp#L72. cc @vineelpratap for more details here.
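
For intuition, here is a rough sketch of what a local (sliding-window) normalization with a left context does; this is my own illustration with a hypothetical local_normalize helper, not the wav2letter feature pipeline, and the exact semantics of --localnrmlleftctx / --localnrmlrightctx are defined in the Defines.cpp linked above.

import numpy as np

def local_normalize(feats, left_ctx, right_ctx=0, eps=1e-5):
    # feats: (T, F) array of features; context sizes are in frames.
    # Each frame t is normalized with mean/std taken over frames
    # [t - left_ctx, t + right_ctx], so only a bounded history is needed
    # and the result can be produced online, unlike full-utterance norm.
    T = feats.shape[0]
    out = np.empty_like(feats, dtype=np.float64)
    for t in range(T):
        lo, hi = max(0, t - left_ctx), min(T, t + right_ctx + 1)
        window = feats[lo:hi]
        out[t] = (feats[t] - window.mean()) / (window.std() + eps)
    return out

feats = np.random.default_rng(1).normal(size=(500, 80))   # fake 80-dim filterbank features
normalized = local_normalize(feats, left_ctx=100)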

kerolos commented 4 years ago

Thanks for your reply. Actually, the feature_extractor.bin file is not empty when I set this parameter (localnrmlleftctx) to 1, but I got the error below.

Error in model converter:

root@7671e22fb3ef:~/wav2letter/build_tools_test/tools# ./streaming_tds_model_converter \
    --am /var/data/training/training_models/sota_2019/am/tds_ctc_librispeech_workpiece_03LR_05M_4BS_1R_M/001_model_last.bin \
    --outdir /var/data/training/convert_models/sota_2019/tds_ctc/ \
    --localnrmlleftctx 1
I1016 09:35:21.977306 8723 StreamingTDSModelConverter.cpp:174] Gflags after parsing --flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/var/data/training/training_models/sota_2019/am/tds_ctc_librispeech_workpiece_03LR_05M_4BS_1R_M/001_model_last.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=/var/data/training/training_models/sota_2019/am_tds_ctc.arch; --archdir=; --attention=content; --attentionthreshold=2147483647; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=4; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=/var/data/training/training_models/sota_2019/train_am_tds_ctc.cfg; --framesizems=30; --framestridems=10; --gamma=0.5; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=9223372036854775807; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/var/data/librispeech/token_lexicon/wordpiece/librispeech-train+dev-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=1; --localnrmlrightctx=0; --logadd=false; --lr=0.29999999999999999; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=1; --maxisz=9223372036854775807; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=8338608; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=0; --minrate=3; --minsil=0; --mintsz=0; --momentum=0.5; --netoptim=sgd; --noresample=false; --nthread=15; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=100; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=0; --rightWindowSize=50; --rndv_filepath=; --rundir=; --runname=/var/data/training/training_models/sota_2019/am/tds_ctc_librispeech_workpiece; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=2; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=200; --surround=; --tag=03LR_05M_4BS_1R_XS; --target=tkn; --test=; --tokens=/var/data/librispeech/token_lexicon/wordpiece/librispeech-train-all-unigram-10000.tokens; --tokensdir=; --train=/var/data/librispeech/lists/train-clean-20.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --usememcache=false; --uselexicon=true; --usewordpiece=true; --valid=dev-clean:/var/data/librispeech/lists/dev-clean.lst; --validbatchsize=-1; --warmup=1; 
--weightdecay=0; --wordscore=0; --wordseparator=; --world_rank=0; --world_size=1; --outdir=/var/data/training/convert_models/sota_2019/tds_ctc/; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
I1016 09:35:21.981513  8723 StreamingTDSModelConverter.cpp:192] Number of classes (network): 9998
Skipping SpecAugment module: SAUG 80 27 2 100 1.0 2
Skipping View module: V -1 NFEAT 1 0
Skipping Dropout module: DO 0.1
Skipping Dropout module: DO 0.2
Skipping Dropout module: DO 0.3
Skipping View module: V 0 1440 1 0
Skipping Reorder module: RO 1 0 3 2
I1016 09:35:26.237170  8723 StreamingTDSModelConverter.cpp:289] Serializing acoustic model to '/var/data/training/convert_models/sota_2019/tds_ctc/acoustic_model.bin'
I1016 09:35:27.723588  8723 StreamingTDSModelConverter.cpp:301] Writing tokens file to '/var/data/training/convert_models/sota_2019/tds_ctc/tokens.txt'
I1016 09:35:27.724907  8723 StreamingTDSModelConverter.cpp:328] Serializing feature extraction model to '/var/data/training/convert_models/sota_2019/tds_ctc/feature_extractor.bin'
I1016 09:35:27.805680  8723 StreamingTDSModelConverter.cpp:344] verifying serialization ...
F1016 09:35:29.561218  8723 StreamingTDSModelConverter.cpp:368] [Serialization Error] Mismatched output w2l:9.49972 vs streaming:-0.331217
*** Check failure stack trace: ***
    @     0x7f3e7fc320cd  google::LogMessage::Fail()
    @     0x7f3e7fc33f33  google::LogMessage::SendToLog()
    @     0x7f3e7fc31c28  google::LogMessage::Flush()
    @     0x7f3e7fc34999  google::LogMessageFatal::~LogMessageFatal()
    @     0x55e65b0051fe  main
    @     0x7f3e7b743b97  __libc_start_main
    @     0x55e65b07480a  _start
Aborted (core dumped)

Inference output:

root@6164c36d5658:/# /root/wav2letter/build_tools_test/inference/inference/examples/simple_streaming_asr_example \
    --input_files_base_path /var/data/en/training/convert_models/sota_2019/tds_ctc/ \
    --input_audio_file /var/data/en/training/audio/test_audio/8461-258277-0011.wav

Started features model file loading ...
Completed features model file loading elapsed time=81166 microseconds

Started acoustic model file loading ...
Completed acoustic model file loading elapsed time=3715 milliseconds

Started tokens file loading ...
Completed tokens file loading elapsed time=764 microseconds

Tokens loaded - 9998 tokens
Started decoder options file loading ...
Completed decoder options file loading elapsed time=55 microseconds

Started create decoder ...
[Letters] 9998 tokens loaded.
[Words] 89612 words loaded.
Completed create decoder elapsed time=1890 milliseconds

Started converting audio input file=/var/data/en/training/audio/test_audio/8461-258277-0011.wav to text... ...
Creating LexiconDecoder instance.

start (msec), end(msec), transcription

0,1000,
1000,2000,
2000,2421,reflect
Completed converting audio input file=/var/data/en/training/audio/test_audio/8461-258277-0011.wav to text... elapsed time=417 milliseconds

tlikhomanenko commented 4 years ago

cc @vineelpratap @avidov