flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Inference own trained model on Docker CPU image #524

Closed sakares closed 4 years ago

sakares commented 4 years ago

Thanks so much for an amazing repository. I have finished training my own acoustic model from here, using the simple architecture from the tutorial here: wav2letter/tutorials/1-librispeech_clean/network.arch

I have also successfully reproduced this step with acoustic_model.bin from AWS S3.

One thing I am struggling with is how to use my own trained acoustic model in the following command:

./simple_streaming_asr_example --input_files_base_path ~/w2l/model/ --input_audio_file ~/w2l/test_audio.wav --acoustic_module_file ~/w2l/model/my_model.bin 

which gave this result:

Started features model file loading ... 
Completed features model file loading elapsed time=3780 microseconds

Started acoustic model file loading ... 
terminate called after throwing an instance of 'cereal::Exception'
  what():  Error while trying to deserialize a polymorphic pointer. Could not find type id 3
Aborted

Is the training recipe for acoustic_model.bin different from the one I used for my own AM?

How can I run inference with my own AM?

avidov commented 4 years ago

Thank you, Sakares. I am glad that you find it useful.

How did you create the serialized model that you are loading?

sakares commented 4 years ago

@avidov I created the serialized model by running the following command in the Docker GPU image:

/root/wav2letter/build/Train train --flagsfile /root/wav2letter/tutorials/librispeech_clean/train.cfg

with the following train.cfg:

--datadir=/root/w2l_dataset/lists
--rundir=/root/w2l_dataset
--archdir=/root/wav2letter/tutorials/1-librispeech_clean/
--train=train-clean-100.lst
--valid=dev-clean.lst
--input=flac
--arch=network.arch
--tokens=/root/w2l_dataset/am/tokens.txt
--lexicon=/root/w2l_dataset/am/lexicon.txt
--criterion=ctc
--lr=0.1
--maxgradnorm=1.0
--replabel=1
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=8
--batchsize=8
--runname=v2
--iter=5

It turned out I could successfully run Decode with my model ~/w2l/model/my_model.bin using decode.cfg:

/root/wav2letter/build/Decoder --flagsfile w2l_dataset/decode.cfg --logtostderr=1 --minloglevel=0

with results like:

|T|: but now nothing could hold me back
|P|: but again to that
[sample: test-clean-8463-294828-0004, WER: 85.7143%, LER: 70.5882%, slice WER: 85.7143%, slice LER: 70.5882%, decoded samples (thread 0): 1]
|T|: stuff it into you his belly counselled him
|P|: but it is you
[sample: test-clean-1089-134686-0001, WER: 75%, LER: 71.4286%, slice WER: 75%, slice LER: 71.4286%, decoded samples (thread 3): 1]
|T|: so it is said anders
|P|: so sir
[sample: test-clean-7021-85628-0017, WER: 80%, LER: 70%, slice WER: 80%, slice LER: 70%, decoded samples (thread 2): 1]
|T|: tied to a woman
|P|: at that
[sample: test-clean-121-121726-0013, WER: 100%, LER: 80%, slice WER: 90.9091%, slice LER: 73.4694%, decoded samples (thread 0): 2]
|T|: then he comes to the beak of it
|P|: any one of the peace
[sample: test-clean-1188-133604-0006, WER: 87.5%, LER: 61.2903%, slice WER: 81.25%, slice LER: 67.1233%, decoded samples (thread 3): 2]
|T|: that is a very fine cap you have he said
|P|: it is only one of the head
[sample: test-clean-7021-85628-0016, WER: 90%, LER: 55%, slice WER: 90%, slice LER: 55%, decoded samples (thread 1): 1]
------
[Decode mini-test-clean.lst (6 samples) in 3.08623s (actual decoding time 0.256s/sample) -- WER: 85.7143, LER: 66.4835]

I even re-ran "cmake .. [params and so on] && make -j8", but I still got the error when running:

./simple_streaming_asr_example --input_files_base_path ~/w2l/model/ --input_audio_file ~/w2l/test_audio.wav --acoustic_module_file ~/w2l/model/my_model.bin 

I am not sure whether simple_streaming_asr_example.cpp uses a different model serialization than Train.cpp.

vineelpratap commented 4 years ago

I am not sure whether simple_streaming_asr_example.cpp uses a different model serialization than Train.cpp.

Yes, it expects a different serialization format. If you train the models using the recipe from our paper, we provide a tool to convert the model to a serialization format that the streaming ASR example can load - see StreamingTDSModelConverter.cpp in https://github.com/facebookresearch/wav2letter/tree/master/tools.

Note that the tool outputs the acoustic model, feature extraction model, lexicon, and token set. You additionally need to provide the language model file and the optimal decoder parameter settings in --input_files_base_path. You can download our model from here to see everything that is needed - https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples
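A typical invocation looks roughly like this (a sketch only, with placeholder paths; check the README in the tools directory for the authoritative flag list):

streaming_tds_model_converter \
    -am /path/to/trained_am.bin \
    --archdir /path/to/model_dir --arch arch.txt \
    --tokensdir /path/to/model_dir --tokens tokens.txt \
    --outdir /path/to/out_model/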

sakares commented 4 years ago

If you train the models using the recipe from our paper, we provide a tool to convert the model to a serialization format that the streaming ASR example can load - see StreamingTDSModelConverter.cpp

@vineelpratap Yes! That is exactly what I was thinking about. Unfortunately, my builds of tools/StreamingTDSModelConverter have failed many times on the docker image wav2letter/wav2letter:cpu-latest.

Let me figure it out and update soon.

sakares commented 4 years ago

Update: after trying to make StreamingTDSModelConverter, I worked around the failure by copying the inference directory into src with the following script:

export KENLM_ROOT_DIR=/root/kenlm && \
cd /root/ && \
cp -rf wav2letter/inference wav2letter/src/ && \
mkdir wav2letter/build_tools_test && cd wav2letter/build_tools_test && \
cmake .. -DCMAKE_BUILD_TYPE=Release -DW2L_LIBRARIES_USE_CUDA=OFF -DW2L_BUILD_INFERENCE=ON -DW2L_BUILD_TOOLS=ON && \
make streaming_tds_model_converter

since building wav2letter/tools/StreamingTDSModelConverter.cpp failed with a missing feature.h during the make phase.

Note: I am about to reproduce streaming_convnets on the docker image wav2letter/wav2letter:cpu-latest.

sakares commented 4 years ago

Update: I have been able to reproduce streaming_convnets and successfully convert the model with StreamingTDSModelConverter for inference.

Note: I built StreamingTDSModelConverter with the workaround script above. Feel free to re-open this issue if necessary.

cri5Castro commented 4 years ago

What about the new interactive streaming example? Is it still only compatible with TDS+CTC models?

sakares commented 4 years ago

I think it would support any acoustic and feature extraction model, whether TDS, CTC, ASG, or something else, but you need to write a script that converts the acoustic model and feature extractor into the proper format for the simple ASR example or the interactive ASR example.

The StreamingTDSModelConverter code base could be a good starting point.

avidov commented 4 years ago

Inference modules are a bit simpler than training modules. They do not need support for autograd, backpropagation, or GPUs; instead, they are architected for streaming and backend flexibility. Since the modules are a bit different, the serialized format is a bit different as well. As Sakares mentioned above, we have the streaming_tds_model_converter tool to convert a serialized training module into a module that can be used for inference. See usage details at: https://github.com/facebookresearch/wav2letter/blob/2ee1d58d6c39bbe583773185dd6df90cb4d4c474/tools/README.md#streaming-tds-model-conversion-for-running-inference-pipeline
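As a rough orientation, once the converter output and the extra decoder inputs are in place, the --input_files_base_path directory ends up containing something like the following (illustrative names; compare with the downloadable example package from the wiki):

acoustic_model.bin      # converter output
feature_extractor.bin   # converter output
tokens.txt              # converter output
lexicon.txt             # decoder lexicon, provided by you
language_model.bin      # n-gram LM, provided by you
decoder_options.json    # decoder parameter settings, provided by you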

mironnn commented 4 years ago

Update: after trying to make StreamingTDSModelConverter, I worked around the failure by copying the inference directory into src with the following script:

export KENLM_ROOT_DIR=/root/kenlm && \
cd /root/ && \
cp -rf wav2letter/inference wav2letter/src/ && \
mkdir wav2letter/build_tools_test && cd wav2letter/build_tools_test && \
cmake .. -DCMAKE_BUILD_TYPE=Release -DW2L_LIBRARIES_USE_CUDA=OFF -DW2L_BUILD_INFERENCE=ON -DW2L_BUILD_TOOLS=ON && \
make streaming_tds_model_converter

since building wav2letter/tools/StreamingTDSModelConverter.cpp failed with a missing feature.h during the make phase.

Note: I am about to reproduce streaming_convnets on the docker image wav2letter/wav2letter:cpu-latest.

Thank you for your script! After the compilation I got this error:

streaming_tds_model_converter: error while loading shared libraries: libmkl_rt.so: cannot open shared object file: No such file or directory

This helped, maybe it will help somebody else:

export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64:$LD_LIBRARY_PATH

But when I downloaded the pre-trained streaming_convnets model and tried to convert it, I received an error. Could you please share your experience: how did you convert the model?


root@a5d94a65e709:~/wav2letter/build_tools_test/tools# ./streaming_tds_model_converter -am /root/host/model/ --outdir /root/host/out_model/
E0723 11:47:21.968611  3031 Serial.h:77] Error while loading "/root/host/model/": basic_filebuf::underflow error reading the file: iostream error
E0723 11:47:22.974591  3031 Serial.h:77] Error while loading "/root/host/model/": basic_filebuf::underflow error reading the file: iostream error
E0723 11:47:24.978430  3031 Serial.h:77] Error while loading "/root/host/model/": basic_filebuf::underflow error reading the file: iostream error
E0723 11:47:28.983088  3031 Serial.h:77] Error while loading "/root/host/model/": basic_filebuf::underflow error reading the file: iostream error
E0723 11:47:36.985092  3031 Serial.h:77] Error while loading "/root/host/model/": basic_filebuf::underflow error reading the file: iostream error
E0723 11:47:52.987134  3031 Serial.h:77] Error while loading "/root/host/model/": basic_filebuf::underflow error reading the file: iostream error
terminate called after throwing an instance of 'std::__ios_failure'
  what():  basic_filebuf::underflow error reading the file: iostream error
*** Aborted at 1595504872 (unix time) try "date -d @1595504872" if you are using GNU date ***
PC: @     0x7fac3cd69e97 gsignal
*** SIGABRT (@0xbd7) received by PID 3031 (TID 0x7fac4819e840) from PID 3031; stack trace: ***
    @     0x7fac3e570890 (unknown)
    @     0x7fac3cd69e97 gsignal
    @     0x7fac3cd6b801 abort
    @     0x7fac3d75e957 (unknown)
    @     0x7fac3d764ae6 (unknown)
    @     0x7fac3d764b21 std::terminate()
    @     0x7fac3d764da9 __cxa_rethrow
    @     0x55aa47b2a91b (unknown)
    @     0x7fac3cd4cb97 __libc_start_main
    @     0x55aa47b9697a (unknown)
Aborted
tlikhomanenko commented 4 years ago

Here you need to specify the full path to the model, not just the directory it is in: --am /root/host/model/model.bin
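For example, for the streaming_convnets model file:

./streaming_tds_model_converter -am /root/host/model/am_500ms_future_context_dev_other.bin --outdir /root/host/out_model/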

mironnn commented 4 years ago

Here you need to specify the full path to the model, not just the directory it is in: --am /root/host/model/model.bin

Thank you for your help.

Now I get this error: Invalid dictionary filepath specified tokens.txt

root@1e4cb9f61569:~/host# ./streaming_tds_model_converter -am /root/host/model/am_500ms_future_context_dev_other.bin --outdir /root/host/out_model/
./streaming_tds_model_converter: error while loading shared libraries: libmkl_rt.so: cannot open shared object file: No such file or directory
root@1e4cb9f61569:~/host# export LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2018.5.274/linux/mkl/lib/intel64:$LD_IBRARY_PATH
root@1e4cb9f61569:~/host# ./streaming_tds_model_converter -am /root/host/model/am_500ms_future_context_dev_other.bin --outdir /root/host/out_model/
I0724 09:58:19.420972    18 StreamingTDSModelConverter.cpp:174] Gflags after parsing
--flagfile=; --fromenv=; --tryfromenv=; --undefok=; --tab_completion_columns=80; --tab_completion_word=; --help=false; --helpfull=false; --helpmatch=; --helpon=; --helppackage=false; --helpshort=false; --helpxml=false; --version=false; --adambeta1=0.90000000000000002; --adambeta2=0.999; --am=/root/host/model/am_500ms_future_context_dev_other.bin; --am_decoder_tr_dropout=0; --am_decoder_tr_layerdrop=0; --am_decoder_tr_layers=1; --arch=arch.txt; --archdir=; --attention=content; --attentionthreshold=0; --attnWindow=no; --attnconvchannel=0; --attnconvkernel=0; --attndim=0; --batchsize=8; --beamsize=2500; --beamsizetoken=250000; --beamthreshold=25; --blobdata=false; --channels=1; --criterion=ctc; --critoptim=sgd; --datadir=; --dataorder=input; --decoderattnround=1; --decoderdropout=0; --decoderrnnlayer=1; --decodertype=wrd; --devwin=0; --emission_dir=; --emission_queue_size=3000; --enable_distributed=true; --encoderdim=0; --eosscore=0; --eostoken=false; --everstoredb=false; --fftcachesize=1; --filterbanks=80; --flagsfile=; --framesizems=25; --framestridems=10; --gamma=1; --gumbeltemperature=1; --input=flac; --inputbinsize=100; --inputfeeding=false; --isbeamdump=false; --iter=1000000; --itersave=false; --labelsmooth=0; --leftWindowSize=50; --lexicon=/checkpoint/antares/wav2letter/recipes/models/seq2seq_tds/librispeech/am/librispeech-train+dev-unigram-10000-nbest10.lexicon; --linlr=-1; --linlrcrit=-1; --linseg=0; --lm=; --lm_memory=5000; --lm_vocab=; --lmtype=kenlm; --lmweight=0; --localnrmlleftctx=300; --localnrmlrightctx=0; --logadd=false; --lr=0.40000000000000002; --lr_decay=9223372036854775807; --lr_decay_step=9223372036854775807; --lrcosine=false; --lrcrit=0; --max_devices_per_node=8; --maxdecoderoutputlen=200; --maxgradnorm=0.5; --maxisz=33000; --maxload=-1; --maxrate=10; --maxsil=50; --maxtsz=9223372036854775807; --maxword=-1; --melfloor=1; --memstepsize=10485760; --mfcc=false; --mfcccoeffs=13; --mfsc=true; --minisz=200; --minrate=3; --minsil=0; --mintsz=2; --momentum=0; --netoptim=sgd; --noresample=false; --nthread=6; --nthread_decoder=1; --nthread_decoder_am_forward=1; --numattnhead=8; --onorm=target; --optimepsilon=1e-08; --optimrho=0.90000000000000002; --outputbinsize=5; --pctteacherforcing=100; --pcttraineval=1; --pow=false; --pretrainWindow=0; --replabel=0; --reportiters=2500; --rightWindowSize=50; --rndv_filepath=/checkpoint/vineelkpratap/experiments/speech/inference_tds//inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8/rndvz.21621542; --rundir=/checkpoint/vineelkpratap/experiments/speech/inference_tds/; --runname=inference_paper_500ms_do0.1_lr0.4_G32_archtds_k10s_d8_p100m_do0.1_saug_mln_500ms.arch_bch8; --samplerate=16000; --sampletarget=0; --samplingstrategy=rand; --saug_fmaskf=27; --saug_fmaskn=2; --saug_start_update=-1; --saug_tmaskn=2; --saug_tmaskp=1; --saug_tmaskt=100; --sclite=; --seed=0; --show=false; --showletters=false; --silscore=0; --smearing=none; --smoothingtemperature=1; --softwoffset=10; --softwrate=5; --softwstd=5; --sqnorm=true; --stepsize=1000000; --surround=; --tag=; --target=tkn; --test=; --tokens=tokens.txt; --tokensdir=; --train=/checkpoint/antares/datasets/librispeech/lists/train-clean-100.lst,/checkpoint/antares/datasets/librispeech/lists/train-clean-360.lst,/checkpoint/antares/datasets/librispeech/lists/train-other-500.lst,/checkpoint/vineelkpratap/experiments/speech/librivox.cut.sub36s.datasets.lst; --trainWithWindow=false; --transdiag=0; --unkscore=-inf; --use_memcache=false; --uselexicon=true; 
--usewordpiece=true; --valid=/checkpoint/antares/datasets/librispeech/lists/dev-clean.lst,/checkpoint/antares/datasets/librispeech/lists/dev-other.lst; --warmup=1; --weightdecay=0; --wordscore=0; --wordseparator=_; --world_rank=0; --world_size=32; --outdir=/root/host/out_model/; --alsologtoemail=; --alsologtostderr=false; --colorlogtostderr=false; --drop_log_memory=true; --log_backtrace_at=; --log_dir=; --log_link=; --log_prefix=true; --logbuflevel=0; --logbufsecs=30; --logemaillevel=999; --logfile_mode=436; --logmailer=/bin/mail; --logtostderr=true; --max_log_size=1800; --minloglevel=0; --stderrthreshold=2; --stop_logging_if_full_disk=false; --symbolize_stacktrace=true; --v=0; --vmodule=;
terminate called after throwing an instance of 'std::runtime_error'
  what():  Invalid dictionary filepath specified tokens.txt
*** Aborted at 1595584699 (unix time) try "date -d @1595584699" if you are using GNU date ***
PC: @     0x7f3301a0ee97 gsignal
*** SIGABRT (@0x12) received by PID 18 (TID 0x7f330ce43840) from PID 18; stack trace: ***
    @     0x7f3303215890 (unknown)
    @     0x7f3301a0ee97 gsignal
    @     0x7f3301a10801 abort
    @     0x7f3302403957 (unknown)
    @     0x7f3302409ae6 (unknown)
    @     0x7f3302409b21 std::terminate()
    @     0x7f3302409d54 __cxa_throw
    @     0x56357f082330 (unknown)
    @     0x7f33019f1b97 __libc_start_main
    @     0x56357f0ee97a (unknown)
Aborted

The model folder looks like this:


root@1e4cb9f61569:~/host# ll /root/host/model/
total 533888
drwxr-xr-x 9 root root       288 Jul 24 09:56 ./
drwxr-xr-x 6 root root       192 Jul 23 11:40 ../
-rw-r--r-- 1 root root      6148 Jul 23 14:13 .DS_Store
-rw-rw-rw- 1 root root  12746595 Jul 20 10:28 3-gram.pruned.3e-7.bin.qt
-rw-rw-rw- 1 root root       654 Jul 20 10:27 am_500ms_future_context.arch
-rw-rw-rw- 1 root root 460478967 Jul 20 10:28 am_500ms_future_context_dev_other.bin
-rw-rw-rw- 1 root root  42816162 Jul 20 10:28 decoder-unigram-10000-nbest10.lexicon
-rw-rw-rw- 1 root root  19537672 Jul 20 10:27 librispeech-train+dev-unigram-10000-nbest10.lexicon
-rw-rw-rw- 1 root root     82982 Jul 20 10:27 librispeech-train-all-unigram-10000.tokens
mironnn commented 4 years ago

I renamed librispeech-train-all-unigram-10000.tokens -> tokens.txt and am_500ms_future_context.arch -> arch.txt, and was able to run the converter with this command:

./streaming_tds_model_converter -am /root/host/model/am_500ms_future_context_dev_other.bin --outdir /root/host/out_model/ --tokensdir /root/host/model --archdir /root/host/model

and received three files:

253M Jul 24 15:02 acoustic_model.bin
130B Jul 24 15:02 feature_extractor.bin
81K Jul 24 15:02 tokens.txt
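For reference, the renames as shell commands (cp rather than mv, to keep the originals; tokens.txt and arch.txt match the converter's default --tokens and --arch values visible in the gflags dump above):

cd /root/host/model
cp librispeech-train-all-unigram-10000.tokens tokens.txt
cp am_500ms_future_context.arch arch.txt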

But when I tried to run it with simple_streaming_asr_example, I received:

failed to open decoder options file=/root/host/converted_streaming_model/decoder_options.json for reading

Where or how should I get decoder_options.json?

I used streaming_convnets, and I think it provides only the acoustic model.

tlikhomanenko commented 4 years ago

Inference runs not only the acoustic model forward pass but also beam-search decoding with an n-gram LM. Please have a look at the tutorial here: https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples

For streaming_convnets we also have beam-search decoding on top of the acoustic model: https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/streaming_convnets/librispeech/decode_500ms_right_future_ngram_other.cfg

mironnn commented 4 years ago

Inference runs not only the acoustic model forward pass but also beam-search decoding with an n-gram LM. Please have a look at the tutorial here: https://github.com/facebookresearch/wav2letter/wiki/Inference-Run-Examples

For streaming_convnets we also have beam-search decoding on top of the acoustic model: https://github.com/facebookresearch/wav2letter/blob/master/recipes/models/streaming_convnets/librispeech/decode_500ms_right_future_ngram_other.cfg

Hi, thank you for your help. I took the language model 3-gram.pruned.3e-7.bin.qt, renamed it to language_model.bin, and made decoder_options.json with this content:

{
  "am": "acoustic_model.bin",
  "tokensDir": ".",
  "tokens": "tokens.txt",
  "lexicon": "lexicon.txt",
  "useLexicon": true,
  "decoderType": "wrd",
  "lmType": "kenlm",
  "lmWeight": 0.674,
  "wordScore" : 0.628,
  "unkScore" : -Infinity,
  "silScore": 0,
  "beamSize": 100,
  "beamSizeToken": 100,
  "beamThreshold": 100,
  "nthread_decoder": 8,
  "smearing": "max",
  "eosScore" : 0.0,
  "logAdd" : false,
  "criterionType" : "CTC"
}
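For completeness, I started it roughly like this (the --input_files_base_path flag as in the earlier examples in this thread; audio piped over stdin, which matches the log below):

./simple_streaming_asr_example --input_files_base_path /root/host/converted_streaming_model < test_audio.wav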

It failed with this error:

Started converting audio input from stdin to text... ... 
Creating LexiconDecoder instance.
#start (msec), end(msec), transcription
terminate called after throwing an instance of 'std::invalid_argument'
  what():  size must be devisible in alphabet size in Decoder::run(input=59988, size=59988) alphabet size=9997
Aborted

Could you please advise where I'm wrong?
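One observation, in case it is relevant (just my guess, not a confirmed diagnosis): 59988 = 6 x 9998, while 59988 is not divisible by the reported alphabet size of 9997, so the acoustic model seems to emit 9998 classes per frame while tokens.txt supplies 9997 entries; the two appear to be off by one.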

tlikhomanenko commented 4 years ago

cc @avidov @vineelpratap