alumae / kaldi-gstreamer-server

Real-time full-duplex speech recognition server, based on the Kaldi toolkit and the GStreamer framework.
BSD 2-Clause "Simplified" License

Results of nnet3 #157

Closed Umar17 closed 5 years ago

Umar17 commented 5 years ago

Hello all, I trained a simple nnet3 model using train_dnn.py, which gave perfect results when I decoded audio files with steps/online/nnet3/decode.sh after preparing the online directory (using steps/online/nnet3/prepare_online_decoding.sh). But when I try to use the same model with GStreamer, the results on the same audio files are very strange. My YAML file is as follows:

```yaml
use-nnet2: True
decoder:
    use-threaded-decoder: True
    nnet-mode: 3
    model: test/models/nnet_online3/final.mdl
    word-syms: test/models/nnet_online3/words.txt
    fst: test/models/nnet_online3/HCLG.fst
    mfcc-config: test/models/nnet_online3/conf/mfcc.conf
    ivector-extraction-config: test/models/nnet_online3/conf/ivector_extractor.conf
    min-active: 200
    max-active: 7000
    beam: 15.0
    lattice-beam: 6.0
    acoustic-scale: 0.1
    do-endpointing: false
    endpoint-silence-phones: "1:2:3:4:5:6:7:8:9:10:11:12:13:14:15"
    traceback-period-in-secs: 0.25
    chunk-length-in-secs: 0.25
    num-nbest: 10
out-dir: tmp

use-vad: False
silence-timeout: 10

post-processor: perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\1./;'

logging:
    version: 1
    disable_existing_loggers: False
    formatters:
        simpleFormater:
            format: '%(asctime)s - %(levelname)7s: %(name)10s: %(message)s'
            datefmt: '%Y-%m-%d %H:%M:%S'
    handlers:
        console:
            class: logging.StreamHandler
            formatter: simpleFormater
            level: DEBUG
    root:
        level: DEBUG
        handlers: [console]
```

And the decoder configuration for decode.sh is as follows:

```
online2-wav-nnet3-latgen-faster --do-endpointing=false --frames-per-chunk=20 --extra-left-context-initial=0 --online=true --config=exp/nnet3_online/conf/online.conf --min-active=200 --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 --word-symbol-table=exp/tri3/graph//words.txt exp/nnet3_online/final.mdl exp/tri3/graph//HCLG.fst
```

And I am decoding a file against kaldi-gstreamer-server from the command line:

```
python kaldigstserver/client.py -r 32000 testFiles/RS12_B10_F_UTD_17.wav
```

while my wav file is sampled at 16K.

Can anyone please help me find where I am making a mistake? Or where does the decoding of kaldi-gstreamer-server differ from online/nnet3/decode.sh? (I am using the same graph, words, and phones files for the decode script and kaldi-gstreamer-server.)

gilamsalem commented 5 years ago

I am not sure that you should use "-r 32000". Try using 16000 or 8000, and let us know if it changes anything.

alumae commented 5 years ago

The "- 32000" sets the byte rate at which data is sent to the server. It shouldn't change the results at all.

What kind of nnet3 model is it? TDNN or BLSTM? Is it a chain model?

Umar17 commented 5 years ago

Of course. Even if the byte rate affected the results, the byte rate here is correct (16K samples/s * 16 bits / 8 = 32K bytes/s). @alumae it's a TDNN model (as mentioned in the post), trained by steps/nnet3/train_dnn.py.
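
For what it's worth, here is a minimal Python sketch (standard library only; the wav path is just the test file above) that derives the `-r` byte rate from the wav header:

```python
import wave

# Read the wav header and compute the byte rate expected by client.py -r.
with wave.open("testFiles/RS12_B10_F_UTD_17.wav", "rb") as w:
    byte_rate = w.getframerate() * w.getsampwidth() * w.getnchannels()
    # e.g. 16000 samples/s * 2 bytes/sample * 1 channel = 32000 bytes/s
    print(byte_rate)
```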


alumae commented 5 years ago

Are the results using kaldi-gstreamer-server totally off (i.e., does it produce total garbage), or just much worse than using pure Kaldi?

Umar17 commented 5 years ago

Total garbage.

alumae commented 5 years ago

Then I suspect that the words.txt file that you are using with the server is not the same as the one you are using with native Kaldi.

Umar17 commented 5 years ago

I have double-checked that the files are the same. Through the server, I am getting just 3-4 words even for 10-second-long utterances. If words.txt were the issue, it should at least decode long garbage for a long utterance.
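
To make the check mechanical, one could compare the two symbol tables byte for byte; a sketch, assuming the server copy lives at the path from the YAML above and the native Kaldi copy under exp/tri3/graph:

```python
import filecmp

# Byte-for-byte comparison of the words.txt used by the server and by native Kaldi.
same = filecmp.cmp("test/models/nnet_online3/words.txt",
                   "exp/tri3/graph/words.txt",
                   shallow=False)
print("words.txt identical:", same)
```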

alumae commented 5 years ago

Then perhaps it's the ivector extractor that's wrong in the server conf?

Umar17 commented 5 years ago

My guess was about the post-processor or the trace back. Could that be it? I roughly noticed that the output becomes worse after it is traced back and post-processed (it even reduces the length of the initially decoded output).

Umar17 commented 5 years ago

What can be wrong with the i-vectors? Offline Kaldi is also using the same i-vector conf and models.

alumae commented 5 years ago

Perhaps you have trained many different i-vector extractors and are using the wrong one in the server?
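
One quick way to rule this out is to dump the files the server's i-vector config actually points at; a rough sketch, assuming the conf is a plain Kaldi options file with --key=value lines:

```python
import os

# Print every path-like option in the i-vector extractor config and flag missing files.
with open("test/models/nnet_online3/conf/ivector_extractor.conf") as conf:
    for line in conf:
        line = line.strip()
        if line.startswith("--") and "=" in line:
            key, value = line[2:].split("=", 1)
            if "/" in value:  # heuristic: treat values containing a slash as paths
                status = "OK" if os.path.exists(value) else "MISSING"
                print(f"{key}: {value} [{status}]")
```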

alumae commented 5 years ago

It's not the post-processor -- it just appends "." to the results.
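
For reference, the perl one-liner is roughly equivalent to this Python rendering (illustrative only, not code from the project):

```python
import sys

# Unbuffered line filter: append "." to each hypothesis, like the perl post-processor.
for line in sys.stdin:
    sys.stdout.write(line.rstrip("\n") + ".\n")
    sys.stdout.flush()  # mirrors STDOUT->autoflush(1)
```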

I don't know what you mean by trace backing.

Umar17 commented 5 years ago

By trace backing, I mean this property of the YAML file: traceback-period-in-secs: 0.25. Anyway, thanks for all your cooperation. I will check everything once again, since you are so sure, and will report back on my progress. Thanks once again.


Umar17 commented 5 years ago

I rebuilt the whole model once again and it's working now. Thanks for the cooperation.

boleamol commented 5 years ago

Hi @Umar17, I am also trying to use gstreamer with nnet3, but I am getting completely (100%) wrong results. I am using the steps/nnet3/chain/train.py script, which builds a chain+TDNN model. When I used decode.sh with the same model, it gave 10% WER, i.e. 90% accuracy. Can you tell me what to do in this case? Also, which script did you use to build your model, and which parameters did you modify to get that accuracy? Waiting for your response...

Umar17 commented 5 years ago

I used wsj/s5/local/chain/run_tdnn.sh for the chain TDNN. The issue I faced was due to the language model.

boleamol commented 5 years ago

Thank you for your valuable response @Umar17. Can you tell me how many words were in your dictionary? Also, is there a similar example in Kaldi so I can try the same? We are using a trigram language model (built with the IRSTLM toolkit); did you use the same or a different one?

Umar17 commented 5 years ago

My vocabulary size was around 200K. I'd suggest you follow the Kaldi e2e model development scripts at https://github.com/kaldi-asr/kaldi/tree/master/egs/wsj/s5/local/e2e. Kaldi itself uses SRILM. However, if you are getting good results with your LM, then IRSTLM can be used.


boleamol commented 5 years ago

Thank you for your response. I will try the same and get back to you. Thank you once again.

boleamol commented 5 years ago

Thank you @Umar17, I built the model and the system is working fine. But there is one small issue: time delay. I am testing the system live using a microphone. After speaking a sentence, I have to wait 10-15 seconds for the output to be displayed. Also, it sometimes gets a few words wrong or misses them. How can I resolve this?

While monitoring the worker, I sometimes see the error "Error, no surviving tokens: frame is -1".

Umar17 commented 5 years ago

Sorry, I don't have experience with the Kaldi e2e models. However, I may be able to suggest something if you can share how you are using it.


boleamol commented 5 years ago

I built a TDNN chain model using the kaldi/egs/commonvoice recipe for my own data. I am getting good accuracy (85-90%) with this model. Actually, I am working on telephonic speech, which has an 8kHz sampling frequency. I have around 400 hours of data. Before using this model, I modified the sampling frequency in the Kaldi source (kaldi/src/feat/pitch-functions.h, then kaldi/src/gst-plugin and kaldi/src/onlinebin) and recompiled Kaldi. Later I downloaded the gst-kaldi-nnet2-online source and compiled it against that Kaldi path. My YAML file contains the following:

```yaml
use-nnet2: True
decoder:
    # All the properties nested here correspond to the kaldinnet2onlinedecoder GStreamer plugin properties.
    # Use gst-inspect-1.0 ./libgstkaldionline2.so kaldinnet2onlinedecoder to discover the available properties
    use-threaded-decoder: true
    nnet-mode: 3
    model: test/model_cdac/final.mdl
    word-syms: test/model_cdac/words.txt
    fst: test/model_cdac/HCLG.fst
    mfcc-config: test/model_cdac/conf/mfcc.conf
    ivector-extraction-config: test/model_cdac/conf/ivector_extractor.conf
    max-active: 7000
    beam: 15.0
    lattice-beam: 6.0
    # acoustic-scale: 0.055
    acoustic-scale: 0.083
    do-endpointing: false
    endpoint-silence-phones: "1:2:3:4:5:6:7:8:9:10"
    traceback-period-in-secs: 0.25
    chunk-length-in-secs: 0.25
    frame-subsampling-factor: 3
    num-nbest: 10
    # Additional functionality that you can play with:
    # lm-fst: test/model_cdac/G.fst
    # big-lm-const-arpa: test/model_cdac/G.carpa
    # phone-syms: test/model_cdac/phones.txt
    word-boundary-file: test/model_cdac/graph/phones/word_boundary.int
    # do-phone-alignment: true
out-dir: tmp

use-vad: False
silence-timeout: 10

post-processor: perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\1./;'

full-post-processor: ./sample_full_post_processor.py

logging:
    version: 1
    disable_existing_loggers: False
    formatters:
        simpleFormater:
            format: '%(asctime)s - %(levelname)7s: %(name)10s: %(message)s'
            datefmt: '%Y-%m-%d %H:%M:%S'
    handlers:
        console:
            class: logging.StreamHandler
            formatter: simpleFormater
            level: DEBUG
    root:
        level: DEBUG
        handlers: [console]
```
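
With a config like this, the worker is started in the usual way (invocation as in the project README; nnet3.yaml here is a placeholder for whatever filename the config above is saved under):

```
python kaldigstserver/worker.py -u ws://localhost:8888/worker/ws/speech -c nnet3.yaml
```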

dpny518 commented 4 years ago

For chain models like Zamia's:

```yaml
frame-subsampling-factor: 3
acoustic-scale: 1.0
```