ldeseynes opened this issue 6 years ago
Hello ldeseynes, I'm glad you got decent results for French STT. I've also tried to use gst-kaldi-nnet2-online to decode French speech, but I got only a single word per call. When I tested the same model with offline decoding, it gave decent results. I configured worker.yaml like this:
```
use-nnet2: True
decoder:
    # Use gst-inspect-1.0 ./libgstkaldionline2.so kaldinnet2onlinedecoder to discover the available properties
    use-threaded-decoder: True
    model : test/models/french/librifrench/final.mdl
    word-syms : test/models/french/librifrench/words.txt
    fst : test/models/french/librifrench/HCLG.fst
    mfcc-config : test/models/french/librifrench/conf/mfcc.conf
    ivector-extraction-config : test/models/french/librifrench/conf/ivector_extractor.conf
    max-active: 10000
    beam: 40.0
    lattice-beam: 6.0
    acoustic-scale: 0.083
    do-endpointing : true
    endpoint-silence-phones : "1:2:3:4:5:6:7:8:9:10"
    traceback-period-in-secs: 0.25
    chunk-length-in-secs: 0.25
    num-nbest: 10
    # Additional functionality that you can play with:
    #lm-fst: test/models/english/librispeech_nnet_a_online/G.fst
    #big-lm-const-arpa: test/models/english/librispeech_nnet_a_online/G.carpa
    #phone-syms: test/models/english/librispeech_nnet_a_online/phones.txt
    #word-boundary-file: test/models/english/librispeech_nnet_a_online/word_boundary.int
    #do-phone-alignment: true
    out-dir: tmp
use-vad: False
silence-timeout: 10
post-processor: perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\1./;'
```

Then I got the following log messages:
Worker log:

```
2019-09-18 01:37:14 - INFO: __main__: d01660ed-3b16-4401-a941-187b3bceb971: Postprocessing done.
2019-09-18 01:37:14 - DEBUG: __main__: d01660ed-3b16-4401-a941-187b3bceb971: After postprocessing: {u'status': 0, u'segment-start': 5.18, u'segment-length': 3.54, u'total-length': 8.72, u'result': {u'hypotheses': [{u'likelihood': -6.27475, u'transcript': u'il.', 'original-transcript': u'il'}, {u'likelihood': -7.54526, u'transcript': u'ils.', 'original-transcript': u'ils'}, {u'likelihood': -8.89724, u'transcript': u'il jeta.', 'original-transcript': u'il jeta'}, {u'likelihood': -10.2568, u'transcript': u'il je.', 'original-transcript': u'il je'}, {u'likelihood': -10.3752, u'transcript': u"il j'.", 'original-transcript': u"il j'"}, {u'likelihood': -10.9077, u'transcript': u'il ne.', 'original-transcript': u'il ne'}, {u'likelihood': -11.0849, u'transcript': u'ils je.', 'original-transcript': u'ils je'}, {u'likelihood': -11.1516, u'transcript': u"ils j'.", 'original-transcript': u"ils j'"}, {u'likelihood': -11.2717, u'transcript': u'il me.', 'original-transcript': u'il me'}, {u'likelihood': -11.2749, u'transcript': u'de.', 'original-transcript': u'de'}], u'final': True}, 'segment': 0, 'id': u'd01660ed-3b16-4401-a941-187b3bceb971'}
```

Master server log:

```
INFO 2019-09-18 01:37:09,763 d01660ed-3b16-4401-a941-187b3bceb971: Sending event {u'status': 0, u'segment': 0, u'result': {u'hypotheses': [{u'transcript': u'de.'}], u'final': Fal... to client
INFO 2019-09-18 01:37:14,013 d01660ed-3b16-4401-a941-187b3bceb971: Sending event {u'status': 0, u'segment-start': 5.18, u'segment-length': 3.54, u'total-length': 8.72, u'result':... to client
INFO 2019-09-18 01:37:14,024 d01660ed-3b16-4401-a941-187b3bceb971: Sending event {u'status': 0, u'adaptation_state': {u'type': u'string+gzip+base64', u'id': u'd01660ed-3b16-4401-... to client
```

Client output:

```
Audio sent, now sending EOS
il.
```
I'd be very grateful if you could tell me what's wrong with my configuration. How can I get correct results? Thanks for your help.
Hi, in your yaml config file, set acoustic-scale to 1.0 and add frame-subsampling-factor: 3.
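In worker.yaml terms this would look roughly like the sketch below; only the two relevant keys are shown, and these are the usual values for chain-style models, so double-check them against whatever your training actually used:

```
decoder:
    acoustic-scale: 1.0          # sequence-trained (chain) models expect 1.0
    frame-subsampling-factor: 3  # must match the value used during training
```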
Hello, I changed the config file and tested again:

```
use-nnet2: True
decoder:
    # Use gst-inspect-1.0 ./libgstkaldionline2.so kaldinnet2onlinedecoder to discover the available properties
    use-threaded-decoder: True
    model : test/models/french/librifrench/final.mdl
    word-syms : test/models/french/librifrench/words.txt
    fst : test/models/french/librifrench/HCLG.fst
    mfcc-config : test/models/french/librifrench/conf/mfcc.conf
    ivector-extraction-config : test/models/french/librifrench/conf/ivector_extractor.conf
    max-active: 10000
    beam: 13.0
    lattice-beam: 8.0
    acoustic-scale: 1.0
    frame-subsampling-factor: 3
    #acoustic-scale: 0.083
    do-endpointing : true
    #endpoint-silence-phones : "1:2:3:4:5:6:7:8:9:10"
    endpoint-silence-phones : "1:2:3:4:5"
    traceback-period-in-secs: 0.25
    chunk-length-in-secs: 0.25
    num-nbest: 10
    # Additional functionality that you can play with:
    #lm-fst: test/models/english/librispeech_nnet_a_online/G.fst
    #big-lm-const-arpa: test/models/english/librispeech_nnet_a_online/G.carpa
    #phone-syms: test/models/english/librispeech_nnet_a_online/phones.txt
    #word-boundary-file: test/models/english/librispeech_nnet_a_online/word_boundary.int
    #do-phone-alignment: true
    out-dir: tmp
```

This gave the following result:

```
une.
l' ai.
jamais.
de.
Audio sent, now sending EOS
de.
une.
l' ai.
de.
```

Could you share your parameters? I can share my model. I hope you can help.
I just found the following command:

```
online2-wav-nnet2-latgen-faster --online=true --do-endpointing=false \
  --config=exp/nnet2_online/nnet_ms_a_online/conf/online_nnet2_decoding.conf \
  --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 \
  --word-symbol-table=exp/tri4b/graph_SRILM/words.txt \
  exp/nnet2_online/nnet_ms_a_online/final.mdl exp/tri4b/graph_SRILM/HCLG.fst \
  ark:data/test_hires/split8/1/spk2utt \
  'ark,s,cs:extract-segments scp,p:data/test_hires/split8/1/wav.scp data/test_hires/split8/1/segments ark:- |' \
  'ark:|gzip -c > exp/nnet2_online/nnet_ms_a_online/decode_SRILM/lat.1.gz'
```
It decodes audio files like this:

```
LOG (online2-wav-nnet2-latgen-faster[5.5.463~1-9f3d8]:main():online2-wav-nnet2-latgen-faster.cc:276) Decoded utterance 13-1410-0030
13-1410-0031 rez de
LOG (online2-wav-nnet2-latgen-faster[5.5.463~1-9f3d8]:main():online2-wav-nnet2-latgen-faster.cc:276) Decoded utterance 13-1410-0031
13-1410-0032 de
LOG (online2-wav-nnet2-latgen-faster[5.5.463~1-9f3d8]:main():online2-wav-nnet2-latgen-faster.cc:276) Decoded utterance 13-1410-0032
13-1410-0033 rit de
LOG (online2-wav-nnet2-latgen-faster[5.5.463~1-9f3d8]:main():online2-wav-nnet2-latgen-faster.cc:276) Decoded utterance 13-1410-0033
13-1410-0034 de
LOG (online2-wav-nnet2-latgen-faster[5.5.463~1-9f3d8]:main():online2-wav-nnet2-latgen-faster.cc:276) Decoded utterance 13-1410-0034
13-1410-0035 de
```
I would appreciate it if you could take a careful look at this.
Hi,
Here are the parameters I set, but I haven't used the system for a while. In your yaml file, you should add nnet-mode: 3. Also, check that you're decoding your audio file with the correct sample rate and number of channels.
```
use-threaded-decoder=true
nnet-mode=3
frame-subsampling-factor=3
acoustic-scale=1.0
model=models/final.mdl
fst=models/HCLG.fst
word-syms=models/words.txt
phone-syms=models/phones.txt
word-boundary-file=models/word_boundary.int
num-nbest=10
num-phone-alignment=3
do-phone-alignment=true
feature-type=mfcc
mfcc-config=models/conf/mfcc.conf
ivector-extraction-config=models/conf/ivector_extractor.conf
max-active=1000
beam=11.0
lattice-beam=5.0
do-endpointing=true
endpoint-silence-phones="1:2:3:4:5:6:7:8:9:10"
chunk-length-in-secs=0.23
phone-determinize=true
determinize-lattice=true
frames-per-chunk=10
```
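For reference, a rough sketch of how the chain-specific settings above would translate into the decoder section of a kaldi-gstreamer-server worker.yaml; the yaml keys mirror the GStreamer property names listed above, and the models/ paths are placeholders for your own model files:

```
decoder:
    use-threaded-decoder: true
    nnet-mode: 3                  # decode with an nnet3/chain model
    frame-subsampling-factor: 3
    acoustic-scale: 1.0
    frames-per-chunk: 10
    model : models/final.mdl
    fst : models/HCLG.fst
    word-syms : models/words.txt
    mfcc-config : models/conf/mfcc.conf
    ivector-extraction-config : models/conf/ivector_extractor.conf
```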
Thanks for your kind reply. You seem to be using an nnet3 online model, whereas I trained an nnet2 online model. I tried nnet-mode=3 anyway, but the engine crashed. Let me know your thoughts on this. Best, Ting
What's your command to start the scripts?
I am using the GStreamer server and client from https://github.com/alumae/kaldi-gstreamer-server, like this:
Master server:

```
python kaldigstserver/master_server.py --port=8888
```

'onlinegmmdecodefaster'-based worker:

```
python kaldigstserver/worker.py -u ws://localhost:8888/worker/ws/speech -c french_stt.yaml
```
My french_stt.yaml is as follows:

```
use-nnet2: True
decoder:
    use-threaded-decoder: True
    model : test/models/french/librifrench/final.mdl
    word-syms : test/models/french/librifrench/words.txt
    fst : test/models/french/librifrench/HCLG.fst
    mfcc-config : test/models/french/librifrench/conf/mfcc.conf
    ivector-extraction-config : test/models/french/librifrench/conf/ivector_extractor.conf
    max-active: 1000
    beam: 13.0
    lattice-beam: 8.0
    acoustic-scale: 1.0
    frame-subsampling-factor: 3
    do-endpointing : true
    nnet-mode : 2
    endpoint-silence-phones : "1:2:3:4:5"
    traceback-period-in-secs: 0.25
    chunk-length-in-secs: 0.25
    num-nbest: 10
    frames-per-chunk : 10
    out-dir: tmp
use-vad: False
silence-timeout: 10
post-processor: perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\1./;'
logging:
    version : 1
    disable_existing_loggers: False
    formatters:
        simpleFormater:
            format: '%(asctime)s - %(levelname)7s: %(name)10s: %(message)s'
            datefmt: '%Y-%m-%d %H:%M:%S'
    handlers:
        console:
            class: logging.StreamHandler
            formatter: simpleFormater
            level: DEBUG
    root:
        level: DEBUG
        handlers: [console]
```
I've confirmed that test.wav is 16 kHz, 16-bit, mono.
Let me know what you think; I look forward to hearing from you.
This looks fine to me. Just check the parameters you used for your training (acoustic scale and frame-subsampling factor), because I'm not sure about their values in the nnet2 setup. In any case, you'd be better off using a more recent model if you want decent results.
Thanks for your reply. Have you ever compared Kaldi's nnet-latgen-faster and online2-wav-nnet2-latgen-faster? I think the problem may lie in the difference between these two decoding methods. Also, could you tell me more about the more recent models you mentioned?
Just use a chain model; you'll get better results, and the recipe provides far more detail.
I built my French STT model using wsj/s5/local/online/run_nnet2.sh. Thanks, let me try again.
One more thing: do you mean an nnet3 chain model? Could you tell me which script I should use?
Sure, you can retrain a model using tedlium/s5_r3/run.sh. You don't need the rnnlm stuff after stage 18 for your GStreamer application.
Thank you, will try.
Hi Tanel! First of all, thanks for your great work.
I'm using gst-kaldi-nnet2-online to decode French speech. When running the client.py script, I get a fairly accurate transcription with my model, but at some point the decoding stops and starts again a few seconds later. This results in missing words in the output. Here is an example of the result, with the two dots at the end of each sentence corresponding to the missing words:
`bonjour , je m' appelle Jean-Christophe je suis agriculteur dans le Loiret sur une exploitation céréalières . je me suis installé il y a une dizaine d' années ..`