alumae / kaldi-gstreamer-server

Real-time full-duplex speech recognition server, based on the Kaldi toolkit and the GStreamer framework.
BSD 2-Clause "Simplified" License

gstkaldinnet2onlinedecoder vs online2-tcp-nnet3-decoder-faster #241

Closed Umar17 closed 4 years ago

Umar17 commented 4 years ago

Hi,

I just experimented with online decoding using online2-tcp-nnet3-decode-faster, which I was previously doing with kaldinnet2onlinedecoder (through kaldi-gstreamer-server). I saw about 3 times faster decoding with online2-tcp-nnet3-decode-faster. I went through the code of both decoders and they seem to work fairly identically. Can you please explain why the latter is faster? Is it my mistake or something else?

PS: parameters (like beam, lattice-beam and max-active) were kept identical for both decoders.

Best Regards Umar

alumae commented 4 years ago

You are probably using a chain model and are missing the attribute frame-subsampling-factor: 3 under the decoder section in the YAML file.
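For reference, the setting goes under the decoder section of the worker YAML; a minimal sketch (all other decoder options omitted):

    decoder:
        # required for chain (nnet3) models trained with frame subsampling
        frame-subsampling-factor: 3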

Umar17 commented 4 years ago

Yes, I am using a chain model, but the frame-subsampling-factor option is in place. Attached is my YAML file.


use-nnet2: True
decoder:
    use-threaded-decoder: True
    nnet-mode: 3
    model: /home/cle-26/Downloads/kaldi-gstreamer-server-master/nnet3_chain/final.mdl
    word-syms: /home/cle-26/Downloads/kaldi-gstreamer-server-master/nnet3_chain/words.txt
    fst: /home/cle-26/Downloads/kaldi-gstreamer-server-master/nnet3_chain/HCLG.fst
    mfcc-config: /home/cle-26/Downloads/kaldi-gstreamer-server-master/nnet3_chain/conf/mfcc.conf
    ivector-extraction-config: /home/cle-26/Downloads/kaldi-gstreamer-server-master/nnet3_chain/conf/ivector_extractor.conf
    max-active: 10000
    beam: 10.0
    lattice-beam: 6.0
    acoustic-scale: 1.0
    do-endpointing: true
    endpoint-silence-phones: "1:2:3:4:5:6:7:8:9:10"
    traceback-period-in-secs: 0.01
    chunk-length-in-secs: 0.25
    frame-subsampling-factor: 3
    num-nbest: 10

# Additional functionality that you can play with:

#lm-fst:  test/models/english/librispeech_nnet_a_online/G.fst
#big-lm-const-arpa: test/models/english/librispeech_nnet_a_online/G.carpa
phone-syms: /home/cle-26/Downloads/kaldi-gstreamer-server-master/nnet3_chain/phones.txt
#word-boundary-file: test/models/english/librispeech_nnet_a_online/word_boundary.int
#do-phone-alignment: true

out-dir: tmp/urdu

use-vad: False
silence-timeout: 60

post-processor: perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\1./;'

logging:
    version: 1
    disable_existing_loggers: False
    formatters:
        simpleFormater:
            format: '%(asctime)s - %(levelname)7s: %(name)10s: %(message)s'
            datefmt: '%Y-%m-%d %H:%M:%S'
    handlers:
        console:
            class: logging.StreamHandler
            formatter: simpleFormater
            level: DEBUG
    root:
        level: DEBUG
        handlers: [console]


And the client command is: python kaldigstserver/client.py -r 32000 c2a.wav, where the sample wave file is sampled at 16 kHz.

Umar17 commented 4 years ago

I have tweaked frame-subsampling-factor, but surprisingly it has no effect on latency.

alumae commented 4 years ago

Can you give some numbers -- the actual difference in decoding time that you are seeing?

I assume you understand that the -r 32000 option in client.py means that the audio is sent to the server at this byte rate. If the wav indeed uses 16 kHz 16-bit encoding, then decoding cannot complete faster than realtime, as the audio is sent to the server at a rate that simulates realtime recording from the mic.
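To make the arithmetic concrete: 16 kHz mono audio at 16 bits per sample produces 16000 × 2 = 32000 bytes per second, so -r 32000 paces the upload at exactly realtime. A minimal sketch of such throttling (simplified, not the actual client.py implementation):

    import time

    SAMPLE_RATE = 16000       # Hz
    BYTES_PER_SAMPLE = 2      # 16-bit PCM
    REALTIME_BYTE_RATE = SAMPLE_RATE * BYTES_PER_SAMPLE   # 32000 bytes/s

    def stream_audio(raw_pcm, byte_rate, send):
        # Send raw PCM in quarter-second chunks, throttled to byte_rate;
        # byte_rate == 32000 simulates realtime recording from a mic.
        chunk = byte_rate // 4
        for i in range(0, len(raw_pcm), chunk):
            send(raw_pcm[i:i + chunk])
            time.sleep(0.25)

With such throttling, the upload alone takes as long as the audio itself, so total latency can never drop below the audio duration.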

Umar17 commented 4 years ago

Numbers (in milliseconds):
Audio length: 4923
Latency (with -r 32000): 5801
Latency (with -r 256000): 2965
Latency (online2-tcp-nnet3-decode-faster): 1343

Yes, I understand the byte rate, and I experimented with -r 256000 as well, which should send the whole audio within the first second (the intuition is to imitate the client for online2-tcp-nnet3-decode-faster, which feeds the whole audio and then half-shuts down the socket connection). It doesn't affect accuracy and improves efficiency a bit.
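A back-of-the-envelope check of those upload times, assuming c2a.wav contains 16 kHz, 16-bit, mono PCM (these figures are estimates derived from the numbers above, not measurements):

    # 4.923 s of 16 kHz 16-bit mono PCM
    audio_bytes = 4.923 * 16000 * 2            # ~157,500 bytes

    for byte_rate in (32000, 256000):
        print(f"-r {byte_rate}: upload alone takes ~{audio_bytes / byte_rate:.2f} s")

    # -r 32000:  ~4.92 s  (the upload itself is realtime)
    # -r 256000: ~0.62 s  (whole file arrives in well under a second)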

alumae commented 4 years ago

Try changing to traceback-period-in-secs: 0.25.

Umar17 commented 4 years ago

Tried it, but no effect. However, averaging multiple experiments gives a difference of ~1 second in latency between -r 256000 and the TCP decoder. I think the latency increases in the gstreamer case because of the server-worker-decoder architecture, where communication is slower than in the case of the online2-tcp-nnet3-decode-faster server. If so, this issue can be closed.