alumae / kaldi-gstreamer-server

Real-time full-duplex speech recognition server, based on the Kaldi toolkit and the GStreamer framework.
BSD 2-Clause "Simplified" License

implementing "continuous" decoder client #17

Closed mosherayman closed 9 years ago

mosherayman commented 9 years ago

I am implementing a continuous decoder client in iOS.

It takes microphone input and uses the websocket protocol to send it to the GStreamer decoder.

A few questions:

1) I notice I get many "final:false" responses that have the same hypothesis. Is there an option to have the server not send out identical responses? Do you recommend that this be implemented?

2) Will the server break up utterances based on silence? If so, how do you set the appropriate parameters?

3) Related: will the server send a final result even if I do not send the EOS string?

alumae commented 9 years ago

1) Currently, the server sends out intermediate hypotheses every 0.5 seconds (changeable using the traceback-period-in-secs decoder parameter). I think it's a good idea not to send out a non-final hypothesis if it hasn't changed; I'll look into it.
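Until the server suppresses unchanged partials itself, this can be done on the client side. Below is a minimal sketch of such a filter; it assumes the server's JSON responses have the shape {"status": 0, "result": {"final": ..., "hypotheses": [{"transcript": ...}]}} (adjust the keys if your responses differ):

```python
import json

class HypothesisFilter:
    """Drop repeated identical non-final hypotheses on the client side."""

    def __init__(self):
        self._last_partial = None

    def accept(self, message):
        """Return True if this server message carries new information."""
        resp = json.loads(message)
        result = resp.get("result")
        if result is None:
            return True  # status-only messages pass through unchanged
        transcript = result["hypotheses"][0]["transcript"]
        if result.get("final"):
            self._last_partial = None  # reset for the next utterance
            return True
        if transcript == self._last_partial:
            return False  # identical partial hypothesis, suppress it
        self._last_partial = transcript
        return True
```

The filter keeps only the most recent partial transcript, so memory use is constant no matter how long the stream runs.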

2) Yes, it breaks up speech based on silence when the do-endpointing decoder parameter is set to true (see the sample_english_nnet2.yaml). There are many parameters that can be set to change exactly how this endpointing is done; check the endpoint* parameters of the decoder (I'm assuming you use the new DNN-based decoder).

3) If you don't send EOS, the decoder assumes that more audio is coming and waits (until silence-timeout seconds pass, after which the server closes the connection).
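The client-side protocol is then: stream the audio as binary frames, and finish with the literal text frame "EOS". A small helper can make that ordering hard to get wrong; the chunk size below is an illustrative choice (8000 bytes is 0.25 s of 16 kHz 16-bit mono audio), not a server requirement:

```python
def audio_messages(pcm_bytes, chunk_size=8000):
    """Yield the websocket messages for one audio stream: binary audio
    chunks followed by the literal string "EOS"."""
    for offset in range(0, len(pcm_bytes), chunk_size):
        # each slice should be sent as a binary websocket frame
        yield pcm_bytes[offset:offset + chunk_size]
    # sent as a text frame; tells the server no more audio is coming,
    # so it can emit the last final result without waiting for a timeout
    yield "EOS"
```

A client would iterate over this generator and call its websocket library's send for each item, using a binary opcode for the bytes and a text opcode for "EOS".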

mosherayman commented 9 years ago

thanks

I don't see the endpoint* parameters in the decoder; which file should I look in?

alumae commented 9 years ago

If you use the Kaldi DNN-based decoder (https://github.com/alumae/gst-kaldi-nnet2-online), then the properties specified in the configuration YAML file, nested under decoder, are forwarded to the plugin. To see which properties are available, use gst-inspect-1.0 kaldinnet2onlinedecoder.

The properties that change the way endpointing is done are:

do-endpointing: If true, apply endpoint detection, and split the audio at endpoints
endpoint-silence-phones: List of phones that are considered to be silence phones by the endpointing code.
endpoint-rule1-must-contain-nonsilence: If true, for this endpointing rule to apply there must be nonsilence in the best-path traceback.
endpoint-rule1-min-trailing-silence: This endpointing rule requires duration of trailing silence to be >= this value.
endpoint-rule1-max-relative-cost: This endpointing rule requires relative-cost of final-states to be <= this value (describes how good the probability of final-states is).
endpoint-rule1-min-utterance-length: This endpointing rule requires utterance-length (in seconds) to be >= this value.
endpoint-rule2-must-contain-nonsilence: If true, for this endpointing rule to apply there must be nonsilence in the best-path traceback.
endpoint-rule2-min-trailing-silence: This endpointing rule requires duration of trailing silence to be >= this value.
endpoint-rule2-max-relative-cost: This endpointing rule requires relative-cost of final-states to be <= this value (describes how good the probability of final-states is).
endpoint-rule2-min-utterance-length: This endpointing rule requires utterance-length (in seconds) to be >= this value.
endpoint-rule3-must-contain-nonsilence: If true, for this endpointing rule to apply there must be nonsilence in the best-path traceback.
endpoint-rule3-min-trailing-silence: This endpointing rule requires duration of trailing silence to be >= this value.
endpoint-rule3-max-relative-cost: This endpointing rule requires relative-cost of final-states to be <= this value (describes how good the probability of final-states is).
endpoint-rule3-min-utterance-length: This endpointing rule requires utterance-length (in seconds) to be >= this value.
endpoint-rule4-must-contain-nonsilence: If true, for this endpointing rule to apply there must be nonsilence in the best-path traceback.
endpoint-rule4-min-trailing-silence: This endpointing rule requires duration of trailing silence to be >= this value.
endpoint-rule4-max-relative-cost: This endpointing rule requires relative-cost of final-states to be <= this value (describes how good the probability of final-states is).
endpoint-rule4-min-utterance-length: This endpointing rule requires utterance-length (in seconds) to be >= this value.
endpoint-rule5-must-contain-nonsilence: If true, for this endpointing rule to apply there must be nonsilence in the best-path traceback.
endpoint-rule5-min-trailing-silence: This endpointing rule requires duration of trailing silence to be >= this value.
endpoint-rule5-max-relative-cost: This endpointing rule requires relative-cost of final-states to be <= this value (describes how good the probability of final-states is).
endpoint-rule5-min-utterance-length: This endpointing rule requires utterance-length (in seconds) to be >= this value.
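In the YAML configuration these names go directly under the decoder key. A hypothetical excerpt in the style of sample_english_nnet2.yaml (the property names match the list above, but the values here are illustrative, not the plugin's defaults, and the silence-phone IDs are model-dependent):

```yaml
decoder:
    do-endpointing: true
    # colon-separated phone IDs treated as silence; depends on your model
    endpoint-silence-phones: "1:2:3:4:5"
    # example tuning: end the utterance sooner when a confident final
    # state has been reached and 0.5 s of trailing silence is observed
    endpoint-rule2-min-trailing-silence: 0.5
```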

You might have to dig into the Kaldi sources to understand exactly what the different properties do. The defaults, however, are pretty good.