alumae / kaldi-gstreamer-server

Real-time full-duplex speech recognition server, based on the Kaldi toolkit and the GStreamer framework.
BSD 2-Clause "Simplified" License
1.07k stars 341 forks

Streaming Chunked Transfer Encoding #159

Closed by dpny518 5 years ago

dpny518 commented 5 years ago

This command seems to work well with a file:

curl -v -T test/data/english_test.raw -H "Content-Type: audio/x-raw-int; rate=16000" --header "Transfer-Encoding: chunked" --limit-rate 32000 "http://localhost:8888/client/dynamic/recognize"

But one limitation is that RTF > 1, which is fine. However, suppose a user speaks for 10 seconds and stops, and only then do we send the data up: it takes another 10+ seconds before we get the result back. Ideally we would send the chunks up as the user speaks, so that the results are almost back as soon as they stop speaking (RTF ~ 1). So, based on the curl command above, how can we work with system mics and handle end of speech and decoding?
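
For reference, the arithmetic behind `--limit-rate 32000` can be sketched as follows. This is only an illustration, assuming 16 kHz, 16-bit, mono PCM (which matches `rate=16000` in the Content-Type header); the function names are hypothetical and not part of the server or of curl. The latency model is a deliberate simplification that ignores network and endpointing delays.

```python
def pcm_bytes_per_second(sample_rate_hz, bits_per_sample=16, channels=1):
    """Byte rate of uncompressed PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def wait_after_speech(duration_s, rtf, streamed):
    """Rough wait (in seconds) between end of speech and the final result.

    Simplified model (an assumption): processing an utterance costs
    rtf * duration_s seconds; network and endpointing delays are ignored.
    """
    processing_s = rtf * duration_s
    if streamed:
        # Processing overlaps with speaking, so only the backlog remains.
        return max(0.0, processing_s - duration_s)
    # Processing starts only after the user has finished speaking.
    return processing_s


if __name__ == "__main__":
    # 16000 samples/s * 2 bytes/sample * 1 channel
    print(pcm_bytes_per_second(16000))  # 32000
    print(wait_after_speech(10, 1.0, streamed=False))
    print(wait_after_speech(10, 1.0, streamed=True))
```

Under these assumptions, 32000 bytes per second is exactly the real-time byte rate of the audio, so limiting curl to that rate paces a file upload roughly the way a live microphone would.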

dpny518 commented 5 years ago

This seems to work, but the response shows up in the master faster than in the worker, and then it takes even longer to get back to the terminal running curl:

arecord -f S16_LE -r 16000 | curl -v -T - -H "Content-Type: audio/x-raw-int; rate=16000" --header "Transfer-Encoding: chunked" --limit-rate 32000 "http://localhost:8888/client/dynamic/recognize"
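
One point that may be relevant here: with `Transfer-Encoding: chunked`, the request body ends only when a zero-length chunk is sent, and curl reading from standard input (`-T -`) sends it once the input reaches end of file. `arecord` without a duration limit keeps recording until it is interrupted, so the pipe never closes on its own. The sketch below shows the HTTP/1.1 chunked framing itself; the helper names are hypothetical and are not part of curl or the server.

```python
def encode_chunk(data: bytes) -> bytes:
    """Frame one non-empty chunk: hex length, CRLF, payload, CRLF."""
    if not data:
        raise ValueError("an empty chunk would signal end of stream")
    return b"%x\r\n" % len(data) + data + b"\r\n"


# The zero-length chunk that terminates a chunked body (no trailers).
LAST_CHUNK = b"0\r\n\r\n"


def encode_stream(chunks):
    """Yield the framed form of each chunk, then the terminating chunk."""
    for chunk in chunks:
        if chunk:  # skip empty reads so the stream is not ended early
            yield encode_chunk(chunk)
    yield LAST_CHUNK


if __name__ == "__main__":
    print(b"".join(encode_stream([b"abc", b"de"])))
```

In other words, over the HTTP interface the end of the audio is signalled by closing the input, not by a special payload.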

alumae commented 5 years ago

Sorry, but I couldn't understand the problem that you are having. Could you explain it in more detail?

dpny518 commented 5 years ago

Basically, if I use the HTTP curl command or client.py, I can send an audio file; it sends EOS and everything works well. But if I pipe in the mic through standard input, how can I send the EOS bytes as soon as I stop speaking?

alumae commented 5 years ago

OK, basically you want to stop recording when you detect that the user has stopped speaking? You can use the 'silence' effect of sox (and rec) for this, e.g.:

rec -q -b 16 -c 1 -r 16000 -t wav - silence 0 1 3.0 3% | python2 kaldigstserver/client.py -r 32000

Or using the HTTP interface:

rec -q -b 16 -c 1 -r 16000 -t wav - silence 0 1 3.0 3% | curl -v -T - -H "Content-Type: audio/x-raw-int; rate=16000" --header "Transfer-Encoding: chunked" --limit-rate 32000 "http://localhost:8888/client/dynamic/recognize"
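
If sox is not available, the same idea can be approximated in a small filter. The following is a minimal sketch, assuming raw 16-bit little-endian mono PCM on the input stream; it passes audio through until about 3 seconds in a row stay below 3% of full scale, then stops, which closes the pipe for whatever reads from it. It uses peak amplitude per block, which may differ from how sox measures the level, and the function names and defaults are illustrative only.

```python
import array
import sys


def peak_fraction(chunk: bytes) -> float:
    """Peak amplitude of 16-bit PCM as a fraction of full scale (0.0 to 1.0)."""
    samples = array.array("h")
    samples.frombytes(chunk[: len(chunk) - len(chunk) % 2])  # drop odd byte
    if sys.byteorder == "big":
        samples.byteswap()  # input is assumed to be little-endian (S16_LE)
    if not samples:
        return 0.0
    return max(abs(s) for s in samples) / 32768.0


def until_silence(stream, rate=16000, chunk_ms=100, silence_s=3.0, threshold=0.03):
    """Yield audio chunks until silence_s seconds in a row are below threshold."""
    chunk_bytes = (rate * chunk_ms // 1000) * 2  # 2 bytes per sample
    silence_limit = int(silence_s * rate)        # counted in samples
    quiet_samples = 0
    while quiet_samples < silence_limit:
        chunk = stream.read(chunk_bytes)
        if not chunk:  # input ended before any long silence
            break
        yield chunk
        if peak_fraction(chunk) < threshold:
            quiet_samples += len(chunk) // 2
        else:
            quiet_samples = 0  # speech resumed, restart the count
```

A wrapper script could read from `sys.stdin.buffer`, write each yielded chunk to `sys.stdout.buffer`, and sit between `arecord -f S16_LE -r 16000 -t raw` and the curl command; when the generator finishes and the script exits, curl sees end of file and completes the request.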

dpny518 commented 5 years ago

thank you