coqui-ai / STT

🐸STT - The deep learning toolkit for Speech-to-Text. Training and deploying STT models has never been so easy.
https://coqui.ai
Mozilla Public License 2.0

Feature request: Streaming inference with ffmpeg local port #2298

Closed RESDXChgfore9hing closed 1 year ago

RESDXChgfore9hing commented 1 year ago

If you have a feature request, then please provide the following information:

A clear and concise description of what the problem is. I'm curious whether there is a less hackish way to use the client CLI directly to run inference on a real-time audio stream.

Describe the solution you'd like Ideally, a simple flag on the CLI, with sufficient information passed to it, e.g. a port number.

Describe alternatives you've considered Directly piping the stream from ffmpeg into STT.
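The ffmpeg-pipe alternative can be sketched in Python: ffmpeg decodes any source (file, URL, RTMP) to raw 16 kHz mono s16le PCM on stdout, and the chunks are fed into the Coqui STT streaming API. This is a sketch, not part of the project; the model path and source URL are placeholders, and it assumes the `stt` Python package plus an ffmpeg binary on PATH.

```python
import subprocess

SAMPLE_RATE = 16000          # Coqui STT models expect 16 kHz mono 16-bit PCM
CHUNK_SECONDS = 0.5
CHUNK_BYTES = int(SAMPLE_RATE * 2 * CHUNK_SECONDS)  # 2 bytes per s16le sample


def ffmpeg_cmd(source):
    """Build an ffmpeg command decoding `source` (file, URL, rtmp://...)
    to raw 16 kHz mono s16le PCM on stdout."""
    return ["ffmpeg", "-i", source, "-f", "s16le",
            "-ac", "1", "-ar", str(SAMPLE_RATE), "pipe:1"]


def transcribe_stream(source, model_path):
    """Pipe decoded audio from ffmpeg into an STT stream and return the text."""
    # Imported here so the sketch stays importable without stt/numpy installed.
    import numpy as np
    from stt import Model

    model = Model(model_path)
    stream = model.createStream()
    proc = subprocess.Popen(ffmpeg_cmd(source), stdout=subprocess.PIPE)
    while True:
        chunk = proc.stdout.read(CHUNK_BYTES)
        if not chunk:
            break
        stream.feedAudioContent(np.frombuffer(chunk, dtype=np.int16))
    proc.wait()
    return stream.finishStream()


if __name__ == "__main__":
    # Placeholder source and model path.
    print(transcribe_stream("rtmp://localhost/live/stream", "model.tflite"))
```

This keeps ffmpeg responsible for all demuxing/resampling, so the Python side only ever sees a fixed PCM format.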

Additional context

wasertech commented 1 year ago

Take a look at https://github.com/coqui-ai/STT-examples/tree/r1.0/ffmpeg_vad_streaming

RESDXChgfore9hing commented 1 year ago

It uses the JS VAD packages, right? I assume STT itself doesn't support VAD natively, so we need to handle the VAD logic ourselves?

wasertech commented 1 year ago

It uses the JS VAD packages, right?

This particular example uses Node.js, yes.

I assume STT itself doesn't support VAD natively, so we need to handle the VAD logic ourselves?

There are multiple ways to handle voice activity detection, from using a dedicated library like webrtcVAD to not handling VAD at all and streaming continuously. It's specific to your needs, so STT doesn't force any particular option on you.
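As a sketch of the dedicated-library route: webrtcVAD classifies fixed-size frames (10, 20, or 30 ms of 16-bit PCM) as speech or non-speech, so the caller's only real job is splitting the stream into frames. The frame splitting below is plain Python; the `webrtcvad` usage in the guarded section is an assumption (pip package `webrtcvad`) and `audio.raw` is a placeholder file.

```python
SAMPLE_RATE = 16000
FRAME_MS = 30  # webrtcVAD accepts only 10, 20 or 30 ms frames
FRAME_BYTES = SAMPLE_RATE * 2 * FRAME_MS // 1000  # 16-bit mono samples


def frames(pcm: bytes):
    """Split raw s16le PCM into fixed-size VAD frames, dropping a short tail."""
    for off in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[off:off + FRAME_BYTES]


if __name__ == "__main__":
    import webrtcvad  # assumption: pip install webrtcvad
    vad = webrtcvad.Vad(2)  # aggressiveness 0 (lenient) .. 3 (strict)
    with open("audio.raw", "rb") as f:  # placeholder raw PCM file
        pcm = f.read()
    for frame in frames(pcm):
        print(vad.is_speech(frame, SAMPLE_RATE))
```

A typical pattern on top of this is to only feed frames into the STT stream while `is_speech` stays true, finishing the stream after a run of silence.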

RESDXChgfore9hing commented 1 year ago

Ahh, I see. OK, my bad: VAD is a separate thing, and STT by default just consumes whatever is fed into it (continuous by default), so we need another component to do the VAD. After re-reading the example, I see the RTMP stream is handled directly by ffmpeg, which then calls the STT JS API to run the inference. Is there an equivalent to ffmpeg > STT on the CLI, without using the API? Or is the API just some kind of CLI command builder?

wasertech commented 1 year ago

Is there an equivalent to ffmpeg > STT on the CLI?

Please don't. Using a shell script to achieve this is not appropriate.

I suggest a Python script you can call from the shell that handles recording and feeds the audio directly to STT.
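A minimal skeleton of such a shell-callable script might look like the following. This is a hypothetical sketch, not the `listen` tool mentioned below: the program name, flags, and default model path are all invented for illustration, and the actual recording/STT pipeline is left as a stub.

```python
import argparse


def build_parser():
    """Hypothetical CLI: transcribe a WAV file, or record from the mic."""
    p = argparse.ArgumentParser(
        prog="transcribe",  # invented name
        description="Stream audio into Coqui STT from the shell")
    p.add_argument("-f", "--file",
                   help="WAV file to transcribe instead of the microphone")
    p.add_argument("-m", "--model", default="model.tflite",
                   help="path to the STT model (placeholder default)")
    p.add_argument("--aggressive", type=int, choices=range(4), default=1,
                   help="VAD aggressiveness, 0-3")
    return p


def main(argv=None):
    args = build_parser().parse_args(argv)
    # Hand off to the actual recording/VAD/STT pipeline here (not shown),
    # e.g. feed VAD-filtered frames into model.createStream().
    return args


if __name__ == "__main__":
    main()
```

Wrapping the pipeline in argparse like this is what makes it feel like a CLI flag (`transcribe -f audio.wav`) while keeping all the audio handling in Python rather than shell plumbing.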

I wrote listen to do exactly that. It handles VAD and all supported languages, and it's easy to use.

❯ listen --help
usage: listen [-h] [-f FILE] [--aggressive {0,1,2,3}] [-d MIC_DEVICE]
                   [-w SAVE_WAV]

Transcribe long audio files using webRTC VAD or use the streaming interface
from a microphone

options:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  Path to the audio file to run (WAV format)
  --aggressive {0,1,2,3}
                        Determines how aggressively non-speech is filtered
                        out. (Integer between 0-3)
  -d MIC_DEVICE, --mic_device MIC_DEVICE
                        Device input index (Int) as listed by
                        pyaudio.PyAudio.get_device_info_by_index(). If not
                        provided, falls back to PyAudio.get_default_device().
  -w SAVE_WAV, --save_wav SAVE_WAV
                        Path to directory where to save recorded sentences
  --debug               Show debug info

It's a mix of mic_vad_streaming, vad_transcriber and python_websocket_server from STT-examples.

I'll close your issue and convert it to a discussion about streaming audio from the CLI instead.