Feature: Speech-to-text module using Vosk, Whisper

This module provides pseudo-streaming speech-to-text using Vosk and Whisper. I tried not to add too much to server.py, so I put the STT modules into their own files and used "add_url_rule" to add the API routes.
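For readers unfamiliar with "add_url_rule", here is a minimal sketch of how routes can be attached from outside server.py. The helper name `register_stt_routes`, the endpoint names, and the POST method are assumptions made for illustration; they are not taken from this PR. The `app` argument only needs to expose Flask's `add_url_rule(rule, endpoint, view_func, **options)`.

```python
from typing import Callable, Mapping


def register_stt_routes(app, handlers: Mapping[str, Callable[[], str]]) -> None:
    """Attach one record route per STT provider to a Flask-style app.

    `handlers` maps a provider name (e.g. "vosk") to the view function
    that performs the recording and returns the transcript.
    """
    for provider, view in handlers.items():
        app.add_url_rule(
            f"/api/stt/{provider}/record",
            endpoint=f"stt_{provider}_record",  # hypothetical endpoint name
            view_func=view,
            methods=["POST"],  # assumption: the HTTP method is not stated here
        )
```

With a real Flask application this would be called once at startup, for example `register_stt_routes(app, {"vosk": vosk_record})`, where `vosk_record` is a placeholder for the module's own view function.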

Features

STT providers
- Vosk: open-source STT whose library natively supports streaming voice in real time.
- Whisper: open-source STT with better accuracy than Vosk, but it needs speech detection done beforehand. Currently Vosk is used to cut the voice segments for Whisper.

What changed

Server arguments
- vosk-stt: activate Vosk module
- whisper-stt: activate Whisper module
- stt-microphone-id: set the input device for the sounddevice library; if not set, the default microphone is selected and the list of devices is printed.
- stt-vosk-model-path: path to the Vosk model; if not given, the model is downloaded automatically and stored in the user cache folder.
- stt-whisper-model-path: same, but for the Whisper model. The default models are the smallest English ones, about 100 MB.
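As a rough illustration of how these options could be declared with argparse, here is a minimal sketch. The help strings, the defaults, and the comma-splitting of `--enable-modules` are assumptions for the example, not the actual server.py code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the STT-related command-line options described above."""
    parser = argparse.ArgumentParser()
    # Assumption: module names are passed as one comma-separated value.
    parser.add_argument(
        "--enable-modules",
        type=lambda s: [m for m in s.split(",") if m],
        default=[],
        help="modules to enable, e.g. vosk-stt,whisper-stt",
    )
    parser.add_argument(
        "--stt-microphone-id",
        type=int,
        default=None,  # None means: use the default input device
        help="input device index for sounddevice",
    )
    parser.add_argument(
        "--stt-vosk-model-path",
        default=None,  # None means: download the default model
        help="path to a local Vosk model",
    )
    parser.add_argument(
        "--stt-whisper-model-path",
        default=None,  # None means: download the default model
        help="path to a local Whisper model",
    )
    return parser
```

argparse accepts both `--stt-microphone-id=1` and `--stt-microphone-id 1`, so either spelling works on the command line.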

New API routes
- /api/stt/vosk/record: starts a recording of the user's microphone using the sounddevice library. Raw audio blocks of fixed size are stored in a queue by a callback running in a parallel thread. Vosk processes the queue block by block until it detects the end of speech. The finished transcript is returned as a string.
- /api/stt/whisper/record: fairly trivial for now. It uses Vosk as above for speech capture, then saves the complete audio to a file that is processed by Whisper. The Whisper transcript is returned; the Vosk transcript is only printed as debug info.
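The two routes can be sketched roughly as follows. This is a simplified illustration, not the PR's code: the function names are hypothetical, the 16 kHz mono 16-bit format is an assumption, and `recognizer` is any object that behaves like `vosk.KaldiRecognizer` (`AcceptWaveform(data)` is truthy once an utterance is complete, `Result()` returns a JSON string with a "text" field).

```python
import json
import queue
import wave
from typing import Tuple


def record_until_end_of_speech(blocks: "queue.Queue[bytes]", recognizer) -> Tuple[str, bytes]:
    """Feed queued audio blocks to the recognizer until it reports end of speech.

    Returns the transcript together with all audio consumed so far, so the
    same audio can later be handed to a file-based model.
    """
    captured = bytearray()
    while True:
        block = blocks.get()  # waits until the audio callback pushes a block
        captured.extend(block)
        if recognizer.AcceptWaveform(block):
            text = json.loads(recognizer.Result()).get("text", "")
            return text, bytes(captured)


def save_wav(path: str, audio: bytes, samplerate: int = 16000) -> None:
    """Write mono 16-bit PCM audio to a WAV file (format is an assumption)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(samplerate)
        wav.writeframes(audio)
```

In a live setup the queue would presumably be filled from a `sounddevice.RawInputStream` callback (for example `blocks.put(bytes(indata))`), and the saved file could then be passed to Whisper with something like `whisper.load_model(...).transcribe(path)["text"]`.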

Requirements
- packages
  - sounddevice (microphone audio capture)
  - vosk (for Vosk STT)
  - openai-whisper (for Whisper STT)
- ffmpeg (supposed to be needed by Whisper; not sure whether it is installed via pip or needs an external install, I have both)

Tests

Tested only on Windows 11.
The audio recording of the last message is stored in the file "stt_test.wav", which can be used to assess audio quality or simply to check that recording works.
Unplugging or plugging in an additional device should only raise a stream error, which is caught during audio processing.
Running Whisper can use about 1 GB of VRAM.
Example commands:
- python server.py --enable-modules=vosk-stt
- python server.py --enable-modules=whisper-stt
- python server.py --enable-modules=whisper-stt --stt-microphone-id=1
- python server.py --enable-modules=vosk-stt --stt-vosk-model-path=modules\stt\vosk-model-en-us-0.22 --stt-microphone-id 0

SillyTavern / SillyTavern-Extras

Feature: Speech-to-text module using Vosk, Whisper #84
