Feature: Speech-to-text module using Vosk, Whisper from audio file sent by ST.

Revision of PR#84. Rather than resolving all the rebase conflicts, I simply injected the new parts into the neo branch.

This module provides speech-to-text from audio files sent by ST, using Vosk or Whisper. To avoid adding too much to server.py, I put the STT modules into their own files and used "add_url_rule" to add the API routes.
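
As a rough illustration of the "add_url_rule" approach, here is a minimal sketch of registering a route from a function defined outside server.py. The handler body, the form-field name "AudioFile", and the helper name register_routes are assumptions made for the example, not the PR's actual code.

```python
from flask import Flask, jsonify, request


def process_audio():
    """Hypothetical handler; the real one would live in the STT module's own file."""
    # "AudioFile" is an assumed form-field name, not necessarily what ST sends.
    uploaded = request.files.get("AudioFile")
    if uploaded is None:
        return jsonify({"error": "no audio file in request"}), 400
    data = uploaded.read()
    # Placeholder: the real module would pass the audio to Vosk or Whisper here.
    return jsonify({"received_bytes": len(data)})


def register_routes(app):
    """Attach the STT route to an existing Flask app without a decorator."""
    app.add_url_rule(
        "/api/speech-recognition/vosk/process-audio",
        view_func=process_audio,
        methods=["POST"],
    )


app = Flask(__name__)
register_routes(app)
```

With this shape, server.py only needs to import the module and call the registration helper, so the route logic stays out of the main file.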

Features

STT providers
- Vosk: open-source STT whose library natively supports streaming voice in real time.
- Whisper: open-source STT with better accuracy than Vosk, but it needs speech detection done beforehand. Currently Vosk is used to cut out the voice segments for Whisper.

What changed

Server arguments
- vosk-stt: activate Vosk module
- whisper-stt: activate Whisper module
- stt-vosk-model-path: path to the Vosk model; if not given, the model is downloaded automatically and stored in the user cache folder.
- stt-whisper-model-path: same, but for the Whisper model. The default models are the smallest English ones, about 100 MB.
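
The arguments above could be declared roughly as follows. This is only a sketch using argparse: the help texts, the comma-separated format of --enable-modules, and the helper names build_parser and enabled_modules are assumptions for illustration, not taken from server.py.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only the STT-related arguments."""
    parser = argparse.ArgumentParser(description="STT argument sketch")
    parser.add_argument(
        "--enable-modules",
        default="",
        help="module names to activate, e.g. vosk-stt (comma-separated is assumed)",
    )
    parser.add_argument(
        "--stt-vosk-model-path",
        default=None,
        help="path to a Vosk model; None means fall back to the default download",
    )
    parser.add_argument(
        "--stt-whisper-model-path",
        default=None,
        help="path to a Whisper model; None means fall back to the default download",
    )
    return parser


def enabled_modules(args):
    """Split the --enable-modules value into a clean list of names."""
    return [name.strip() for name in args.enable_modules.split(",") if name.strip()]
```

A caller would then check whether "vosk-stt" or "whisper-stt" appears in the list returned by enabled_modules before loading the corresponding module.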
New API routes
- /api/speech-recognition/vosk/process-audio: processes the audio file sent in the request using Vosk. The file needs to be converted to a proper WAV format using sounddevice; so far only Firefox sends a compatible file, while Chrome and Edge send incompatible files.
- /api/speech-recognition/whisper/process-audio: processes the audio file sent in the request using Whisper. No conversion is needed; Whisper handles Firefox/Chrome/Edge files directly.
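
To make the browser compatibility issue on the Vosk route easier to diagnose, a check like the one below could be run on the received bytes. It is a sketch that assumes Vosk wants uncompressed mono 16-bit PCM WAV (as in Vosk's own Python examples); the function name is hypothetical and it uses only the standard-library wave module, not the conversion code from this PR.

```python
import io
import wave


def is_vosk_compatible_wav(data: bytes) -> bool:
    """Return True if data looks like uncompressed mono 16-bit PCM WAV."""
    try:
        with wave.open(io.BytesIO(data), "rb") as wf:
            return (
                wf.getnchannels() == 1        # mono
                and wf.getsampwidth() == 2    # 16-bit samples
                and wf.getcomptype() == "NONE"  # no compression
            )
    except (wave.Error, EOFError):
        # Not a WAV container at all (e.g. a WebM/Ogg upload) or a truncated file.
        return False
```

A file that fails this check would need conversion before being handed to Vosk, whereas the Whisper route can take it as is.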
Requirements
- packages
  - sounddevice (to convert the WAV file into the proper format for Vosk, using wave)
  - vosk (for Vosk STT)
  - openai-whisper (for Whisper STT)
- ffmpeg (supposed to be needed by Whisper; not sure whether it installs via pip or needs an external install, I have both)

Tests

Tested only on Windows 11 / Firefox 115 / Chrome 115 / Edge 115.
The received audio file is stored in a file "stt_test.wav", which can be used to assess audio quality or just to check that recording works.
Running Whisper can use about 1 GB of VRAM.
Example commands:
- python server.py --enable-modules=vosk-stt
- python server.py --enable-modules=whisper-stt
- python server.py --enable-modules=vosk-stt --stt-vosk-model-path=modules\stt\vosk-model-en-us-0.22

SillyTavern / SillyTavern-Extras

Feature: Speech-to-text module using Vosk, Whisper from audio file sent by ST. #93
