SillyTavern / SillyTavern-Extras

Extensions API for SillyTavern.
GNU Affero General Public License v3.0

Speech Recognition Streaming only transcribing "You" #187

Open theman23290 opened 12 months ago

theman23290 commented 12 months ago

When speech recognition is in streaming mode, the transcribed output is always "you". It uses whisper for the transcription. When I select whisper directly and click on the microphone, it works perfectly. But when streaming, only the word "you" shows up in the terminal, even if I don't say anything. I can confirm the microphone is activated when recording the audio. I have used SillyTavern on Windows 11, Debian, and Modded Debian with the same result. Any recommendations on what I can do to resolve this? I am on the latest ffmpeg, running the latest Extras in conda, and have enough horsepower to run the Extras program as intended.

theman23290 commented 12 months ago

This seems to be related to this Whisper discussion: https://github.com/openai/whisper/discussions/679 TLDR: Implement --condition_on_previous_text and VAD, and the issues go away. Is there any way to implement that fix in this project?
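For illustration, here is a minimal sketch of what that suggestion could look like. It is not the Extras implementation: the helper names (`rms_level`, `has_speech`, `transcribe_if_speech`) and the 0.01 energy threshold are made up for this example, and the energy gate is only a crude stand-in for a real VAD. The `condition_on_previous_text`, `no_speech_threshold` and `fp16` arguments are options of the openai-whisper `transcribe` call; the values shown are assumptions to tune, not known-good settings.

```python
import math
import struct
import wave


def rms_level(pcm_bytes: bytes) -> float:
    """Return the RMS amplitude of 16-bit little-endian PCM, scaled to 0..1."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    mean_square = sum(s * s for s in samples) / count
    return math.sqrt(mean_square) / 32768.0


def has_speech(pcm_bytes: bytes, threshold: float = 0.01) -> bool:
    """Crude energy gate (hypothetical threshold); a real VAD would be more robust."""
    return rms_level(pcm_bytes) >= threshold


def read_wav_pcm(path: str) -> bytes:
    """Read all frames of a 16-bit PCM WAV file as raw bytes."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM audio")
        return wav.readframes(wav.getnframes())


def transcribe_if_speech(path: str, model_name: str = "base") -> str:
    """Skip whisper entirely when the recording is (near) silent."""
    if not has_speech(read_wav_pcm(path)):
        return ""  # nothing to transcribe, so no hallucinated "you"

    import whisper  # imported lazily so the gate above works without it

    model = whisper.load_model(model_name)
    result = model.transcribe(
        path,
        condition_on_previous_text=False,  # do not feed earlier output back in
        no_speech_threshold=0.6,           # assumed value; tune for your setup
        fp16=False,                        # avoids the FP16 warning on CPU
    )
    return result["text"].strip()
```

Calling `transcribe_if_speech("stt_test.wav")` on a silent recording would return an empty string without loading the model at all. Whether this removes the "you" output in streaming mode depends on how quiet the captured audio really is.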

Cohee1207 commented 12 months ago

That's for @Tony-sama to consider.

Cohee1207 commented 12 months ago

Check the recent commit. Is that what you asked for?

theman23290 commented 12 months ago

I believe so. The commit still didn't fix the original issue though. I don't know if this is a whisper issue or if it is an issue with how whisper is implemented in this code. Here is the output on the terminal while a client is connected through the API.

/home/senpai/miniconda/envs/extras/lib/python3.11/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Transcripted from audio file (whisper): you
172.18.0.2 - - [19/Nov/2023 21:31:21] "POST /api/speech-recognition/streaming/record-and-transcript HTTP/1.1" 200 -
172.18.0.2 - - [19/Nov/2023 21:31:21] "OPTIONS /api/speech-recognition/streaming/record-and-transcript HTTP/1.1" 200 -
Start recording from: default with samplerate 44100
Transcripted from microphone stream (vosk):
Recorded message saved to stt_test.wav
/home/senpai/miniconda/envs/extras/lib/python3.11/site-packages/whisper/transcribe.py:115: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Transcripted from audio file (whisper): you
172.18.0.2 - - [19/Nov/2023 21:31:27] "POST /api/speech-recognition/streaming/record-and-transcript HTTP/1.1" 200 -
172.18.0.2 - - [19/Nov/2023 21:31:27] "OPTIONS /api/speech-recognition/streaming/record-and-transcript HTTP/1.1" 200 -
Start recording from: default with samplerate 44100
Transcripted from microphone stream (vosk):
Recorded message saved to stt_test.wav

It repeats this output until the client disconnects. I don't know where the bug is. From what I have read, it looks more like an issue with the way whisper is implemented.

Statford commented 9 months ago

Hi, I had the same problem and all I did was leave it for a week, reboot it, and it (for whatever reason) worked perfectly after that. I wish I could be more helpful than that, but I had the same problem with my installation of whisper. https://github.com/SillyTavern/SillyTavern-Extras/issues/217