VRCWizard / TTS-Voice-Wizard

Speech to Text to Speech. Song now playing. Sends text as OSC messages to VRChat to display on avatar. (STTTS) (Speech to TTS) (VRC STT System) (VTuber TTS)
https://TTSVoiceWizard.com
MIT License

Whisper unstability #45

Closed by chrisk414 1 year ago

chrisk414 commented 1 year ago

Hi, STT (Whisper) is the biggest use-case for me. I think it's probably the most important feature for now until I can use it reliably. Hopefully, it's the same for everyone as it's the starting point for using TTSVoiceWizard.

Anyway, here is what I found using the latest v1.5.0 from the GitHub main branch.

In the Log View, I see the new "Whisper Debug: ..." output. When STT mode is on, it always shows one of the following, seemingly at random. I think it's clear what each one means.

(A) "Listening" (listening and there is no sound input)
(B) "Listening, Voice" (listening and sound input is detected)
(C) "Listening, Transcribing" (processing recorded voice)
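
To make the three labels concrete, here is a minimal Python sketch of how such a label could be derived from two flags. This is not TTS Voice Wizard's actual code (the project is not written in Python), and the function name and both parameters are hypothetical; the only things taken from the log are the three label strings.

```python
def whisper_debug_label(sound_detected: bool, transcribing: bool) -> str:
    """Build a "Whisper Debug" style label from two hypothetical flags.

    The flags are assumptions for illustration only; the label strings
    match states (A), (B), and (C) as they appear in the log.
    """
    parts = ["Listening"]
    if transcribing:
        # (C): recorded audio is currently being processed.
        parts.append("Transcribing")
    elif sound_detected:
        # (B): some sound input is being picked up.
        parts.append("Voice")
    # (A) is the bare "Listening" label when neither flag is set.
    return ", ".join(parts)
```

With a mapping like this, a label that changes while the room is silent would point at the inputs (the sound detection or the transcription trigger) rather than at the display itself.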

But the problem is that they do not accurately represent what's really happening, and the behavior is a bit random.

Here are my observations. (I always launch it from the VS debugger, but I think the behavior is the same when running the .exe.)

  1. When STT is first activated, it always starts at (B) even though there is no voice input, and it stays stuck at (B) until I speak several times. (Yes, I waited long enough for the ggml model to load.) When it unfreezes from (B), it outputs several of the strings I spoke, all bunched together.
  2. After the initial hiccups, it becomes more responsive. However, it sometimes cycles through (A), (B), and (C) on its own without any sound input.
  3. I then wait until it stabilizes at (A) and start speaking again. Sometimes it goes to (B) immediately, and sometimes it doesn't. Then it starts cycling through (A), (B), and (C) on its own again.

Here is another observation/question. I see the following logs in the VS Console Output. (image) It seems to recreate the same threads indefinitely. Can you please tell me what these threads are for? Perhaps the instability is related to these threads constantly being recreated?

Many thanks.

VRCWizard commented 1 year ago
  1. The log will now let you know when Whisper is starting up and when it is actually ready for audio. However, audio recorded during startup will still be processed once Whisper has fully started, as you observed. https://github.com/VRCWizard/TTS-Voice-Wizard/releases/tag/v1.5.1

  2. The states (A, B, C) should definitely not be random. I mentioned to you on Discord that you can turn on "Filtered Text Appears in Log" to see when Whisper heard sounds that were not voices. For the state display, though, any sound picked up shows as "Listening, Voice". Whenever "Listening, Transcribing" appears, something should always appear in the log if you have "Filtered Text Appears in Log" enabled.

  3. Same as 2 above.

  4. Threads exiting
     What it means: https://stackoverflow.com/a/12410591
     How to remove the spam: https://stackoverflow.com/a/19199801

chrisk414 commented 1 year ago

Thanks for the info. I'll let you know if I find ways to improve it.

BTW, regarding #4, I understand that these are threads exiting. But what is the thread for? The output doesn't tell me anything about the thread itself. If you can point me to the thread in the source, I'll take a look to understand it better. I was thinking that, if the thread is just going to be recreated indefinitely, it might be better for it to stay in a loop instead of exiting.
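
To illustrate the "stay in a loop" idea, here is a minimal Python sketch of a single long-lived worker thread that pulls audio chunks from a queue instead of a new thread being created per chunk. It is not based on the TTS Voice Wizard source: `run_worker`, the `None` sentinel, and the fake "transcription" string are all hypothetical stand-ins, and whether the exiting threads in the VS output are even created by the app's own code is not established in this thread.

```python
import queue
import threading


def run_worker(jobs, results):
    """Process chunks from `jobs` until a None sentinel arrives.

    The thread stays alive between chunks, blocking on jobs.get(),
    so no thread is created or destroyed per chunk.
    """
    while True:
        chunk = jobs.get()
        if chunk is None:
            # Sentinel: the producer asked the worker to shut down.
            break
        # Placeholder for real transcription work (hypothetical).
        results.append(f"transcribed {len(chunk)} bytes")


jobs = queue.Queue()
results = []

worker = threading.Thread(target=run_worker, args=(jobs, results), daemon=True)
worker.start()

# Feed two fake audio chunks, then signal shutdown and wait for the worker.
for chunk in (b"abc", b"defgh"):
    jobs.put(chunk)
jobs.put(None)
worker.join()
```

The trade-off is that a dedicated thread sits idle while no audio arrives, whereas short-lived pooled threads cost nothing when idle but produce "thread exited" lines in a debugger's output.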