Sharrnah / whispering

Whispering Tiger - OpenAI's Whisper (and other models) with OSC and WebSocket support, allowing live transcription / translation in VRChat and overlays in most streaming applications
MIT License

Unable to get Speech 2 Text running smoothly in VRchat #15

Closed · DillonSimeone closed 1 year ago

DillonSimeone commented 1 year ago

I'm completely deaf, so I was hoping that this would give me a way to see what the heck VRchat players are talking about in front of mirrors.

My computer specs: CPU: Ryzen 5 3600XT, GPU: RTX 3070 Ti, RAM: 32GB at 3600 MHz

I have the CUDA Toolkit installed and have tried selecting CUDA for the speech to text, with VB-input audio set as the input, which seems to be picking up audio from VRchat since the orange bar moves when people are talking in the background. Most of the time the speech detection doesn't even get tripped. It just goes processing... processing... no output!

I've tried the small and medium model sizes. The large size lagged me out.

I can see it working so much faster in this gif from your plugin: https://user-images.githubusercontent.com/55756126/236357319-8769c88d-f9bb-492c-8be8-89a20e521792.gif

I'm like, what witchery is this? My experience is nothing like this so far.

Usage attempt:

  1. Start up VRchat
  2. Start up Tiger Whisper UI
  3. Create a new default profile in Tiger Whisper, select the speech 2 text model, and touch nothing else
  4. Stare at the speech detection going off randomly in the UI without any output 90% of the time

I've tried tweaking the settings randomly with no success. I'm not sure where the problem is. Can you share your best VRchat speech 2 text profile.yaml?

Sharrnah commented 1 year ago

Hello, I am aware that the UI can sometimes lag a bit behind. I will investigate that further. But the output of the plugin should not be lagging, since it is handled directly by the Python app without the WebSocket messaging in between.

There are some settings under Advanced -> Settings that can speed up realtime transcription.

Also, in the example you mention, I am running the translate task of the Speech 2 Text A.I., so it does not have to run additionally through the Text Translation A.I. But Speech 2 Text can only translate into English. You can set this in the Speech Task dropdown in the Speech 2 Text tab.
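
To make that a bit more concrete, here is a rough sketch of what that choice corresponds to. The key name is only my assumption based on the CLI's --task flag; normally you just pick the task from the dropdown:

```yaml
# Illustrative sketch only -- the key name is assumed from the CLI's --task flag,
# not taken verbatim from the actual settings file.
task: translate     # Whisper translates speech directly into English
# task: transcribe  # keeps the output in the spoken language
```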

If you are using Virtual Reality, I noticed that Windows likes to prioritize the game window if it is in focus. You can try to focus the Whispering Tiger window, disable Windows Game Mode, or change the Windows graphics settings for the application.

In VRChat, you might have better results if you disable the background music as well. If it skips parts, reducing vad_confidence_threshold can help.
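
For example, in the profile .yaml that setting could look roughly like this (the value shown is just an illustration, not the shipped default):

```yaml
# Sketch of the setting mentioned above; the value is only an example.
# Lower values make voice activity detection more sensitive, so fewer
# quiet parts of speech get skipped.
vad_confidence_threshold: 0.3
```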

Also be aware that too many people speaking at the same time can cause issues for the A.I. There is no good solution around it as far as I am aware (I am already playing with speaker diarization to improve this). In VRChat, using the Earmuff mode can help with this.

Let me know if these tips helped.

DillonSimeone commented 1 year ago

Sorry that I didn't mention my headset. I'm using the Quest 2 via Virtual Desktop connected to my desktop. It works flawlessly for me in PCVR games.

All right. I've just looked over your guides and documents again.

I just saw that you wrote VR VRchat players are supposed to use `audioWhisper\audioWhisper.exe --model medium --task transcribe --energy 300 --osc_ip 127.0.0.1 --phrase_time_limit 9`

I get interesting outputs from that. I'm ignoring the first few errors since I don't want OCR. Just speech to text. error.txt

There appears to be something missing in the .cache folder (screenshots attached).

Does that folder seem fine to you?

Aside from that, this is how my sound is set up, in case it's that. I can't check with my ears, so... (screenshots attached)

It looks right to me, except that I've managed to somehow configure Cable Output to also play through my speakers... I'm careful to stay muted in VRchat, so I think I can ignore that.

Trying from the UI side, I can see the Audio Input bar flickering back and forth with VRchat running in the background, so I assume it's working. (screenshot attached)


Hm. Something changed since I slept; the whisper model is now getting hung up on loading. I've waited for five minutes with no change. Yesterday it was loading in a few seconds.

I looked through the folder and didn't see any log files that I could attach to this for debugging purposes. When I get back home from work tonight, I'll try deleting Whisper UI and reinstalling; I'll check this thread before doing that in case the event logs are saved somewhere outside of the folder.

Once I get the whisper model loading again, I'll apply your recommended advanced settings.

I can make a new Discord account and hop into Discord if you prefer, by the way.

Sharrnah commented 1 year ago

Hi.

I just saw that you wrote VR VRchat players are supposed to use `audioWhisper\audioWhisper.exe --model medium --task transcribe --energy 300 --osc_ip 127.0.0.1 --phrase_time_limit 9`

I never said I recommend running Whispering Tiger without the UI. Actually, I would recommend using the UI since it makes it much easier to configure and use. Where did I write that? Guess I need to make that clearer then. 😬


Does that folder seem fine to you?

Folder looks fine to me.


Hm. Something changed since I slept; the whisper model is now getting hung up on loading. I've waited for five minutes with no change. Yesterday it was loading in a few seconds.

Can you hide the loading dialog and go to "Advanced -> Log"? If something failed, it should definitely show up there. There you can also enable writing a log file. (Just don't forget to disable it again when everything works, or the file will grow indefinitely.)


You can also join the Discord Server I created, if you are interested. https://discord.gg/V7X6xa2B2v

DillonSimeone commented 1 year ago

The command line I brought up is from this repo, not Whisper Tiger! Turns out that I'm just getting my wires crossed between the two repos. (screenshot and link attached)

Ok, if we're not supposed to use the command line like that with the UI, I'll have to mention that subtitling did not appear in VR with OSC enabled for me like it did in your gif.

Luckily, I just went straight to bed after work, so I still have the somehow broken Tiger Whisper sitting in my system!

... Ah, here we go. The whisper AI model somehow got corrupted. Strange! (screenshot attached)

Sent the .cache to /dev/null.

Hm, you can see what's happening for me on start-up: https://www.youtube.com/watch?v=24ml9Gef4-8&feature=youtu.be Most of the time was spent waiting for downloads to complete, only for them to break. Eh, does the audio sound right to you? It looks like sound was recorded, but I haven't really touched audio stuff before, so I'm not sure.

I closed everything down, reopened Tiger Whisper, and got this. (screenshot attached)

It seems to be stuck at 0% now.


Hm, I purged the entire Whisper Tiger UI folder and restarted from step 0, and got this error: (screenshot attached)

Also stuck at 0% now: (screenshot attached)

Did the download servers blow up?

Sharrnah commented 1 year ago

Hi. The sound seems to be echoing a lot. Maybe there is some routing going on where audio comes from the game, goes into the Audio Virtual Cable, and back into the PC audio again, etc. But it could also just be a recording issue. 🤷

Also, I am aware of the error in the YouTube video and hope it is fixed in the next update. But it should work if you restart the app; it should continue the download where it stopped.

Not sure about the TLS error though. I tested the URL and it works for me. My guess would be some form of proxy or firewall that does something odd with secure HTTP connections. But then I would guess it should not have worked before either, so maybe one of the servers had an issue. I will look into making it retry on these errors as well, which hopefully fixes that too.


It also seems you have enabled the realtime model in Advanced -> Settings. (screenshot attached)

That's not required for Realtime mode and mostly just eats more RAM. It's more of an experimental feature to run two Speech 2 Text A.I.s at the same time to improve realtime performance (at least if the realtime model is smaller than the model used for the final transcription).

Maybe I broke something there. Will test that again and fix it.

Edit: Also, the GIF you probably mean was not recorded in VR. To show subtitles in VR, you can use an HTML overlay from inside the websocket_clients directory. Add one you like in Desktop+ as a Browser overlay with transparency support.

You can find more info here: https://github.com/Sharrnah/whispering/blob/main/documentation/usage.md#desktop-currently-only-new-ui-beta-with-embedded-browser

Sharrnah commented 1 year ago

@DillonSimeone were you successful in the meantime?

Do you need any more help?

The last update might improve the downloading a bit, and the issue when using a separate realtime Whisper model whose size is not yet downloaded is fixed as well.

DillonSimeone commented 1 year ago

Ooh, a new update!

All right. Hmm. You can see me wrestling with the GUI here: https://www.youtube.com/watch?v=irwYp8AJxS0

It occurred to me that using a youtube video would be easier for testing.

I couldn't get it going fast enough on default settings, and even your suggested settings did not really change things. Did you see how muting the cable input suddenly fixed everything? Weird. That was just a random thing I tried.

Sharrnah commented 1 year ago

Thank you for the Video.

You had the loopback device selected. Loopback devices are only useful if you don't do audio routing with Voicemeeter. You can now just select your normal PC audio device with [Loopback] in its name to record your PC audio without any additional software like Voicemeeter or Audio Virtual Cable, and you don't need to change your Windows default audio device anymore.

Also, you never enabled Realtime mode as far as I have seen. Without that enabled, it will always wait to process the audio until it notices a speaking pause or hits the time limit (however you configured it).
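
As a rough sketch of what that means in the settings (the key names here are guesses based on the CLI flags like --phrase_time_limit, not necessarily the exact names in your profile):

```yaml
# Illustrative sketch only -- key names are assumptions based on the CLI flags.
realtime: true          # push intermediate transcriptions while you are still speaking
phrase_time_limit: 9    # without realtime mode, audio is only processed after a
                        # speaking pause or once this many seconds have passed
```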

And lastly, I think you are confused about the text overlapping on the right side. That's a known issue I have with the UI framework I use. In theory it's fixed in the new UI framework version, but the new version is too memory hungry, so I kept it on the old version for now. I might change to a different UI framework in the future. Haven't decided on that yet.

For a better text display, you can use the Subtitle Plugin or open one of the overlays in your browser. You can find the overlays in the application folder inside the websocket_clients/ folder.


Btw., you had Text 2 Speech enabled, so it always started speaking the transcription out loud. It could be that, since you used the Audio Virtual Cable as the default Windows device and had the Text to Speech output on default as well, it also transcribed the TTS again, and so on.

But that should not be an issue anymore if you revert your Windows audio config back to your regular audio device and just use its [Loopback] representation in Whispering Tiger. (But maybe still disable Text to Speech if you don't need it. :))

DillonSimeone commented 1 year ago

Whoa! It's starting to work! It still lags out a bit, but it's actually generating live captioning locally without needing a supercomputer! Amazing!

https://www.youtube.com/watch?v=1k6Y7qWyKuE&feature=youtu.be

I've disabled Text to Speech, all right. Yeah, I'll use alternative methods for that; I have my eyes on this: https://github.com/suno-ai/bark The way you can just have it laugh... My hearie families are now worried about being able to tell what's real and what isn't. My advice to them? "If the fake is good enough, it doesn't matter!"

Yeah, switching UI frameworks is very painful. 100% understandable. I'll grab the plugin when I'm able to play games for a bit.

It sounds like I could tweak the settings a bit more to get it faster. Hmm. I've just zipped up my .yaml and attached it below in case anyone wants to use it.

Nightmare.zip

Sharrnah commented 1 year ago

That's great.

For Bark you can use the Bark Plugin for Whispering Tiger. (resulting in a barking Tiger 🙈 )

https://github.com/Sharrnah/whispering/blob/main/documentation/plugins.md#list-of-plugins

It even has the better voice-cloning implemented (through a separate project because of Python dependency hell), though I still need to document the plugin a bit better.

Bark is just sometimes a bit random, like adding words and such things.


Edit: Had a quick look at your yaml file. What you most likely can try is using float16 precision, which should be faster on CUDA than float32.
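
As an illustration (the key name is an assumption on my part; the actual entry in your yaml may be named differently):

```yaml
# Hypothetical excerpt -- the exact key name may differ in the real settings file.
whisper_precision: float16   # usually faster on CUDA GPUs than float32
```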

Sharrnah commented 1 year ago

@DillonSimeone I will close this. If you have any other questions, feel free to open a new issue, or you can also join the Discord Server.

As a last bit of information: I updated the Bark plugin with a new vocoder, so the audio quality should be way better. I also updated the example audio on the plugin listing.