RapidRabbit-11485 / PNGTuber-GPT

This is a custom C# action for Streamer.bot and Speaker.bot to add a GPT-based PNGTuber to your stream!
MIT License

Support voice triggers #6

Open RapidRabbit-11485 opened 10 months ago

RapidRabbit-11485 commented 10 months ago

Support triggering the bot with a spoken wake word, like Alexa, and use speech-to-text to route the spoken request to GPT.

RapidRabbit-11485 commented 10 months ago

This likely needs to be written into a custom TTS client and be a 2.0 thing. It can use the NAudio framework. I've looked into the native support in Speaker.bot, and it's terrible. I've also looked into VoiceAttack and .NET Speech Services. All of these options were pretty poor at recognizing most trigger words. .NET Speech Services yielded the best results when Windows 10 had been trained on the specific keyword, on top of the normal training, but even then it only detected the keyword about 70% of the time. The best path forward seems to be a cloud-based speech-to-text API that supports streaming: send it all of the speech and determine actions as the transcript flows back. Attempts to do most of the speech analysis locally, sending only the actual prompt that follows the trigger word, have not yielded consistent results.
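To make the streaming idea concrete, here is a minimal sketch of the "determine actions as the data flows back" step. It is in Python rather than the project's C#, purely for illustration. It assumes the speech-to-text service delivers `(text, is_final)` results, which is a common shape for streaming APIs but not any specific vendor's format. The `extract_prompts` function and the `"hey bot"` wake phrase are hypothetical names, not part of PNGTuber-GPT.

```python
import re
from typing import Iterable, Iterator, Tuple


def _normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9']+", " ", text.lower()).split())


def extract_prompts(
    results: Iterable[Tuple[str, bool]],
    trigger: str = "hey bot",  # hypothetical wake phrase
) -> Iterator[str]:
    """Yield the words spoken after the trigger phrase.

    `results` is an assumed stream of (text, is_final) transcript results.
    Interim results are skipped because they may still be revised.
    Yielded prompts are normalized (lowercase, punctuation removed).
    """
    trigger_norm = _normalize(trigger)
    armed = False  # True when the trigger ended an utterance with no prompt yet

    for text, is_final in results:
        if not is_final:
            continue
        words = _normalize(text)
        if armed:
            if words:
                armed = False
                yield words  # the whole next utterance is treated as the prompt
            continue
        # Pad with spaces so the trigger only matches on whole-word boundaries.
        padded = f" {words} "
        idx = padded.find(f" {trigger_norm} ")
        if idx == -1:
            continue  # ordinary speech, nothing to route
        prompt = padded[idx + len(trigger_norm) + 2:].strip()
        if prompt:
            yield prompt
        else:
            armed = True  # trigger heard alone; wait for the next utterance


if __name__ == "__main__":
    stream = [
        ("hey bo", False),
        ("Hey bot, what's the weather?", True),
        ("just chatting with stream", True),
        ("Hey, bot!", True),
        ("tell me a joke", True),
    ]
    for prompt in extract_prompts(stream):
        print(prompt)  # each prompt would be forwarded to GPT here
```

Matching on the transcript rather than on raw audio means the quality of trigger detection is whatever the cloud recognizer provides, at the price of transcribing everything that is said.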

RapidRabbit-11485 commented 7 months ago

I have evaluated this from multiple angles, and it is a tough nut to crack. There really isn't any software out there that handles voice triggers as well as Google Speech-to-Text. OpenAI's Whisper API is also interesting. The catch is that we are only trying to detect the trigger words, yet that requires sending the audio for the entire stream for processing, much as if we were doing closed captions. This isn't the end of the world, mind you, but it is expensive for no good reason. Ideally a custom client app could fix this if .NET Speech were better at detecting the trigger words, but in testing it's not great. With training it's a little better, but we are still talking hit rates in the low 70s (percent). Windows is just not Alexa. I imagine there are paid libraries that would do a better job, but as an open source project we need to stay within the bounds of free and licensable. I'm still investigating options and am open to suggestions from the community. This is definitely a 2.0/PNGTuber-GPT Pro thing, though. I don't see it making it into the base solution without some other app being involved, because Streamer.bot's voice detection is absolutely horrible: it is based on .NET Speech / VoiceAttack in the same way.
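As a rough illustration of the cost concern, the sketch below compares billing for transcribing an entire stream against billing only for the audio that follows a trigger. It assumes the service bills per minute of audio. All function names are hypothetical, and the rate is a caller-supplied parameter because no real pricing is given in this thread.

```python
def always_on_cost(stream_hours: float, price_per_minute: float) -> float:
    """Cost of sending every minute of the stream to the speech-to-text API."""
    return stream_hours * 60 * price_per_minute


def trigger_only_cost(
    prompt_count: int, avg_prompt_seconds: float, price_per_minute: float
) -> float:
    """Cost if only the audio after each trigger word were sent.

    This is the ideal case that reliable local trigger detection would enable.
    """
    return prompt_count * avg_prompt_seconds / 60 * price_per_minute


if __name__ == "__main__":
    rate = 0.5  # placeholder value, NOT a real vendor price
    full = always_on_cost(stream_hours=4, price_per_minute=rate)
    ideal = trigger_only_cost(
        prompt_count=30, avg_prompt_seconds=10, price_per_minute=rate
    )
    print(f"always-on: {full}, trigger-only: {ideal}")
```

With any rate plugged in, the ratio between the two depends only on how much of the stream is actual prompts, which is why always-on transcription costs so much relative to what is used.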