chrisrude / oobabot

A Discord bot which talks to Large Language Model AIs running on oobabooga's text-generation-webui
MIT License

Voice chat #32

Open SpardaKnight opened 1 year ago

SpardaKnight commented 1 year ago

Feature request. Would it be possible to add elevenlabs_tts and whisper_stt to the discord bot so you could talk to it and have it reply with voice? I know these features work with the web ui.

chrisrude commented 1 year ago

Yup, it's on the roadmap! (You know, the one in my head.) It had been dependent on me getting more hardware, which just happened to show up yesterday!

I'll likely want to get some bugfixes and plugin integration working first, but I plan to get started on it in not too long.

SpardaKnight commented 1 year ago

You are a legend! Thank you for writing this code. Can't wait.

chrisrude commented 1 year ago

Thanks! :). Let's keep this open until I get it in, for tracking.

jmoney7823956789378 commented 1 year ago

Worth noting there are open-source voice generation programs too, such as so-vits and tortoise-tts. I haven't touched much of it since around March, so I'm not certain whether this would be worth the trouble, especially for our singular hero dev, but I'm all for completely self-hosted solutions.

chrisrude commented 1 year ago

Always welcome to have others contribute as well! I don't have to be a hero, just trying to get it off the ground, really. :)

Personally, what I want is an OSS solution, and I'm open to package suggestions. We'll need packages for both speech-to-text (STT) and text-to-speech (TTS).

So far it seems like OpenAI's Whisper is the best option for STT, and it's MIT-licensed. https://github.com/openai/whisper
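For reference, basic transcription with the openai/whisper Python package looks roughly like this (a minimal sketch; the model size and file name are placeholders):

import whisper

# load one of the pretrained checkpoints; "base.en" is a small English-only model
model = whisper.load_model("base.en")

# transcribe a local audio file and print the recognized text
result = model.transcribe("clip.wav")
print(result["text"])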

There were a few other options I looked at as well, such as https://github.com/mozilla/DeepSpeech and https://github.com/coqui-ai/STT (by some of the same people that did DeepSpeech). But it seems CoquiAI has decided not to invest any more in STT, or so they told me. Other options didn't have a great accuracy rate, at least as far as my testing went.

An alternative to whisper would be ESPNet (https://github.com/espnet/espnet), but I haven't played around with that yet. It's under Apache 2.0, so the license works out.

In terms of TTS, tortoise-tts has the best results I've seen of any of the OSS projects. It's Apache 2.0 as well. I believe ESPNet may also offer something here.

Commercial offerings produce better-sounding TTS results, at least to me, but I'm not happy depending on them for a few reasons:

  1. this project is all about self-hosting
  2. the space is changing so fast that I suspect a year from now commercial offerings will be completely different, so it's a moving target at best

That said, it might make sense further down the line to also support 3rd-party offerings for either of these features, simply because of the hardware requirements. Running Oobabooga + whisper + tortoise-tts may be pretty demanding on a single-GPU system, especially if inference needs to happen fast enough to engage in a live voice conversation. Hence my desire to wait until I had more hardware available myself.

But for now, I would like to focus on OSS, self-hosted offerings for this feature.

Would welcome any thoughts or experiences in this area that others have!

jmoney7823956789378 commented 1 year ago

I agree totally with your words on TTS. I've actually been messing with the whisper STT options in oobabooga a little and I'm not disappointed at all. Not sure how you'll implement the listening and responding... unless you plan on having it join a VC, listen, and respond, like an always-available assistant (or nuisance, like the one named Neuro-sama).

Personally, I am using an AMD MI60 (32GB) GPU for my text generation, image generation, and voice generation, sitting at about 53% VRAM usage with a 30B 4bit model loaded. I don't have any issues running all these demanding models, but I understand my case may be an outlier.

I haven't heard of or used ESPnet; I'll test it out and see what results I get.

sidonsoft commented 1 year ago

Bark TTS can sometimes produce very good results and is open source. https://github.com/suno-ai/bark

I have played with this one a little. I think oobabooga might have implemented some support for it; I'm not sure, I'll have to look into it. I have also heard of another TTS that is practically real-time.
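For reference, Bark's quickstart is roughly the following (a sketch based on the suno-ai/bark README; the prompt and output file name are placeholders):

from bark import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav

# download and cache the Bark models on first use
preload_models()

# generate speech for a short prompt and save it as a WAV file
audio_array = generate_audio("Hello! This is a test of Bark text-to-speech.")
write_wav("bark_out.wav", SAMPLE_RATE, audio_array)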

jmoney7823956789378 commented 1 year ago

Tortoise-TTS has an API too. Worth noting: with the "Ultra Fast" preset and half precision, I can manage 5 words in 5.5 seconds. It's not quite going to keep up 1:1 with text generation, but it doesn't have to. If only it would start generating voice at the same time the text comes out.
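For context, generating a clip with the tortoise-tts Python API and the ultra_fast preset looks roughly like this (a sketch; the voice name and text are placeholders, and options like half precision vary by tortoise version):

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()  # enable half precision / kv_cache here if your version supports it

# load reference clips and conditioning latents for one of the bundled voices
voice_samples, conditioning_latents = load_voice("tom")

# generate with the fastest preset; output is a 24 kHz waveform tensor
gen = tts.tts_with_preset(
    "Five words in five seconds.",
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="ultra_fast",
)
torchaudio.save("tortoise_out.wav", gen.squeeze(0).cpu(), 24000)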

chrisrude commented 1 year ago

Making some progress!

Things I've learned:

Current plan is to add a binary component which will be the "driver" part: Songbird plus whisper.cpp. Its job will be to take in audio streams, send them to whisper.cpp, and generate a transcript. It will also be responsible for playing the response output, however that gets generated. The MVP version will likely just use something simple, to be improved later of course.
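As a rough, purely hypothetical illustration of the driver's job (the real component uses Songbird and isn't Python): buffer decoded voice audio, hand a chunk to whisper.cpp, and collect the transcript. Here that is faked by writing a WAV file and shelling out to whisper.cpp's example main binary:

import subprocess
import wave

def transcribe_chunk(pcm_bytes: bytes, model_path: str) -> str:
    """Hypothetical stand-in: dump 16 kHz mono 16-bit PCM to a WAV file,
    run whisper.cpp's `main` example on it, and return its stdout."""
    with wave.open("chunk.wav", "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)       # 16-bit samples
        wav.setframerate(16000)   # whisper models expect 16 kHz input
        wav.writeframes(pcm_bytes)
    result = subprocess.run(
        ["./main", "-m", model_path, "-f", "chunk.wav"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout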

I'll take a look at Bark, etc. when I get to that stage. Thanks for the recommendations!

chrisrude commented 1 year ago

And... the very first speech-to-text, where oobabot is listening to the audio from a Discord voice channel, is working! (screenshot attached)

Urammar commented 1 year ago

Is this actually implemented now? Any updates from a few weeks back? Would love this feature

Urammar commented 1 year ago

Guessing this bad boy can't post audio to the chat itself? How do you have it running in a voice room, then?

(screenshot attached)

chrisrude commented 1 year ago

Update:

I've been hard at work on this. It's a little tricky to get right, and it took a few rounds to get to a place I'm happy with.

Progress as of 6/22

Most of the work so far has been happening in https://github.com/chrisrude/discriviner and on the oobabot-audio fork in this repo.

Discrivener is a separate CLI app that joins the Discord audio channel and transcribes audio. It runs in real time on CPU only, so it can work on single-GPU systems. It's written in Rust, rather than Python, for performance reasons. The underlying transcription is done by whisper.cpp.

chrisrude commented 1 year ago

If you want to check out what's been done so far on the python side, I just merged the audio branch back into main. https://github.com/chrisrude/oobabot/pull/52

chrisrude commented 1 year ago

For the super intrepid, there are bleeding-edge binaries available at github.com/chrisrude/discrivener.

Here's a brief preview of how to test this out. Currently, only Intel-based Linux and OSX are supported.

Download a discrivener-json binary

Currently there are binaries for 64-bit x86 Linux (kernel 3.2+, glibc 2.17+) as well as 64-bit Intel macOS (10.7+, Lion+). You can download them here.

Download and extract espeak-ng-data.tar.gz

From the same page, download espeak-ng-data.tar.gz and extract it under /usr/local/share. You should wind up with a /usr/local/share/espeak-ng-data directory with a bunch of *_data files in it.
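For example, assuming the tarball has espeak-ng-data as its top-level directory:

sudo tar -xzf espeak-ng-data.tar.gz -C /usr/local/share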

Download model

Download a model of your choice from https://ggml.ggerganov.com/. I recommend ggml-model-whisper-base.en-q5_1.bin as a starting point.

Configure oobabot

First, upgrade oobabot to version 0.2.0 (or later).

If you're using the command-line oobabot, you'll want to regenerate your config.yml with

mv config.yml config.yml.orig && oobabot -c config.yml.orig --gen > config.yml

If you're using the oobabot-plugin, just upgrade to version 0.2.0 of the oobabot-plugin, then switch to the Advanced tab.

Now, in your config.yml, under discord:, there should be two new keys:

discrivener_location -- set this to the full path and filename of the discrivener-json-* file you downloaded

discrivener_model_location -- set this to the full path and filename of the ggml-model-*.bin file you downloaded
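For example, the relevant part of config.yml might end up looking like this (the paths are placeholders for wherever you put the files):

discord:
  discrivener_location: /home/me/bin/discrivener-json
  discrivener_model_location: /home/me/models/ggml-model-whisper-base.en-q5_1.bin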

Restart oobabot, or else restart the oobabooga service (needed to show the new 'Audio' tab).

Your bot should restart and if everything works, you should see something like:

2023-07-07 03:40:50,320  INFO Discrivener found at ...some path.../discrivener-json
2023-07-07 03:40:50,320  INFO Discrivener model found at ...some path.../ggml-base.en.bin

And then later:

2023-07-07 03:40:53,224 DEBUG Registering audio commands
2023-07-07 03:40:53,773  INFO Registered command: lobotomize: Erase (bot)'s memory of any message before now in this channel.
2023-07-07 03:40:53,773  INFO Registered command: say: Force (bot) to say the provided message.
2023-07-07 03:40:53,774  INFO Registered command: join_voice: Have (bot) join the voice channel you are in right now.
2023-07-07 03:40:53,774  INFO Registered command: leave_voice: Have (bot) leave the voice channel it is in.

If this is your first time, it may take 5 to 10 minutes for Discord to recognize the two new commands: /join_voice and /leave_voice.

Then, to test: use the /join_voice command while you're in a voice channel.

The bot should look for the voice channel you're currently in and join it.

The bot will then stay in that channel until it receives a /leave_voice command, or until the bot is stopped.

Talk to the bot!

If you're using the oobabot-plugin (recommended), you'll now see the audio logged to the "Audio" tab, with a few seconds of latency.

If you are the only other person in the channel, the bot will respond to every message it hears.

If there is more than one human in the channel, the bot will only respond if it receives a wakeword, and then with a very low percentage chance afterwards. This is still a work in progress.

chrisrude commented 1 year ago

Example of what the transcript UI looks like (using the oobabot-plugin): (screenshot attached)

This is what the bot is hearing and saying; the transcript just helps you see what's going on.

TruthSearchers commented 1 year ago

Thanks @chrisrude for this. It's so cool, bro. I will try it out.

chrisrude commented 1 year ago

btw, I've been having occasional issues using the uploaded discrivener binaries. If you're seeing weird audio glitching, just let me know and don't spend too much time on it; it's likely a problem with the compilation process.

Things work fine when I build locally, but it's harder to make the distributed binary work well everywhere.

TheMeIonGod commented 1 year ago

For TTS I have been messing around with silero-api-server and it has been pretty solid.

jmoney7823956789378 commented 1 year ago

> For TTS I have been messing around with silero-api-server and it has been pretty solid.

I heard it performs pretty good, but it's still a non open source platform.

TheMeIonGod commented 1 year ago

> > For TTS I have been messing around with silero-api-server and it has been pretty solid.
>
> I heard it performs pretty good, but it's still a non open source platform.

What do you mean? Like the models?

jmoney7823956789378 commented 1 year ago

> > > For TTS I have been messing around with silero-api-server and it has been pretty solid.
> >
> > I heard it performs pretty good, but it's still a non open source platform.
>
> What do you mean? Like the models?

Sorry, I was thinking of something else. I just looked at it and I'll test it out tomorrow if I get the chance. Looks promising, what's the delay like? Any options for custom trained voices?

TheMeIonGod commented 1 year ago

> > > > For TTS I have been messing around with silero-api-server and it has been pretty solid.
> > >
> > > I heard it performs pretty good, but it's still a non open source platform.
> >
> > What do you mean? Like the models?
>
> Sorry, I was thinking of something else. I just looked at it and I'll test it out tomorrow if I get the chance. Looks promising, what's the delay like? Any options for custom trained voices?

Yeah, someone made a server API for Silero to use in SillyTavern, since TGW doesn't stream audio. On a Xeon E5-2430 (6 cores, 12 threads), which is from 2012, I get about 8x real-time speed (1 second for 8 seconds of audio). Custom voices are theoretically possible, but I have heard you need 5 or more hours of good audio to train one. It comes with 117 English voices, though.
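For anyone who wants to poke at the underlying models directly rather than through the api-server, the snakers4/silero-models README shows usage roughly like this (a sketch; the speaker ID and sample rate are assumptions based on the v3 English package):

import torch

# download the v3 English package (the one that bundles the ~117 voices)
model, example_text = torch.hub.load(
    repo_or_dir="snakers4/silero-models",
    model="silero_tts",
    language="en",
    speaker="v3_en",
)
model.to(torch.device("cpu"))

# synthesize one utterance; returns a 1-D audio tensor
audio = model.apply_tts(
    text="Hello from the bot!",
    speaker="en_0",
    sample_rate=48000,
)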

jmoney7823956789378 commented 1 year ago

Looks like it works pretty good, actually. Unfortunately it currently breaks the bot if it's enabled in webui.

TheMeIonGod commented 1 year ago

> Looks like it works pretty good, actually. Unfortunately it currently breaks the bot if it's enabled in webui.

In the WebUI having it enabled breaks anything that uses the API. So, that is why you use the actual API server not the one that comes with the WebUI. https://github.com/ouoertheo/silero-api-server

jmoney7823956789378 commented 1 year ago

> > Looks like it works pretty good, actually. Unfortunately it currently breaks the bot if it's enabled in webui.
>
> In the WebUI having it enabled breaks anything that uses the API. So, that is why you use the actual API server not the one that comes with the WebUI. https://github.com/ouoertheo/silero-api-server

I gotcha, so the API server is only sent text by sillytavern running on the client device (currently), but this functionality could definitely be put into oobabot. This is pretty promising, I'm wondering how far @chrisrude is in their current implementation of the other TTS/STT solution.

TheMeIonGod commented 1 year ago

> > > Looks like it works pretty good, actually. Unfortunately it currently breaks the bot if it's enabled in webui.
> >
> > In the WebUI having it enabled breaks anything that uses the API. So, that is why you use the actual API server not the one that comes with the WebUI. https://github.com/ouoertheo/silero-api-server
>
> I gotcha, so the API server is only sent text by sillytavern running on the client device (currently), but this functionality could definitely be put into oobabot. This is pretty promising, I'm wondering how far @chrisrude is in their current implementation of the other TTS/STT solution.

I have made a couple of things using it, like an Alexa-like assistant and an endless-conversation bot. I have it so the audio is streamed directly from the server, so there is no temp file. All it needs is the text to be made into speech, the language, and what voice it is going to use (it also takes "random" as a speaker), plus of course what the host is, which means you can run it on a separate PC. I am not sure how Discord's API handles audio playback, but I feel like it would be a good fit for this project as it is fast and not bad sounding. Silero also has a lightweight STT that I have used, but if you are running it on a server, Whisper would likely be better in terms of error rate.

jmoney7823956789378 commented 1 year ago

> I am not sure how Discord's API handles audio playback

The dev is currently implementing his own playback, but it's the same idea as music streaming bots on discord. You can have them stream audio directly to the voice channel with no problem. Definitely looking forward to seeing if this gets added, since the voice generation is very quick!

nortenootaku commented 7 months ago

> > I am not sure how Discord's API handles audio playback
>
> The dev is currently implementing his own playback, but it's the same idea as music streaming bots on discord. You can have them stream audio directly to the voice channel with no problem. Definitely looking forward to seeing if this gets added, since the voice generation is very quick!

Would it be possible to add https://github.com/serp-ai/bark-with-voice-clone to the Discord bot so the bot could answer with custom voices? (Also Windows support, since it's currently Linux and Mac only.)