CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

I #1180

Closed. tdlio closed this issue 12 months ago.

remghoost commented 1 year ago

So, I too am on the hunt for something like this.

tortoise-tts seems decent, but even the "fast" variant takes around 10 seconds to generate a 3-second audio clip. That's probably usable; I just haven't gotten it to actually work. I tried a month or so ago and gave up, then tried again tonight and no dice. Both of the repos are abandonware as well. If you get it to work, please let me know. I might try again at some point.

I've recently stumbled upon Mycroft, which seems to be an entire platform for this sort of thing. Updated in the past few weeks as well. I haven't tried it yet though.

If I had any experience with torch, I'd be into making this sort of thing, but I severely doubt my 1060 6GB would be up for training voice models. It can't even train SD models, lol.

Anyways, just wanted to pass along some information I've found on this topic.

I have a dream of piping my voice through Whisper (for speech-to-text), into LLaMA via the Oobabooga repo (the new poster boy on the block for ChatGPT clones), then back out through a cloned voice via TTS.

Essentially, a locally run virtual assistant that I could talk to (see the rough sketch below). Pair that with an image recognition suite for LLaMA (which could more than likely be done) and you'd have a voice-controlled computer. Sure, AutoHotkey and the like exist, but you can do so much more with LLMs than with simple pre-programmed hotkeys.
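The shape of that loop, with stubs standing in for the real components (none of these names are real APIs, just placeholders for whisper, a local LLaMA server, and a cloned-voice TTS):

```python
# Stubs standing in for the real components; none of these are real APIs.
def speech_to_text(audio: bytes) -> str:
    return "turn on the lights"          # whisper would transcribe here

def generate_reply(text: str) -> str:
    return f"Okay, handling: {text!r}"   # LLaMA via Oobabooga would go here

def text_to_speech(text: str) -> bytes:
    return text.encode()                 # the cloned-voice TTS would go here

def assistant_turn(mic_audio: bytes) -> bytes:
    """One turn of the assistant: ears -> brain -> voice."""
    return text_to_speech(generate_reply(speech_to_text(mic_audio)))

print(assistant_turn(b"...raw mic audio..."))
```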

Best of luck on your search! Let me know if you find anything that works for you.

remghoost commented 1 year ago

You can shoot me a message on Reddit if you'd like. I'm not too active on Discord anymore. Here's my Reddit account.

Just finishing up one of the pieces for that project. I got whisper to work in "real-time" by holding a key, recording the mic input to a temp file, transcribing that with whisper, then "typing" it out using pyautogui.typewrite. I'll probably put it on GitHub in the next day or so; a sketch of the loop is below.
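Roughly this shape, assuming the openai-whisper, sounddevice, keyboard, and pyautogui packages (my actual buffer handling differs a bit):

```python
import tempfile

import keyboard                 # hotkey state
import numpy as np
import pyautogui                # simulated typing
import sounddevice as sd
import whisper
from scipy.io import wavfile

SAMPLE_RATE = 16000             # whisper works on 16 kHz mono audio
model = whisper.load_model("base")

def record_while_held(key: str = "f9") -> np.ndarray:
    """Block until the hotkey goes down, then record until it's released."""
    chunks = []
    keyboard.wait(key)
    with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                        callback=lambda indata, *_: chunks.append(indata.copy())):
        while keyboard.is_pressed(key):
            sd.sleep(50)        # poll the key every 50 ms while recording
    return np.concatenate(chunks)

while True:
    audio = record_while_held()
    # Dump the buffer to a temp wav so whisper (via ffmpeg) can load it.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        wavfile.write(f.name, SAMPLE_RATE, audio)
    text = model.transcribe(f.name)["text"].strip()
    pyautogui.typewrite(text)   # "type" the transcript at the cursor
```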

AJMarsh1 commented 1 year ago

I have a project I'm working on where I'd like to train a model similar to this on a voice sample, and then have it speak dynamically generated text. Higher quality than this would be great, but the uncanny-valley aspect of mid-quality voice cloning is also fine, actually. I've got a 99%-completed CSCI degree and a DS certificate, and I'd be down to develop a basic solution, since we're basically all looking for an open-source version of the same thing. My Discord is Lego#3891, and my Reddit is... CrabbyAlmond. Had to dig deep for that, lol

SantosXP commented 1 year ago

https://github.com/VOICEVOX/voicevox_engine is the only project I've found that is better than ElevenLabs, but it works in Japanese only.

diep1920 commented 1 year ago

I have the same question for ElevenLabs voice conversion. Example: https://www.youtube.com/watch?v=17_xLsqny9E

remghoost commented 1 year ago

Okay, I have found a solution: https://github.com/gitmylo/audio-webui

Training is straightforward, the webui is Gradio (like A1111), and I'm getting pretty decent generations from models trained on 7 seconds of audio at 300 epochs. It's not amazing, though; it could be that I haven't figured out how to "prompt" for it yet and need to tweak some settings.

It only has like 100 stars, but it was updated last week. Seems really promising and also seems to do everything I want it to (at the moment).

It has some UI quirks that I'm not the biggest fan of (in regards to sending models trained between tabs, and needing to generate base audio before running it through the voice cloning), but the core functionality is there.

Easy to set up:

```
git clone https://github.com/gitmylo/audio-webui
```

Then literally just use the `run.bat` inside it.

-=-=-=-=-=-=-=-

It also has 40-something API endpoints, so it would be fairly easy to handle all of the API calls with another "overlord" interface.

I have a working "real-time" whisper app that I've made on my computer (and it doesn't seem to crash my audio interface anymore, haha). So the loop is: that, fed into KoboldAI running a Wizard LLM (which could be side-fed into Stable Diffusion for image generation), then that output sent to this "audio-webui", and finally play the audio. A rough sketch of the glue is below, haha.
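Something like this, as plain HTTP glue. KoboldAI's /api/v1/generate is its real local endpoint; the audio-webui route and payload here are placeholders (check its API docs for the real ones):

```python
import requests

KOBOLD = "http://localhost:5000"      # KoboldAI's default local address
AUDIO_WEBUI = "http://localhost:7860"

def ask_llm(prompt: str) -> str:
    # KoboldAI exposes POST /api/v1/generate for text generation.
    r = requests.post(f"{KOBOLD}/api/v1/generate",
                      json={"prompt": prompt, "max_length": 200})
    return r.json()["results"][0]["text"]

def speak(text: str) -> bytes:
    # Placeholder endpoint/payload; audio-webui's real TTS route may differ.
    r = requests.post(f"{AUDIO_WEBUI}/api/tts", json={"text": text})
    return r.content

transcript = "what's on my calendar today"   # would come from the whisper step
reply = ask_llm(transcript)
with open("reply.wav", "wb") as out:
    out.write(speak(reply))                  # then play reply.wav back
```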

Plus, you could have "watchers" for keyboard commands to tie it all into the Windows API and control your computer that way too. Actually, it could be really solid for accessibility...

Sure, you can do all of these things with Google API sorts of things, but this can be entirely hosted locally. What a fascinating time to be alive.

Anyways, if someone steals this idea and beats me to the integration, at least tag me in it! haha.

I'll release the whisper "real-time" code on my GitHub at some point. It's pretty handy: just a hotkey (F9, right now) that you hold, talk into, then release. It keeps the audio buffer, runs it through whisper, then outputs the text. I also have it "type" the result out for you, which is pretty handy when trying to give ChatGPT a specific long prompt.