erew123 / alltalk_tts

AllTalk is based on the Coqui TTS engine, similar to the Coqui_tts extension for Text generation webUI; however, it supports a variety of advanced features, such as a settings page, low VRAM support, DeepSpeed, a narrator, model finetuning, custom models, and WAV file maintenance. It can also be used with third-party software via JSON calls.
GNU Affero General Public License v3.0

[Questions] How to use finetune + enhancement functions + discord #108

Closed Jack74r closed 6 months ago

Jack74r commented 6 months ago

Hi, thanks for this project and the documentation. These aren't issues, just questions; I don't know if I can post them here, sorry if not.

I've managed to create a finetuned model, but how do I select it? Do I have to replace the existing one?

Is it possible to use the Improve output quality and Resemble enhancement functions, like in text-generation-webui?

Is there a discord for this project?

erew123 commented 6 months ago

Hi @Jack74r

Discord etc. You're fine to ask questions here, it's not an issue! And no, I don't have a Discord; I don't have enough spare time to set it up and maintain it. I get quite a few questions on Reddit too, so between that and here it keeps me busy enough.

Finetuned models If you are using a finetuned model within Text-generation-webui, you will get an option within the interface to load that model: https://github.com/erew123/alltalk_tts?tab=readme-ov-file#-using-a-finetuned-model-in-text-generation-webui

[Screenshot: finetuned model option in the Text-generation-webui interface]

If you wish to use the finetuned model generally, you will need to copy it over the top of the existing base model.
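If it helps, something along these lines would do that copy from Python (an untested sketch; the folder names and file list are assumptions based on this thread, so check them against your own install and keep a backup of the base model):

```python
# Untested sketch: copy a finetuned XTTS model over the base model so AllTalk
# loads it by default. The folder names (models\trainedmodel, models\xttsv2_2.0.2)
# and the file list are assumptions based on this thread - check your own install
# and back up the original model first.
import shutil
from pathlib import Path

alltalk = Path(r"C:\text-generation-webui\extensions\alltalk_tts")  # adjust to your install
finetuned = alltalk / "models" / "trainedmodel"       # assumed finetuned output folder
base = alltalk / "models" / "xttsv2_2.0.2"            # assumed base model folder
backup = base.with_name(base.name + "_backup")

if not backup.exists():
    shutil.copytree(base, backup)                     # keep an untouched copy of the base model

for name in ("model.pth", "config.json", "vocab.json"):  # typical XTTS model files
    src = finetuned / name
    if src.exists():
        shutil.copy2(src, base / name)                # overwrite the base model file
        print(f"copied {name}")
```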

Improved quality When you say improved quality, in what way do you mean? Higher Hz output, e.g. 44100Hz? And when you say "Resemble enhancement functions like text-generation-webui", I know of Resemble Enhance, but I don't know of it being used in text-generation-webui; do you have something you could point/link me to?

Thanks

311-code commented 6 months ago

Thanks for the info, I also have a question about this repo's version. Currently I am using a custom ElevenLabs v2 extension I made, and the voice cloning is amazing in oobabooga. I stopped using the built-in base Coqui TTS because I could never get good results even after finetuning it (I was probably doing something wrong).

I am wondering whether with this extension I could maybe come close to ElevenLabs v2, and whether it has any hidden stability, similarity, style, or "vocal boost"-like options in a file somewhere, like the three ElevenLabs v2 sliders. ElevenLabs v2 is getting too expensive, but my voice was perfectly recreated on there. Any help and info on this would be really appreciated.

Edit: I just followed your video, read the documentation link, and have my voice finetuning now, so I deleted my original questions. I guess my only remaining question is the one about a file for adjusting variables similar to the style sliders/vocal boost.

Update: I trained the model, tested it (it already sounds better than vanilla to me, possibly because of the presets and easy-to-follow instructions) and chose the first copy option. I'm having an issue though: in text-generation-webui, when I choose XTTSv2 FT it seems to just play the narrator voice, even though I disabled the narrator. I am also unsure of where to choose sample 7, because that one sounded most like me when I tested during the finetuning. The finetuned model and wav samples don't show up in the voice list when I refresh it.

erew123 commented 6 months ago

Hi @brentjohnston

You will need to go into the finetuned model folder \alltalk_tts\model\finetuned\ and in there you will find your wav samples folder. Simply copy the wav sample you want to use to the \alltalk_tts\voices\ folder. I don't set it to automatically move the samples over to this folder, because some people do crazy amounts of training and end up with 100+ voice sample files.
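For example, something like this would copy every sample over in one go (an untested sketch; the paths follow the folders mentioned above, so adjust them to your own install):

```python
# Untested sketch: copy every finetuned wav sample into the AllTalk voices folder
# so it shows up in the voice list. Paths follow the folders mentioned above;
# adjust them to your own install.
import shutil
from pathlib import Path

alltalk = Path(r"C:\text-generation-webui\extensions\alltalk_tts")
samples = alltalk / "model" / "finetuned"   # the wav samples folder lives under here
voices = alltalk / "voices"

for wav in samples.rglob("*.wav"):
    shutil.copy2(wav, voices / wav.name)
    print(f"copied {wav.name}")
```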

I appreciate I could probably tidy this process up a little or document it further (I'll make a mental note to do that).

As for emotion control, within the XTTS model there is no direct way to influence it. I may add other models in the future that do have those kinds of options.

When you say it just plays the Narrator, are you just using this in the standard chat window of Text-gen? With a character card?

Thanks

311-code commented 6 months ago

Thanks for the reply. I did get it working by copying the model over the xttsv2_2.0.2 folder and just using the XTTSv2 Local button. It was playing the narrator even though it was disabled when I had XTTSv2 FT selected at first; it seems like the XTTSv2 FT button wasn't linking to the newly created trainedmodel folder for some reason.

For that voices folder, I copied them all over, but the originals like arnold are still in there. Do those affect anything or should I be deleting them? Thanks a lot, I can already tell I will probably spend an entire day here tinkering with the sample selection, training, temperature, and Repetition Penalty sliders haha.

If you know of any lesser-known settings that affect the voice in a script somewhere (not mapped to Gradio), I would love to know of any other settings worth messing with.

Edit: Also, I noticed the Whisper v3 option for the finetuning but a note about v2 being better; should I still use v2 then? Also, for a 4090, what do you think would be optimal settings for epochs, batch size, grad accumulation steps, and max permitted audio size if I have 15 minutes of clips over 10 files? Sorry for all the questions. Thanks again!

erew123 commented 6 months ago

Hi @brentjohnston

I'm just making a few small changes to the finetuning, mostly documentation, to make things a bit more visible and clear for people. Otherwise nothing of any consequence.

Whisper v2 is the best one to use; in fact, I'm just setting it as the default. It appears to split the audio down better than the v3 model, which seems to favour generating larger output wav files, and that isn't always ideal for training.

The only other things that you can play about with within the model are:

top_k: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 50.

top_p: Lower values mean the decoder produces more "likely" (aka boring) outputs. Defaults to 0.8.

Though you are welcome to look here https://docs.coqui.ai/en/latest/models/xtts.html#tts-model-api

These are not very well documented; in fact, what I've given you above is about the extent of the documentation on them for the XTTS model. The only way to change these settings is within the config.json file, which is stored alongside the model, and once they are changed there you would have to unload and reload the model (so it's a bit of an annoying process).
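For example, something along these lines would tweak those values (a rough sketch; the config path is an assumption based on this thread, and whether top_k/top_p sit at the top level of the XTTS config.json or under a nested section may differ, so inspect your file first):

```python
# Rough sketch: adjust top_k / top_p in the model's config.json, then unload and
# reload the model for the change to take effect. Whether these keys sit at the
# top level or under a nested section of the XTTS config may differ, so inspect
# your config.json first - this only touches keys that already exist.
import json
from pathlib import Path

config_path = Path(r"C:\text-generation-webui\extensions\alltalk_tts\models\xttsv2_2.0.2\config.json")
config = json.loads(config_path.read_text(encoding="utf-8"))

new_values = {"top_k": 30, "top_p": 0.7}   # example values; the defaults are 50 / 0.8
for key, value in new_values.items():
    if key in config:
        config[key] = value

config_path.write_text(json.dumps(config, indent=4), encoding="utf-8")
print("config.json updated - unload and reload the model to apply")
```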

As for the original sample files, they are perfectly fine to stay in the voices folder and be used for TTS generation.

Re the best settings for finetuning: I've set the recommended settings within the finetuning, however there is no absolute mathematical "these are the best settings". It depends on what you are training. If you are training an entirely new language, you need hours of audio and about 1000 epochs. If you are just training a normal human voice in, say, English, you will probably find the base settings work fine, though you may choose to go to 20 or 40 epochs, or even double-train the model. If you were training, say, a cartoon character voice that is English but doesn't quite sound like a normal human voice, you may need 80 epochs, or however many it takes until it sounds correct.

Just like teaching a human something new, there is no way to say exactly how long it will take to learn this new thing to a level you find acceptable.

Typically, with a normal human voice, I'd suggest trying 20 epochs at most and seeing if the model then sounds like the person you trained it on. If not, do a second round of training. 10 epochs will probably do it, though.

Hope that helps

Thanks

311-code commented 6 months ago

Thanks for that info, I will try it out for sure. Also, an update: I just trained with Whisper v2 this go-around and also mixed in a few two-minute audio clips downloaded from the nearly perfect ElevenLabs v2 representation of my voice.

I loaded it with the original samples used on ElevenLabs and trained over the first finetuning I did earlier, and that one was "ok" for a first try on the base model. I just tested this second trained one with the chat, and it is actually somehow better than ElevenLabs v2; I am pretty mind-blown right now. It has more emotion in the voice and sounds exactly like me. No idea how this repo hasn't blown up, you made it very easy. Or maybe it has and nobody has issues, haha.

Now the only problem I have: the built-in sd_api_pictures extension doesn't seem to work right now when sending pictures to the chat via the automatic1111 API. A bit ago it read me a 12-minute clip of gibberish characters from the cmd window (I think that gibberish comes from the picture generation). If I can just get that to work again, this would be pretty much flawless.

It seems like there's some sort of compatibility issue where the chat says it is recording a message and just sits there; the picture does eventually come through, but the audio is not reading the right thing.

erew123 commented 6 months ago

The SD issue is due to how Text-gen passes things around in the backend. I've had a long chat with someone else on this: https://github.com/erew123/alltalk_tts/issues/69

I've yet to set up stripping out and pushing the images back in, which should clear up the issue. There were just too many things going on at the time with other bits to make the change and resolve it.

I'll try to get it done sometime soon.
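For anyone curious what that fix would roughly look like: the idea is to pull any image markup out of the message before it is narrated and re-attach it afterwards. A purely hypothetical sketch (the tag pattern and function names are illustrative, not AllTalk's actual code):

```python
# Hypothetical sketch of the planned fix: strip image markup out of the message
# before it is sent to TTS, then put it back once the audio has been generated.
# The tag pattern and function names are illustrative, not AllTalk's actual code.
import re

IMG_PATTERN = re.compile(r'<img[^>]*>|!\[[^\]]*\]\([^)]*\)')  # HTML <img> tags or markdown images

def strip_images(text):
    """Return the text with image markup removed, plus the removed tags."""
    images = IMG_PATTERN.findall(text)
    return IMG_PATTERN.sub("", text), images

def restore_images(text, images):
    """Re-append the removed image markup after TTS has run on the clean text."""
    return text + "".join(images)

message = 'Here is your picture <img src="file/outputs/cat.png"> hope you like it!'
speakable, imgs = strip_images(message)
# ... generate TTS from `speakable` only ...
final_output = restore_images(speakable, imgs)
print(final_output)
```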

erew123 commented 6 months ago

Hi @Jack74r

Did my earlier response help you or do you need more information?

Thanks

Jack74r commented 6 months ago

Yes thank you

311-code commented 6 months ago

> I've yet to set up stripping out and pushing the images back in, which should clear up the issue. There were just too many things going on at the time with other bits to make the change and resolve it.

Thank you, that would be awesome. The reason I would use this personally is that I am DreamBooth-training images of a product for testing, with around 80 photos; it would be really great if I could request an image from the chat and have it pop up.

I'm trying to test this concept out and also finetune a local LLM on my company data and use my voice for the chat. Thanks a lot.