Fine-Tuning Guide to Add a New Speaker

AI4Bharat / Indic-TTS

Text-to-Speech for languages of India

MIT License


Fine-Tuning Guide to Add a New Speaker #12

Open harshvardhan-truefan opened 1 year ago

harshvardhan-truefan commented 1 year ago

Hi, I really love the work done in this repo; it has been really helpful. Just a request: could you please add more documentation on fine-tuning the models for a new voice using the available model checkpoints? It is not very clear how to fine-tune the model on a new dataset.

Thanks in advance!

Regards, Harsh

h2210316651 commented 5 months ago

I'd like one too. I haven't explored the codebase yet, but I think it's based on Coqui TTS, so my guess is that similar training and fine-tuning methods would apply. There are some resources online detailing how to add a new speaker for Coqui. I'll explore further and keep you guys posted.

sachin7695 commented 4 weeks ago

@h2210316651 did you try to figure out the fine-tuning of Indic-TTS?

h2210316651 commented 4 weeks ago

@sachin7695 I explored Coqui in depth and decided it's way too complex for me to use. I have switched to RVC v2: I just run TTS on the content I want synthesized, then use RVC to change the speaker's voice. All you need is a 15-minute sample for a pretty good quality voice clone.

h2210316651 commented 4 weeks ago

@sachin7695 there's another project called Applio, please check it out.

sachin7695 commented 4 weeks ago

@h2210316651 I tried Coqui XTTS v2 for fine-tuning on Hindi and was able to do it; I explored that in depth. The thing is, the FastPitch in Indic-TTS and the FastPitch in Coqui TTS are somewhat different, as I saw when I checked the model state. That's why I'm pretty interested in how Indic-TTS fine-tuning works. Anyway, thanks a lot for your reply.
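For anyone who wants to repeat the model-state check mentioned above, here is a minimal sketch of one way to compare two checkpoints. It works on plain `{parameter_name: shape}` mappings so it runs without any dependencies; the function name `diff_state_shapes` and the parameter names in the example are hypothetical, and how you pull the state dict out of each checkpoint file (for example, which key it sits under) is an assumption you would need to verify for both projects.

```python
def diff_state_shapes(a, b):
    """Compare two {parameter_name: shape} mappings.

    Returns (names only in a, names only in b, {name: (shape_a, shape_b)}
    for names present in both but with different shapes).
    """
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    mismatched = {
        name: (tuple(a[name]), tuple(b[name]))
        for name in sorted(set(a) & set(b))
        if tuple(a[name]) != tuple(b[name])
    }
    return only_a, only_b, mismatched


# With PyTorch installed, the mappings could be built roughly like this
# (assumption: the loaded object is, or contains, a name -> tensor dict):
#   state = torch.load(path, map_location="cpu")
#   shapes = {name: tuple(t.shape) for name, t in state.items()}

# Hypothetical parameter names, for illustration only.
indic = {"encoder.weight": (256, 80), "pitch.bias": (1,)}
coqui = {"encoder.weight": (384, 80), "duration.bias": (1,)}
only_indic, only_coqui, mismatched = diff_state_shapes(indic, coqui)
```

Names that appear on only one side, or that have different shapes, are the layers that would block loading one project's checkpoint directly into the other's model.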

sachin7695 commented 3 weeks ago

@h2210316651 bro, I fine-tuned Coqui XTTS v2 for different Indic languages such as Odia and Bangla. If you want to explore it, just ping me. You just need to train the tokenizer (a byte-pair encoding model) on Indic transcriptions.
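To make "train the tokenizer on Indic transcriptions" concrete, here is a toy, dependency-free sketch of byte-pair encoding training. It is not the XTTS training code: it starts from Unicode code points, learns the most frequent adjacent pairs from whitespace-split transcript lines, and applies them in order. The function names and the sample Hindi lines are made up for illustration; for a real run you would most likely use an established tokenizer library and the vocabulary format the model expects.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(lines, num_merges):
    """Learn up to `num_merges` BPE merges from transcript lines (toy version)."""
    # Each word starts as a tuple of single Unicode code points.
    words = Counter()
    for line in lines:
        for word in line.split():
            words[tuple(word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word, merges):
    """Tokenize one word by applying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Made-up sample transcripts, for illustration only.
merges = train_bpe(["नमस्ते नमस्ते", "नमक"], num_merges=3)
tokens = encode("नमक", merges[:1])
```

Because the merges are learned from your own transcripts, frequent character sequences in the target script end up as single tokens, which is the point of retraining the tokenizer for a new language.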