FunAudioLLM / CosyVoice

Multilingual large voice generation model, providing full-stack inference, training, and deployment capabilities.
https://funaudiollm.github.io/
Apache License 2.0

Fine tune for English TTS #150

Open rishabh-cruv opened 4 months ago

rishabh-cruv commented 4 months ago

Hello! Thanks for this amazing open-source model. I have tried running TTS inference with a few reference voices (in English). The output sounds nearly accurate, but words are sometimes mispronounced, dropped, or added. How can I fine-tune the model to fix this?

aluminumbox commented 4 months ago

Well, in this case you can try replacing the problem word with other words that have the same pronunciation. Of course, fine-tuning will improve things, because the model may not have seen such words during training. Follow example/libritts/run.sh to fine-tune on a speaker.
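The word-replacement workaround can be applied as a preprocessing step before the text reaches the TTS model. The following is only a minimal sketch: the `RESPELLINGS` table and the `respell` helper are hypothetical names made up for illustration, not part of CosyVoice, and the entries would need to be chosen by listening to which words your own outputs get wrong.

```python
import re

# Hypothetical respelling table (an assumption for illustration only):
# each key is a word the model tends to mispronounce, each value is a
# spelling that sounds the same when read aloud.
RESPELLINGS = {
    "cache": "cash",
    "queue": "cue",
    "SQL": "sequel",
}


def respell(text, table=RESPELLINGS):
    """Replace whole-word, case-insensitive matches with their respelling."""
    if not table:
        return text
    lookup = {word.lower(): alt for word, alt in table.items()}
    # Longest keys first so a longer entry wins over a shorter prefix.
    alternatives = "|".join(
        re.escape(word) for word in sorted(lookup, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: lookup[m.group(0).lower()], text)


print(respell("Clear the cache, then check the SQL queue."))
# prints: Clear the cash, then check the sequel cue.
```

The respelled string is what you would pass to the model as the text to synthesize. Matching on word boundaries keeps words such as "cached" untouched.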

rishabh-cruv commented 4 months ago

I'm looking to fine-tune ZeroShotTTS (3 s reference), not a specific speaker. Would example/libritts/run.sh be enough to get good results? (Overall, I want to improve English TTS and be able to use anybody's voice via zero-shot.)

rishabh-cruv commented 4 months ago

@aluminumbox Any guidance on how to improve this would be appreciated.

rishabh-cruv commented 4 months ago

@aluminumbox Hello, I'm unable to fully utilize my GPU: it always uses 23 GiB out of 40. Also, is there any way to resume training once it has stopped?
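On the resume question, a common general approach is to restart training from the most recently saved checkpoint. The sketch below only shows how one might locate that file. It assumes checkpoints are written to a single directory as `epoch_<N>.pt`, which is a hypothetical naming scheme; check what your training run actually writes. How the path is then handed back to the training script depends on the options that script exposes, which this sketch does not assume.

```python
import re
from pathlib import Path

# Assumed (hypothetical) checkpoint naming scheme: epoch_<N>.pt
CKPT_PATTERN = re.compile(r"^epoch_(\d+)\.pt$")


def latest_checkpoint(model_dir):
    """Return (epoch, path) for the highest-numbered checkpoint, or None."""
    best = None
    for path in Path(model_dir).iterdir():
        match = CKPT_PATTERN.match(path.name)
        if match is None:
            continue  # skip files that do not follow the assumed scheme
        epoch = int(match.group(1))
        if best is None or epoch > best[0]:
            best = (epoch, path)
    return best
```

Comparing the parsed epoch numbers as integers avoids the usual pitfall of sorting filenames as strings, where `epoch_10.pt` would sort before `epoch_2.pt`.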