coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

Fine-tuning for custom voice not giving results #1215

Closed. DesiKeki closed this issue 2 years ago.

DesiKeki commented 2 years ago

Discussed in https://github.com/coqui-ai/TTS/discussions/1208

Originally posted by **DesiKeki** February 7, 2022

Hello TTS Community,

I am trying to fine-tune `tts_models--en--ljspeech--tacotron2-DDC_ph` for my own voice. The documentation at https://tts.readthedocs.io/en/latest/finetuning.html gives some instructions to get it working, like using a smaller learning rate and a minimum of 100 audio samples. But I would like to know from the experience of the community what the recommended _dos_ and _don'ts_ are for good results. For example, answers to the following questions would be really helpful for newbies like me:

1. Are 100 audio samples sufficient?
2. What is an appropriate learning rate?
3. What is the recommended length (number of characters) of each audio sample?
4. Should the batch size and number of epochs also be changed? (The current defaults in the config file are 64 and 1000 respectively.)
5. Are there any other changes that should be made in the config.json file? (A sketch of the kind of edit I mean is at the end of this post.)
6. Is `tts_models--en--ljspeech--tacotron2-DDC_ph` a good model to fine-tune for a custom voice?

These are my doubts before starting the training. And once training is running, it continues for a while, so what should be noted or monitored to make sure everything is going fine? For example, in the evaluation metrics, what is an acceptable average loss or average align error?

Thanks in advance!
Keki
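To make question 5 concrete, here is the kind of config.json edit I have in mind: a minimal sketch that copies the downloaded config and overrides a few fine-tuning fields with the plain `json` module (no TTS API). The field names (`lr`, `batch_size`, `epochs`, `run_eval`), the model folder path, and the output filename are assumptions based on my local download and may differ between TTS releases, so please check your own config.json.

```python
# Minimal sketch (not an official recipe): copy the downloaded config.json and
# override a few fields before passing the copy to TTS/bin/train_tts.py via
# --config_path. Field names and the model folder are assumptions; verify them
# against the config.json that ships with your TTS version.
import json

SRC = "/home/ubuntu/.local/share/tts/tts_models--en--ljspeech--tacotron2-DDC_ph/config.json"
DST = "finetune_config.json"  # hypothetical output name

with open(SRC) as f:
    cfg = json.load(f)

cfg["lr"] = 1e-4          # smaller learning rate, as the fine-tuning docs suggest
cfg["batch_size"] = 16    # smaller batches for a ~100-sample dataset (assumption)
cfg["epochs"] = 1000      # keep the default, but stop early if eval loss plateaus
cfg["run_eval"] = True    # keep evaluation on so loss/align error can be monitored
# The "datasets" entry also has to point at your own metadata file and wav folder.

with open(DST, "w") as f:
    json.dump(cfg, f, indent=4)
```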

============================================================

Originally posted by DesiKeki February 8, 2022

An update (not a very good one) on my post above: I tried fine-tuning tts_models--en--ljspeech--tacotron2-DDC_ph using the suggested method:

```
CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_tts.py \
    --config_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/config.json \
    --restore_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/model_file.pth.tar
```

But the resulting best model did not perform at all. It only generated noise for every text.

- I used 100 audio samples as my training data.
- Reduced the learning rate to 0.0001.
- Trained for the default 1000 epochs.
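For anyone who wants to check whether a resulting checkpoint produces more than noise, a minimal sanity check could look like the sketch below. It assumes the `Synthesizer` class from `TTS.utils.synthesizer` and uses placeholder paths for my training run's output directory (both are assumptions about my setup, not part of the documented fine-tuning steps); without a vocoder it should fall back to Griffin-Lim, so some quality loss is expected.

```python
# Minimal sanity check (a sketch, not part of the fine-tuning docs): load the
# fine-tuned checkpoint plus its config and synthesize one sentence.
# The paths below are placeholders for the training output directory.
from TTS.utils.synthesizer import Synthesizer

synthesizer = Synthesizer(
    tts_checkpoint="output/run/best_model.pth.tar",  # placeholder path
    tts_config_path="output/run/config.json",        # placeholder path
    use_cuda=False,
)

wav = synthesizer.tts("This is a quick test of the fine-tuned model.")
synthesizer.save_wav(wav, "test.wav")
```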

The training data and the config.json file can be accessed at https://cutt.ly/VOZ0wB5

I'd really appreciate it if anyone could tell me what I'm missing here.

Thanks,
Keki

Edresson commented 2 years ago

This is being discussed in #1208. I will close this issue for now; if you need anything, you can reopen it.