Originally posted by **DesiKeki** February 7, 2022
Hello TTS Community,
I am trying to fine-tune tts_models--en--ljspeech--tacotron2-DDC_ph for my own voice. The documentation at https://tts.readthedocs.io/en/latest/finetuning.html gives some instructions for getting it working, such as using a smaller learning rate and a minimum of 100 audio samples.
But I would like to know, from the community's experience, the recommended _do's_ and _don'ts_ for good results. For example, answers to the following questions would be really helpful for newcomers like me:
1. Are 100 audio samples sufficient?
2. What should be the appropriate learning rate?
3. What is the recommended length (number of characters) of each audio sample?
4. Should we also change the batch size and number of epochs? (The current defaults in the config file are 64 and 1000, respectively.)
5. Are there any other changes that should be made in the config.json file?
6. Is tts_models--en--ljspeech--tacotron2-DDC_ph a good model to fine-tune for a custom voice?
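Questions 1 and 3 depend on what the dataset actually contains, so it can help to measure it before asking for numbers. The sketch below is only an illustration: it assumes an LJSpeech-style metadata file with pipe-separated `id|text` rows (the layout of the data in this thread is not stated), and `transcript_stats` is a hypothetical helper, not part of the TTS library.

```python
import statistics


def transcript_stats(lines):
    """Summarize transcript lengths from LJSpeech-style 'id|text[|normalized]' rows."""
    lengths = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split("|")
        if len(parts) < 2:
            continue  # skip rows that have no transcript column
        # Use the last column (the normalized text in LJSpeech) for the length.
        lengths.append(len(parts[-1]))
    if not lengths:
        return {"samples": 0, "min_chars": 0, "max_chars": 0, "mean_chars": 0.0}
    return {
        "samples": len(lengths),
        "min_chars": min(lengths),
        "max_chars": max(lengths),
        "mean_chars": statistics.mean(lengths),
    }


if __name__ == "__main__":
    # Hypothetical rows; in practice read them from your own metadata file,
    # e.g. open("metadata.csv", encoding="utf-8").
    rows = [
        "clip_0001|Hello there.|Hello there.",
        "clip_0002|This is a longer sentence.|This is a longer sentence.",
    ]
    print(transcript_stats(rows))
```

Posting these numbers alongside the question makes it easier for others to judge whether the sample count and sentence lengths are reasonable.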
Those are my doubts before starting the training. The training also runs for a long time, so what should be noted or monitored along the way to make sure everything is going fine?
For example, in the evaluation performance metrics, what is an acceptable average loss or an acceptable average align error?
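This question does not assume any particular threshold. As a rough, generic way to watch a run, one could track whether the average evaluation loss is still improving rather than compare it to a fixed value. The helper below is a hypothetical sketch in plain Python; the loss values are made up and nothing here comes from the TTS library.

```python
def epochs_since_best(eval_losses):
    """Return how many epochs have passed since the lowest evaluation loss."""
    if not eval_losses:
        raise ValueError("need at least one loss value")
    # Index of the first occurrence of the minimum loss.
    best_index = min(range(len(eval_losses)), key=eval_losses.__getitem__)
    return len(eval_losses) - 1 - best_index


if __name__ == "__main__":
    # Made-up per-epoch average evaluation losses, for illustration only.
    losses = [1.0, 0.8, 0.7, 0.75, 0.9]
    print(epochs_since_best(losses))
```

A value that keeps growing suggests the evaluation loss has stopped improving, which is worth checking against the generated audio.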
Thanks in advance!
Keki
Originally posted by DesiKeki February 8, 2022
An update (not a very good one) on my post above: I tried fine-tuning tts_models--en--ljspeech--tacotron2-DDC_ph using the suggested method:
CUDA_VISIBLE_DEVICES="0" python TTS/bin/train_tts.py \
--config_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/config.json \
--restore_path /home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts/model_file.pth.tar
But the resulting best model did not work at all. It only generated noise for every text.
I used 100 audio samples as my training data.
I reduced the learning rate to 0.0001.
I trained for the default 1000 epochs.
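One thing that can be checked mechanically in the command above is whether the `--config_path` and `--restore_path` arguments point at the model named in the post. The sketch below assumes that each downloaded model sits in a directory named after the model, as the quoted paths suggest; `model_dir_name` and `paths_match_model` are hypothetical helpers, not part of the TTS tooling.

```python
from pathlib import PurePosixPath


def model_dir_name(path):
    """Return the name of the directory that directly contains the given file."""
    return PurePosixPath(path).parent.name


def paths_match_model(intended_model, config_path, restore_path):
    """Check that both paths sit in a directory named after the intended model."""
    return (
        model_dir_name(config_path) == intended_model
        and model_dir_name(restore_path) == intended_model
    )


if __name__ == "__main__":
    # Model name and paths copied from the post.
    intended = "tts_models--en--ljspeech--tacotron2-DDC_ph"
    base = "/home/ubuntu/.local/share/tts/tts_models--en--ljspeech--glow-tts"
    config_path = base + "/config.json"
    restore_path = base + "/model_file.pth.tar"
    print(paths_match_model(intended, config_path, restore_path))
```

With the paths quoted in this post the check prints `False`, because both arguments point at the glow-tts directory while the post names tacotron2-DDC_ph.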
Discussed in https://github.com/coqui-ai/TTS/discussions/1208
============================================================
The training data and the config.json file can be accessed at https://cutt.ly/VOZ0wB5
I would really appreciate it if anyone could tell me what I am missing here.
Thanks, Keki