CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Pretrained Models Using Datasets Other Than LibriSpeech? #877

Closed Tomcattwo closed 3 years ago

Tomcattwo commented 3 years ago

Hello all, @blue-fish, I had very good success on my project to clone 14 voices from a computer simulation (samples available here) using single-voice training (5,000 additional steps) on the LibriSpeech pretrained synthesizer (295k) and vocoder.

However, I would like to see if another model (in English) might provide better output reproducibility, punctuation recognition, and perhaps some greater degree of emotion (perhaps with LibriTTS or some newer corpus that I am not aware of yet). Are you aware of any pretrained speech encoder/synthesizer/vocoder models built on another dataset that might be available for download? I tried single-voice training of the synthesizer and vocoder on top of the LibriTTS synthesizer model, following your single-voice training instructions, but got only garbled output in the demo_toolbox, probably because the speech encoder was trained on LibriSpeech and not on LibriTTS. Any info you or anyone else might have on a potential model-set download would be greatly appreciated. Thanks in advance, Tomcattwo

ghost commented 3 years ago

To date, no one has shared an alternative pretrained model that is compatible with the current (PyTorch) synthesizer. If you're willing to switch back to TensorFlow 1.x, there are a few in #400, including one model on LibriTTS. However, you can consider training from scratch on LibriTTS with the current repo, since you have experience with the single-voice finetuning.
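
For orientation, the from-scratch synthesizer flow in this repo looks roughly like the sketch below. This is an assumption-laden outline, not verified commands: it presumes the usual script names (synthesizer_preprocess_audio.py, synthesizer_preprocess_embeds.py, synthesizer_train.py), a datasets_root containing LibriTTS, and a made-up run id "libritts_scratch"; check each script's --help for the exact arguments, including how to select LibriTTS rather than the default LibriSpeech.

import subprocess

datasets_root = "datasets_root"  # hypothetical path containing LibriTTS/

# 1. Extract mel spectrograms for the synthesizer (point it at LibriTTS;
#    see the script's --help for the dataset-selection argument).
subprocess.run(["python", "synthesizer_preprocess_audio.py", datasets_root], check=True)

# 2. Compute speaker embeddings with the existing (reused) encoder.
subprocess.run(["python", "synthesizer_preprocess_embeds.py",
                datasets_root + "/SV2TTS/synthesizer"], check=True)

# 3. Train the synthesizer from scratch under a new run id.
subprocess.run(["python", "synthesizer_train.py", "libritts_scratch",
                datasets_root + "/SV2TTS/synthesizer"], check=True)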

Tomcattwo commented 3 years ago

@blue-fish, thank you for the reply. If I were to try to train all three models (speaker encoder, synthesizer and vocoder) from scratch, using LibriTTS, would you recommend using train-clean-100 or train-clean-360? My understanding from reading the doctoral papers and Corentin's remarks is that for the encoder you need lots of voices, and quantity matters more than quality, whereas for the synthesizer and vocoder quality matters more than quantity. If I were to do this, training the synthesizer alone would take a week, but I may give it a go.

Any hints, tips or hparams settings you could share for such a project would be greatly appreciated. If I decide to try this, I would shoot for 300k steps to get down to a 1e-5 learning rate. Also, I have not tried any encoder training yet using this repo. Any helpful information or hparams you could share for that evolution?

I need to do a bit of research first on LibriTTS to see what it can and cannot do with respect to punctuation. If it will be no better than the current LibriSpeech-trained model, it may not be worth the time or effort. Your thoughts would be appreciated. Regards, TC2

ghost commented 3 years ago

You can reuse the existing encoder and vocoder models. When training the synthesizer, make sure not to change the audio settings in the synthesizer hparams.
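
For reference, the audio settings in question live in synthesizer/hparams.py and tie the synthesizer's mels to what the encoder and vocoder expect. The values below are the repo defaults as best I recall them (an assumption; verify against your checkout):

sample_rate = 16000     # shared with the encoder and vocoder
n_fft = 800
num_mels = 80
hop_size = 200          # 12.5 ms frame hop at 16 kHz
win_size = 800          # 50 ms window at 16 kHz
fmin = 55
min_level_db = -100
ref_level_db = 20
max_abs_value = 4.0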

Our observations on LibriTTS are in #449.

Since this is your first time training a model from scratch, I suggest decreasing the model dimensions and using a larger reduction factor. This will help the model train faster, at the expense of quality. When you are confident things are working, revert to the defaults.

tts_embed_dims = 256,
tts_postnet_dims = 256,
tts_lstm_dims = 512,

tts_schedule = [(5,  1e-3,  20_000,  26),   # Progressive training schedule
                (5,  5e-4,  40_000,  26),   # (r, lr, step, batch_size)
                (5,  2e-4,  80_000,  26),   #
                (5,  1e-4, 160_000,  26),   # r = reduction factor (# of mel frames
                (5,  3e-5, 320_000,  26),   #     synthesized for each decoder iteration)
                (5,  1e-5, 640_000,  26)],  # lr = learning rate
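
For anyone unfamiliar with the schedule format, here is a minimal sketch (not the repo's actual training loop) of how rows like these are consumed: training runs session by session, each row applying until its step threshold is reached.

# Abbreviated schedule from above: (r, lr, step, batch_size) per session.
tts_schedule = [(5, 1e-3, 20_000, 26),
                (5, 5e-4, 40_000, 26),
                (5, 1e-5, 640_000, 26)]

current_step = 25_000  # example: resuming mid-training

# Pick the first session whose step threshold has not been reached yet.
for r, lr, step, batch_size in tts_schedule:
    if current_step < step:
        print(f"train to step {step:,} with r={r}, lr={lr}, batch_size={batch_size}")
        break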

ghost commented 3 years ago

You can also decrease max_mel_frames to a lower number (like 500) to discard longer utterances. This will also increase training speed.
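
To put the cap in concrete terms, here is the arithmetic, assuming the default 200-sample hop at 16 kHz (verify against your hparams):

max_mel_frames = 500
hop_size = 200          # samples per mel frame (assumed repo default)
sample_rate = 16000

max_seconds = max_mel_frames * hop_size / sample_rate
print(f"utterances longer than ~{max_seconds:.2f} s are discarded")  # ~6.25 s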

Tomcattwo commented 3 years ago

@blue-fish, thanks for the reply. If I decide to go forward on this effort, I would plan to use train-clean-360: it's easier to download and smaller in size. After reading #449, I agree that limiting max_mel_frames to 500 is a good idea. Thanks also for the accelerated training hparams info. R/ TC2

ghost commented 3 years ago

I suggest using both train-clean-100 and 360 to more closely match the training of the pretrained models. If you decide to pursue this, good luck and please consider sharing your models.

Tomcattwo commented 3 years ago

@blue-fish said: "I suggest using both train-clean-100 and 360 to more closely match the training of the pretrained models."

How can I use both? Do I run training for 100k steps on train-clean-100, then train another 200k steps using train-clean-360 on top of that? Or can I simply combine them both together in my datasets_root and train the combination once to 300k steps?

@blue-fish also said: "If you decide to pursue this, good luck and please consider sharing your models."

Absolutely, assuming that the models come out sounding good. Happy to share plots, mid-training .wavs etc. upon request. Regards, TC2

ghost commented 3 years ago

"How can I use both? Combine them both together in my datasets_root and train the combination once to 300k steps?"

Exactly.
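
"Combining" just means placing both subsets side by side under datasets_root so that preprocessing picks them up together. A quick sanity check, assuming the LibriTTS/<subset> layout the preprocessing script expects:

from pathlib import Path

datasets_root = Path("datasets_root")   # your actual path
for subset in ("train-clean-100", "train-clean-360"):
    p = datasets_root / "LibriTTS" / subset
    print(p, "ok" if p.is_dir() else "MISSING")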

ghost commented 3 years ago

@Tomcattwo Did you end up pursuing this? If yes, how is the training coming along?

Tomcattwo commented 3 years ago

@blue-fish, I have not started on this project yet; I have a few other, semi-related projects working now. I read the LibriTTS corpus paper and it sounds interesting. Frankly, I have gotten very good results from the single-voice trained models I am using for my current project, but there's always room for improvement. I would love to be able to "help" the synthesizer using punctuation, to tell it where to place the emphasis on a syllable or syllables in a multi-syllabic word. I would like to give a TTS-built-from-scratch synthesizer base a try once I get some of these other projects behind me. I will let you know when I start and will keep you apprised of progress. No doubt I will hit some snags and will solicit your always-helpful advice. Regards, Tomcattwo

ghost commented 3 years ago

Pleased to know that you are satisfied with the single voice models. Please reopen this issue if you start training a model from scratch.