CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training a new model based on LibriTTS #449

Closed (ghost closed this issue 4 years ago)

ghost commented 4 years ago

@blue-fish, Would it be useful if I were to offer a GPU (2080 Ti) to help with training a new model based on LibriTTS? I have yet to train any models and would gladly exchange GPU time for an opportunity to learn. I wonder how long it would take on a single 2080 Ti.

Originally posted by @mbdash in https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/441#issuecomment-663076421

ghost commented 4 years ago

If I were to start again, I'd either keep punctuation or discard it entirely by switching back to LibriSpeech. Maybe increase the max mel frames (to 600 or 700) so the synth can train on slightly more complex sentences. So disregard the suggestions in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/449#issuecomment-671071445

Also, I did not notice much improvement in voice quality when I increased the tacotron1 layer sizes in #447 to be more in line with what we have in the current tacotron2: https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/8110552273afe6eb6093faaf701be5215a8285c9#diff-a20e5738bee4a9f617e9faabe4e7e17e

For my next model I will revert those changes and initialize the weights from fatchord's pretrained tacotron1 model in the WaveRNN repo, which was trained on LJSpeech. My results with fatchord's hparams (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/447#issuecomment-670673528) show that it is sufficient for voice cloning.
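
For reference, warm-starting from a pretrained checkpoint is usually done by copying over only the tensors whose names and shapes match the new model. A minimal sketch, assuming a fatchord-style checkpoint layout (the "model_state" key and the function name are illustrative, not the repo's actual loading code):

    import torch

    def warm_start(model, checkpoint_path):
        # Keep only pretrained tensors whose name and shape match the new model.
        checkpoint = torch.load(checkpoint_path, map_location="cpu")
        if isinstance(checkpoint, dict) and "model_state" in checkpoint:
            checkpoint = checkpoint["model_state"]
        own_state = model.state_dict()
        usable = {k: v for k, v in checkpoint.items()
                  if torch.is_tensor(v) and k in own_state and v.shape == own_state[k].shape}
        own_state.update(usable)
        model.load_state_dict(own_state)
        print(f"Initialized {len(usable)}/{len(own_state)} tensors from the checkpoint")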

ghost commented 4 years ago

@shoegazerstella Please make the change in https://github.com/blue-fish/Real-Time-Voice-Cloning/commit/5ad081ca25f9276fac31417c0bfce54c59c2a98f before testing your trained models with the toolbox. We should be using the postnet output for best results. The mel spectrograms and sample wavs from training already use the correct output so your training outputs are unaffected.
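
For context, the reason the postnet output matters: Tacotron 2's postnet predicts a residual that is added to the coarse decoder spectrogram, and inference should return that refined sum. A toy illustration of the idea (layer sizes and names are placeholders, not the repo's actual modules):

    import torch
    import torch.nn as nn

    # Toy postnet: a residual refinement applied to the decoder's coarse mel frames.
    postnet = nn.Sequential(
        nn.Conv1d(80, 512, kernel_size=5, padding=2), nn.Tanh(),
        nn.Conv1d(512, 80, kernel_size=5, padding=2),
    )

    mel_decoder = torch.randn(1, 80, 200)             # (batch, n_mels, frames) from the decoder
    mel_postnet = mel_decoder + postnet(mel_decoder)  # the output to pass to the vocoder / toolbox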

ghost commented 4 years ago

From https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-653996443

  • You can lower the upper bound I put on utterance duration, which I suspect has the effect of removing long utterances that are more likely to contain pauses (I formally evaluated that models trained this way generate long pauses less frequently). It also trains faster and has no drawbacks (with a good attention paradigm, the model can generate sentences longer than those seen in training).

Based on experience here, fixing the attention mechanism needs to be the first step. If we reduce the max utterance duration without fixing attention, then the resulting model will have trouble synthesizing long sentences. In other words, reducing duration of training utterances is the reward for implementing a better attention paradigm.

Some alternatives are discussed and evaluated in arXiv:1910.10288. In the meantime we should go back to max_mel_frames = 900 (and accept the gaps that come along with it).
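
One family of alternatives evaluated in that paper is GMM-based, location-relative attention, where the attention window can only move forward over the text. A rough sketch of the idea, simplified and not the paper's exact parameterization:

    import torch
    import torch.nn.functional as F

    def gmm_attention_step(mixture_params, prev_mu, memory_length):
        # mixture_params: (batch, 3*K) raw outputs of a small MLP on the decoder state.
        K = mixture_params.size(-1) // 3
        w, delta, sigma = mixture_params.chunk(3, dim=-1)
        mu = prev_mu + F.softplus(delta)    # component means can only move forward
        j = torch.arange(memory_length, dtype=mu.dtype, device=mu.device)
        scores = torch.softmax(w, dim=-1).unsqueeze(-1) * torch.exp(
            -((j - mu.unsqueeze(-1)) ** 2) / (2 * F.softplus(sigma).unsqueeze(-1) ** 2 + 1e-8))
        alignment = scores.sum(dim=1)       # (batch, memory_length) attention weights
        return alignment / (alignment.sum(dim=-1, keepdim=True) + 1e-8), mu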

ghost commented 4 years ago

(Removed pretrained model, it is no longer compatible with the current pytorch synthesizer.)

ghost commented 4 years ago

@shoegazerstella How is synthesizer training coming along?

shoegazerstella commented 4 years ago

@shoegazerstella How is synthesizer training coming along?

Hi @blue-fish, I am sharing plots and wavs. It has now gone well past 250k steps. What do you think of these results?

ghost commented 4 years ago

Hi @shoegazerstella ! The results look and sound great but we need to put the model to the test and see whether it generalizes well to new text.

During training of the synth, at every time step the Tacotron decoder is given the previous frame of the ground-truth mel spectrogram, and predicts the current frame using that info combined with the encoder output. When generating unseen speech, there is no ground truth spectrogram to rely upon, so the decoder has no choice but to use the previous predicted output. This may cause the synth to behave wildly for long or rarely seen input sequences. So testing is the only way to find out.
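
To make that training/inference asymmetry concrete, here is a toy sketch (the names and the decoder_step callable are illustrative, not the repo's API): with a ground-truth spectrogram the decoder is teacher-forced, without one it feeds back its own predictions.

    import torch

    def run_decoder(decoder_step, encoder_out, target_mels=None, max_frames=900, n_mels=80):
        frame = torch.zeros(1, n_mels)                 # <GO> frame
        outputs = []
        n_steps = target_mels.size(0) if target_mels is not None else max_frames
        for t in range(n_steps):
            frame = decoder_step(frame, encoder_out)   # predict the current frame
            outputs.append(frame)
            if target_mels is not None:
                frame = target_mels[t].unsqueeze(0)    # training: feed the ground truth back in
            # inference: keep the model's own prediction as the next input
        return torch.stack(outputs)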

Would you please upload the current model checkpoint (.pt file) along with a copy of your synthesizer/hparams.py?

Edit: Until a vocoder is trained at 22,050 Hz you will have to use Griffin-Lim for testing. It will sound like garbage if you connect it to the original pretrained vocoder (trained at 16,000 Hz).
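
As a rough example of what Griffin-Lim testing at 22,050 Hz could look like (a sketch only: the STFT parameters below are assumptions that must match the synthesizer hparams, and the repo's normalized dB mels would need to be converted back to power first):

    import librosa

    def mel_to_wav_griffin_lim(mel, sr=22050, n_fft=1024, hop=256, win=1024):
        # mel: power mel spectrogram of shape (n_mels, frames)
        return librosa.feature.inverse.mel_to_audio(
            mel, sr=sr, n_fft=n_fft, hop_length=hop, win_length=win, n_iter=60)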

ghost commented 4 years ago

I just started training a synth on VCTK using these hparams and it is training quickly.

mbdash commented 4 years ago

Maybe you want to wait for the new encoder I am training?

I should be done in a couple of days. I trained the new encoder on LibriSpeech + CommonVoice + VCTK for 315k steps (loss < 0.005) and am now adding VoxCeleb 1 & 2 to continue the training; the loss is currently <= 0.1 at step 344k.

shoegazerstella commented 4 years ago

Hi @blue-fish, here is the latest synth checkpoint + hparams.py. I'm OOO until August 31st, so I won't be able to test it or make adjustments for further training before then. Thank you!

ghost commented 4 years ago

Thank you @shoegazerstella ! Do you want feedback on the model now, or wait until August 31st?

For anyone else who would like to try the above synthesizer model: here is a synthesizer/hparams.py that is compatible with the latest changes to my 447_pytorch_synthesizer branch.

ghost commented 4 years ago

Please see #501 everyone. Although LibriTTS wavs are trimmed so that there is no leading or trailing silence, there are sometimes huge gaps in the middle of utterances, and we can remove them by preprocessing the wavs. This should help with the gaps we see when synthesizing.
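
To illustrate the kind of preprocessing meant here (a sketch, not the actual code in #501; top_db and max_gap_ms are arbitrary assumptions): detect the non-silent intervals and stitch them back together so that no internal pause exceeds a fixed length.

    import numpy as np
    import librosa

    def cap_internal_silences(wav, sr=22050, top_db=40, max_gap_ms=300):
        intervals = librosa.effects.split(wav, top_db=top_db)   # non-silent (start, end) pairs
        gap = np.zeros(int(sr * max_gap_ms / 1000), dtype=wav.dtype)
        pieces = []
        for start, end in intervals:
            pieces.append(wav[start:end])
            pieces.append(gap)                                   # cap every pause at max_gap_ms
        return np.concatenate(pieces[:-1]) if pieces else wav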

shoegazerstella commented 4 years ago

Hi @blue-fish, did you have time to test the model I sent? If not, I could do it, but I just wanted to understand whether there is a testing script (one that uses the usual test set and computes cumulative error metrics; if so, could you point me to it?) or whether I should test it on some random examples. Thanks!

ghost commented 4 years ago

Welcome back @shoegazerstella. I tried your model by loading it in the toolbox with random examples. Not surprisingly, it still has many of the same issues as the model I trained at 16,000 Hz. Would you please continue training the model using the schedule below?

You should also increase the batch size to fully utilize the memory of your (dual?) V100 GPUs. Start training and monitor the GPU memory utilization for a minute with watch -n 0.5 nvidia-smi. Keep adjusting until you are at 80-90% memory utilization.

        ### Tacotron Training
        tts_schedule = [(7,  1e-3,    20_000,  96),   # Progressive training schedule
                        (5,  3e-4,    50_000,  64),   # (r, lr, step, batch_size)
                        (2,  1e-4,   100_000,  32),   #
                        (2,  1e-5, 2_000_000,  32)],  # r = reduction factor (# of mel frames
                                                      #     synthesized for each decoder iteration)
                                                      # lr = learning rate
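
For anyone reading along, a schedule like this is consumed by picking the first row whose step target is still ahead of the current global step. A small sketch of that lookup (illustrative, not the repo's exact training loop):

    tts_schedule = [(7, 1e-3,    20_000, 96),
                    (5, 3e-4,    50_000, 64),
                    (2, 1e-4,   100_000, 32),
                    (2, 1e-5, 2_000_000, 32)]

    def current_settings(step):
        """Return (r, lr, batch_size) for the given global step."""
        for r, lr, max_step, batch_size in tts_schedule:
            if step < max_step:
                return r, lr, batch_size
        r, lr, _, batch_size = tts_schedule[-1]
        return r, lr, batch_size

    print(current_settings(60_000))   # -> (2, 0.0001, 32)
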
ghost commented 4 years ago

We are still working actively on this, but collaborating elsewhere. If you are interested in contributing time towards the development of better models please leave a message in #474 .

zhuochunli commented 2 years ago

Synth trained on LibriTTS for 200k steps with the old/original encoder.

https://drive.google.com/drive/folders/1ah6QNyB8jIcFuKusPOVdx0pPIZxeZeul?usp=sharing

Let me know if the link works or not and if any files are missing.

Hi @mbdash, did you train the synthesizer afterwards using your encoder trained for 1M steps? I find your encoder is really good, but this synthesizer is based only on the original encoder and is in TensorFlow form, so I can't use it in the PyTorch code now.