training - Githubissues

Wow, I just saw this and it may be a bit late, sorry 😅

Training process is explained a bit in the report. Summary:

Model was trained end to end
Take two speech from one speaker: speech 1, speech 2
Input speech 1 audio to encoder, this outputs a latent vector z
Input z vector and speech 2 text to decoder, this outputs synthesized voice
This synthesized voice is optimized to be closer to speech 2 audio.

Encoder part is trained more than decoder part-I had a hyperparameter to tune the ratio-because training encoder is harder than training the decoder. Training took 3-4 days with a GTX 1070.

In hindsight, I would improve my result with these two steps without changing the model architecture:

I used VCTK dataset, which includes 44 hours of speaking with mostly British accent. There are other datasets such as Librispeech and Voxceleb that are much bigger(>1000 hours) and much more diverse.
I wouldn't start the training from scratch. I would first train the model with same speech as input and output, and empty text input to decoder. Kind of an autoencoder that's learning speech representation. After this is trained sufficiently, I would save the encoder, reset the decoder, and start the original training process. This way, encoder at least knows about speech patterns and can adapt itself to style understanding much faster, compared to starting from scratch.

There are recent very successful projects, check them out too: https://github.com/CorentinJ/Real-Time-Voice-Cloning

Epokhe / voice-mimicker

training #1