NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

How to fine-tune a new voice using a pretrained model #48

Open mathigatti opened 4 years ago

mathigatti commented 4 years ago

Hi, thanks for this amazing project! I wanted to ask a few short questions.

I want to train the model on a new voice. The dataset is similar to the LJ Speech Dataset: short audio clips of a single person (a man in this case) speaking English, each between 1 and 10 seconds long, about 6 hours in total. I plan to use the pretrained LibriTTS model you provide as a starting point.
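For context, I'm planning to warm-start roughly the way NVIDIA's Tacotron 2 repo does (my assumption, since I haven't confirmed the flags here): something like `python train.py --output_directory=outdir --log_directory=logdir -c mellotron_libritts.pt --warm_start`, where `--warm_start` loads the checkpoint weights while skipping the layers listed in `hparams.ignore_layers`.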

  1. I'm not sure how I should prepare my dataset. The third column of the transcriptions TXT file specifies the speaker ID: should I assign a new ID to my speaker, or reuse the ID of the most similar speaker in the LibriTTS dataset? (A format sketch follows this list.)
  2. Do you have any other suggestions related to the hparams?
  3. I'm using 8 V100 GPUs, so I'm able to use a batch size of about 24. Do you know how many iterations are usually needed to get decent results?
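For reference on point 1, the filelist lines are pipe-delimited, `audiopath|text|speaker_id`, with the speaker ID in the third column as noted above. A new speaker's filelist would then presumably look something like this (the paths and the ID 123 are made up for illustration):

```
/data/my_speaker/clip_0001.wav|Hello there, this is a test sentence.|123
/data/my_speaker/clip_0002.wav|Another short clip from the same speaker.|123
```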
pythagoras000 commented 4 years ago

Hi @rafaelvalle, can you please answer the questions in this issue? I'm having similar problems and can't achieve the quality reported in the paper.

rafaelvalle commented 4 years ago
  1. Either way should work.
  2. The defaults should be a good start. You might need to adjust F0_min to match the lowest F0 of your speaker; see the sketch below for one way to estimate it.
  3. Train until the validation loss stops decreasing.

Let us know if you have specific issues.
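For point 2, one way to estimate a speaker's F0 range is to run a pitch tracker over a sample of the training clips. Below is a minimal sketch using librosa's pYIN; librosa and the file paths are my own choices for illustration (Mellotron ships its own yin.py, which could be used instead):

```python
import glob

import librosa
import numpy as np

# Hypothetical location of the new speaker's clips; a sample is enough.
paths = glob.glob("/data/my_speaker/*.wav")[:50]

f0_values = []
for path in paths:
    y, sr = librosa.load(path, sr=None)  # keep the native sampling rate
    # Track pitch; the C2..C6 search range covers typical speech.
    f0, voiced_flag, voiced_prob = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    f0_values.append(f0[voiced_flag])  # keep only voiced frames

f0_values = np.concatenate(f0_values)
# Robust percentiles avoid octave errors at the extremes.
print("F0 5th/95th percentiles (Hz):", np.percentile(f0_values, [5, 95]))
```

F0_min could then be set slightly below the 5th percentile.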

pmakinde commented 2 years ago

@rafaelvalle, I trained a new speaker with 17 minutes of speech data. After 9k iterations it produced a good alignment, and I used the best-alignment checkpoint to test speaking style transfer. The style transfer worked well and the words spoken by the trained speaker are intelligible, but the voice is a little croaky. The original recordings of the speaker do not sound croaky.

Which of the training params in hparams.py can I tune to get rid of the croakiness, so the voice is as smooth as the trained speaker's?

Should I adjust F0_min up or down? Are there any other params worth adjusting?

I'm also wondering whether I should increase the voice data from 17 minutes to 25 minutes and re-train the speaker.
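If train.py here accepts a comma-separated hparams override string the way NVIDIA's Tacotron 2 does (an assumption; I haven't verified it for this repo), the F0 bounds could be tried without editing hparams.py. A sketch:

```python
# Hypothetical override of the pitch-tracking bounds; assumes hparams.py
# exposes create_hparams() taking a "name=value,..." string as in
# NVIDIA's Tacotron 2. The values are placeholders and should come from
# the measured F0 range of the speaker.
from hparams import create_hparams

hparams = create_hparams("f0_min=60,f0_max=400")
print(hparams.f0_min, hparams.f0_max)
```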