NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Information about p_teacher_forcing hyperparameter #55

Closed · paarthneekhara closed this issue 4 years ago

paarthneekhara commented 4 years ago

I want to know what p_teacher_forcing was set to while training Mellotron. I am using the default value of 1.0 and I am not able to get a proper alignment/attention map even after 100k steps. I was wondering if a different value was used when training the LibriTTS model.

rafaelvalle commented 4 years ago

We used the default value of 1.0 as well.
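
For anyone landing here: p_teacher_forcing is the probability that, at each decoder step during training, the ground-truth mel frame from the previous step is fed to the decoder instead of the model's own prediction (1.0 means full teacher forcing). A minimal sketch of the idea, using a hypothetical helper rather than Mellotron's actual decoder code:

```python
import torch

def choose_decoder_input(prev_target_frame, prev_predicted_frame,
                         p_teacher_forcing=1.0):
    # With probability p_teacher_forcing, feed the ground-truth previous mel
    # frame (teacher forcing); otherwise feed the model's own last prediction.
    # Hypothetical helper illustrating the hyperparameter, not the repo's code.
    if torch.rand(1).item() < p_teacher_forcing:
        return prev_target_frame
    return prev_predicted_frame
```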

paarthneekhara commented 4 years ago

Thanks @rafaelvalle! After around 50k training steps, the alignment map looks like this. Note that I made one change to the implementation: I removed the conditioning on pitch contours (f0s).

[Screenshot: attention alignment map after ~50k training steps]

Is this normal? Do you recall, by any chance, how long it takes for the alignment map to become a clean diagonal when training Mellotron on LibriTTS? In case it helps, I am also tracking a rough diagonality score for the attention map, as sketched below.
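
A quick way to quantify whether attention is trending diagonal, using a hypothetical helper that is not part of this repo:

```python
import torch

def alignment_diagonality(attn):
    # attn: [decoder_steps, encoder_steps], rows summing to 1 (attention weights).
    # Returns a rough score; closer to 1.0 means closer to a clean diagonal.
    # Hypothetical debugging helper, not part of the Mellotron codebase.
    t_dec, t_enc = attn.shape
    positions = torch.arange(t_enc, dtype=attn.dtype)
    # Expected encoder position attended to at each decoder step.
    expected = (attn * positions).sum(dim=1)
    # A perfect diagonal maps decoder step t to encoder step t * t_enc / t_dec.
    ideal = torch.arange(t_dec, dtype=attn.dtype) * (t_enc / t_dec)
    return 1.0 - ((expected - ideal).abs().mean() / t_enc).item()
```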

paarthneekhara commented 4 years ago

Ah, just noticed this in the paper; it makes sense, so I am closing this issue. "In our setup, we find it easier to first learn attention alignments on speakers with large amounts of data and then finetune to speakers with less data. Thus, we first train Mellotron on LJS and Sally and finetune it with a new speaker embedding on LibriTTS, starting with a learning rate of 5e-4 and annealing the learning rate as the loss starts to plateau."
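
For anyone reproducing the finetuning recipe quoted above, it roughly maps to the following PyTorch setup. Only the 5e-4 starting learning rate and plateau-based annealing come from the paper; the model stand-in, factor, and patience below are guesses:

```python
import torch

# Hypothetical stand-in for a Mellotron model loaded with LJS/Sally weights.
model = torch.nn.Linear(80, 80)

# Start finetuning at lr=5e-4 and anneal as the loss plateaus, per the paper.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=10)

# After each epoch, report the loss so the scheduler can detect a plateau:
# scheduler.step(epoch_loss)
```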