NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License

Pitch contour not being applied #42

Closed tebin closed 4 years ago

tebin commented 4 years ago

I trained the model on my own dataset, and strangely `mel_outputs_postnet` is always the same regardless of the pitch contour, with all other inputs and the seed fixed. That is, `mellotron.inference_noattention((text_encoded, mel, speaker_id, pitch_contour_A, rhythm))` equals `mellotron.inference_noattention((text_encoded, mel, speaker_id, pitch_contour_B, rhythm))` even though `pitch_contour_A != pitch_contour_B`. Everything else works flawlessly, so I wonder what could have gone wrong. I first thought the model wasn't able to extract pitch during training, so I retrained it after tweaking `harm_thresh`, but that doesn't seem to change anything.

UPDATE: With `prenet_f0_kernel_size=1`, `self.prenet_f0(f0s)` outputs tensors with negative values, which are then passed through ReLU and come out as all-zero tensors. I'm currently experimenting with `prenet_f0_kernel_size=3`; it does learn pitch, but the validation loss hovers around 0.5 and pronunciation quality degrades significantly.
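To illustrate why the output stops depending on the pitch contour, here is a minimal pure-Python sketch of the failure mode. It assumes the f0 prenet reduces to a single weight (one channel, kernel size 1, no bias) and that f0 values are non-negative; the function names and the numbers are hypothetical and not taken from the Mellotron code.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def prenet_f0_forward(weight, f0_frames):
    """Kernel-size-1, single-channel, bias-free convolution followed by ReLU."""
    return [relu(weight * f0) for f0 in f0_frames]


def weight_gradient(weight, f0_frames, upstream_grads):
    """d(loss)/d(weight): ReLU passes gradient only where its input is positive."""
    grad = 0.0
    for f0, g in zip(f0_frames, upstream_grads):
        if weight * f0 > 0.0:
            grad += g * f0
    return grad


# Two different (made-up) pitch contours in Hz; 0.0 marks unvoiced frames.
contour_a = [0.0, 110.0, 220.0, 0.0]
contour_b = [0.0, 330.0, 440.0, 0.0]
upstream = [1.0, 1.0, 1.0, 1.0]  # placeholder gradient arriving from the decoder

# Negative weight: both contours collapse to zeros and the weight gets no gradient.
dead_a = prenet_f0_forward(-0.3, contour_a)
dead_b = prenet_f0_forward(-0.3, contour_b)
dead_grad = weight_gradient(-0.3, contour_a, upstream)

# Positive weight: the contour survives the ReLU and the weight can be trained.
live_a = prenet_f0_forward(0.3, contour_a)
live_grad = weight_gradient(0.3, contour_a, upstream)

print(dead_a == dead_b, dead_grad, live_grad)
```

With a negative weight, every frame is clipped to zero, so two different contours produce identical prenet outputs and the weight receives no gradient to move it back into positive territory.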

UPDATE 2: Changing `ConvNorm`'s weight initialization from `xavier_uniform` to `xavier_normal` also seems to solve the issue. Once I'm done with the experiment I'll report the results and close the issue.

UPDATE 3: Changing the seed also works.

kannadaraj commented 4 years ago

@tebin did changing the seed help?

tebin commented 4 years ago

> @tebin did changing the seed help?

The key here is that when the initial value of prenetf0/conv/weight is negative, you're hosed: the prenet output is stuck at 0 and the weight never recovers no matter how long you train the model, because ReLU passes no gradient back to it. Empirically the weight tends to converge to 0.2-0.4, so you might want to just initialize it manually using `torch.nn.init.constant_(tensor, val)` (the in-place replacement for the deprecated `torch.nn.init.constant`).
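A rough sketch of that manual initialization, assuming PyTorch is installed. The `nn.Conv1d` layer below is a hypothetical stand-in for the f0 prenet (assumed: one input channel, one output channel, kernel size 1, no bias), and 0.3 is just a value picked from the 0.2-0.4 range mentioned above; in a real model you would apply the same call to the weight of the model's own f0 prenet convolution.

```python
import torch
from torch import nn

# Hypothetical stand-in for the f0 prenet (assumed shape: 1 -> 1 channels,
# kernel size 1, no bias). Not the actual Mellotron module.
prenet_f0 = nn.Conv1d(1, 1, kernel_size=1, bias=False)

# Overwrite the random initial weight with a positive constant so the ReLU
# input starts out positive for voiced frames.
nn.init.constant_(prenet_f0.weight, 0.3)

# Made-up f0 contour with shape (batch, channels, frames); 0.0 = unvoiced.
f0s = torch.tensor([[[0.0, 110.0, 220.0, 0.0]]])

out = torch.relu(prenet_f0(f0s))
out.sum().backward()

# A non-zero gradient means the weight can still be updated during training.
weight_grad = prenet_f0.weight.grad.item()
print(out, weight_grad)
```

Starting from a positive constant sidesteps the dependence on the random seed, since the sign of the initial weight no longer decides whether the layer can learn.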