@tebin did changing the seed help?
The key here is that if the initial value of `prenet_f0/conv/weight` is negative, you're hosed: ReLU clamps the prenet output to 0 and blocks the gradient, so the weight will never move no matter how long you train the model. Empirically the weight tends to converge to 0.2-0.4, so you might want to just initialize it manually using `torch.nn.init.constant(tensor, val)`.
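In code, that would be something like the sketch below (assuming the weight is exposed as `model.prenet_f0.conv.weight`, as in Mellotron's `model.py`; the value 0.3 is just an assumed pick from the middle of the empirical 0.2-0.4 range):

```python
import torch

# model: a loaded Mellotron/Tacotron2 instance.
# Force the f0 prenet conv weight to a positive constant so ReLU
# passes gradients from the very first step. 0.3 is an assumed value
# inside the 0.2-0.4 range the weight is observed to converge to.
torch.nn.init.constant_(model.prenet_f0.conv.weight, 0.3)
```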
I trained the model with my own dataset, and strangely `mel_outputs_postnet` is always the same regardless of the pitch contour, given that all other inputs and the seed are fixed, i.e. `mellotron.inference_noattention((text_encoded, mel, speaker_id, pitch_contour_A, rhythm)) == mellotron.inference_noattention((text_encoded, mel, speaker_id, pitch_contour_B, rhythm))` where `pitch_contour_A != pitch_contour_B`. Other than that everything works flawlessly, so it makes me wonder what could have gone wrong. I first thought the model wasn't able to extract pitch during training, so I retrained it after tweaking `harm_thresh`, but that doesn't seem to do anything.
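For reference, the check I'm describing looks roughly like this (the input tensors are prepared as usual for inference; the 1.5x scaling is just a placeholder way to get a different contour, and I'm assuming the outputs tuple is ordered as in Tacotron 2, with `mel_outputs_postnet` at index 1):

```python
import torch

torch.manual_seed(1234)  # fix the seed so only the pitch contour differs
out_A = mellotron.inference_noattention(
    (text_encoded, mel, speaker_id, pitch_contour_A, rhythm))

torch.manual_seed(1234)
pitch_contour_B = pitch_contour_A * 1.5  # hypothetical alternative contour
out_B = mellotron.inference_noattention(
    (text_encoded, mel, speaker_id, pitch_contour_B, rhythm))

# index 1 = mel_outputs_postnet; prints True when the bug is present
print(torch.equal(out_A[1], out_B[1]))
```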
UPDATE: So when `prenet_f0_kernel_size=1`, `self.prenet_f0(f0s)` outputs tensors with negative values, which are then passed to ReLU, resulting in all-zero tensors. I'm currently experimenting with `prenet_f0_kernel_size=3`, and while it does learn pitch, the validation loss hovers around 0.5 and there is a significant degradation in pronunciation quality.
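To make the failure mode concrete, here is a standalone sketch with a plain `torch.nn.Conv1d` standing in for the `kernel_size=1` `ConvNorm`: with a single negative weight, ReLU zeroes the output for any non-negative f0 contour, and the gradient through the flat part of ReLU is zero, so training can never recover.

```python
import torch
import torch.nn.functional as F

# 1-in, 1-out, kernel_size=1 conv: effectively one scalar weight.
conv = torch.nn.Conv1d(1, 1, kernel_size=1, bias=False)
with torch.no_grad():
    conv.weight.fill_(-0.1)  # an unlucky negative draw from xavier_uniform

f0s = torch.rand(1, 1, 100) * 300.0  # dummy non-negative pitch contour (Hz)
out = F.relu(conv(f0s))
print(out.abs().sum().item())  # 0.0 -> the f0 prenet outputs nothing

out.sum().backward()
print(conv.weight.grad)  # tensor([[[0.]]]) -> the weight is frozen forever
```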
UPDATE 2: Changing `ConvNorm`'s weight initialization from `xavier_uniform` to `xavier_normal` also seems to solve the issue. Once I'm done with the experiment I'll report the results and close the issue.

UPDATE 3: Changing the seed also works.
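For anyone hitting the same issue, the UPDATE 2 change corresponds to a one-line edit in `ConvNorm.__init__` in `layers.py`. A trimmed-down stand-in (the real class takes more constructor arguments) would look like:

```python
import torch

class ConvNorm(torch.nn.Module):
    """Trimmed-down stand-in for Mellotron's ConvNorm (layers.py)."""
    def __init__(self, in_channels, out_channels, kernel_size=1,
                 bias=True, w_init_gain='linear'):
        super().__init__()
        self.conv = torch.nn.Conv1d(in_channels, out_channels,
                                    kernel_size=kernel_size, bias=bias)
        # UPDATE 2: was torch.nn.init.xavier_uniform_(...)
        torch.nn.init.xavier_normal_(
            self.conv.weight,
            gain=torch.nn.init.calculate_gain(w_init_gain))

    def forward(self, signal):
        return self.conv(signal)
```

Note that with a single 1x1 weight, both `xavier_normal_` and a different seed just re-roll the random draw and can still come up negative, which is why the constant initialization suggested above is the more deterministic fix.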