candlewill closed this issue 4 years ago
@candlewill Hi, do you have anything in progress? I also think this could be a problem when training FastPitch in the multi-speaker case, because the pitch values of the unvoiced parts will differ for every speaker.
That problem is still present.
I am studying FastPitch 1.1. During training this is not a problem, because we can derive the unvoiced mask from pitch_tgt; at inference, however, we have no ground-truth unvoiced mask, so we cannot know exactly which frames are unvoiced. In particular, I am changing the code to predict frame-level pitch so that the output can drive vocoders such as RefineGAN.
I used an inflexible method that may not be effective in all cases. Using the same pitch extraction method (pYIN), I collected statistics over all training-frame pitches and picked a threshold that best separates pitch values near pitch_mean from the unvoiced value 0.0; in my dataset, 0.1 Hz is enough for most speakers. At inference, a fake unvoiced mask is then built like this:
# assume `pitch_dense_pred` is the frame-level pitch predicted by my FastPitch
boundary = 0.1 / self.pitch_std[batch_speakers]
voiced_pitch_mask = torch.abs(pitch_dense_pred) > boundary
pitch_dense_pred = torch.where(voiced_pitch_mask, pitch_dense_pred, torch.zeros_like(pitch_dense_pred))
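A self-contained sketch of the same idea, using NumPy instead of PyTorch so it runs without the model; the speaker IDs, std values, and the 0.1 Hz threshold here are illustrative assumptions, not values from the repo:

```python
import numpy as np

def apply_fake_unvoiced_mask(pitch_dense_pred, pitch_std, batch_speakers,
                             threshold_hz=0.1):
    """Zero out predicted frames whose normalized pitch lies within
    threshold_hz (measured in original Hz units) of the speaker's mean."""
    # Per-utterance boundary in the normalized domain: threshold / speaker std.
    boundary = threshold_hz / pitch_std[batch_speakers]      # shape (B,)
    voiced = np.abs(pitch_dense_pred) > boundary[:, None]    # shape (B, T)
    return np.where(voiced, pitch_dense_pred, 0.0)

# Example: two utterances from two speakers with different pitch stds.
pitch_std = np.array([40.0, 25.0])
batch_speakers = np.array([0, 1])
pred = np.array([[0.001, -0.5, 0.002],
                 [0.003,  0.8, -0.001]])
masked = apply_fake_unvoiced_mask(pred, pitch_std, batch_speakers)
# Frames inside the boundary (0.0025 and 0.004 respectively) become 0.0.
```

Note that broadcasting the per-speaker boundary over the time axis handles the multi-speaker batch case that started this thread.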
Related to FastPitch/PyTorch
Describe the bug: Because pitch is mean-variance normalized, the non-zero (voiced) values are shifted to be zero-centered, while the pitch of unvoiced parts is zero. After normalization we therefore cannot distinguish unvoiced frames from voiced frames whose pitch is near the mean.
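The ambiguity can be demonstrated with a toy example (the pitch statistics and contour below are made up for illustration): after normalizing voiced frames and keeping unvoiced frames at 0, a voiced frame whose pitch equals the speaker mean also lands at 0 and is indistinguishable from silence.

```python
import numpy as np

# Hypothetical per-speaker statistics (illustrative, not from the repo).
pitch_mean, pitch_std = 180.0, 40.0

# Frame-level pitch in Hz; 0.0 marks truly unvoiced frames.
pitch_hz = np.array([0.0, 178.0, 180.0, 250.0, 0.0])

# Mean-variance normalize voiced frames, keep unvoiced frames at 0.
# The frame at exactly pitch_mean (180 Hz) also maps to 0 -- the ambiguity.
voiced = pitch_hz > 0
pitch_norm = np.where(voiced, (pitch_hz - pitch_mean) / pitch_std, 0.0)

# Threshold-based fake unvoiced mask at inference: anything within
# 0.1 Hz of the mean (0.1 / pitch_std in normalized units) is dropped.
boundary = 0.1 / pitch_std
voiced_mask = np.abs(pitch_norm) > boundary
pitch_out = np.where(voiced_mask, pitch_norm, 0.0)
```

Here the voiced 180 Hz frame is misclassified as unvoiced, which shows both why the threshold works most of the time and why it cannot be exact.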