anonymous-pits / pits

PITS: Variational Pitch Inference for End-to-end Pitch-controllable TTS without External Pitch Predictor
MIT License

About training PITS at a 44100 Hz sampling rate #7

Closed by innnky 1 year ago

innnky commented 1 year ago

How can I train PITS at a sampling rate of 44100 Hz? Following Visinger2's approach, I have already modified sampling_rate, filter_length, hop_length, win_length, segment_size, ying_hop, upsample_rates, and upsample_kernel_sizes, as well as segment_size in the Avocado discriminator, and I have made the Yingram and spec lengths match. With those changes the training code runs, but the synthesized speech does not sound right. Is there anything else that needs to be modified?

anonymous-pits commented 1 year ago

We did not have 44100 Hz training in mind, so I'm not sure about the details. However, I think there may be an alignment issue with the Yingram.

I recommend printing the Yingram, calculated like this:

wav = torch.nn.functional.pad(audio_norm.unsqueeze(0), (768, 768 + (-audio_norm.shape[1]) % 256 + 256 * (audio_norm.shape[1] % 256 == 0)), mode='reflect').squeeze(0)
ying = pitch.yingram(wav)

After that, change the pad values until the Yingram is aligned. We ran into the same issue in the early stage of the research and fixed it by changing the pad values. I think our pad calculation does not scale to other settings as it is now.
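As a starting point for experimenting with the pad values, here is a minimal sketch that pulls the pad arithmetic from the snippet above into a helper. The function name `yingram_pad` and the parameters `hop` and `side` are hypothetical and not part of the PITS code. The defaults reproduce the hard-coded 256 and 768. How `side` should change at 44100 Hz is not stated in this thread and has to be found by checking the alignment.

```python
from typing import Tuple


def yingram_pad(num_samples: int, hop: int = 256, side: int = 768) -> Tuple[int, int]:
    """Return (left, right) pad amounts mirroring the expression in this thread.

    `hop` and `side` are hypothetical parameters; the defaults are the
    constants hard-coded in the original snippet.
    """
    # Extra right padding brings the length up to the next multiple of `hop`.
    # If the length is already a multiple, a full extra hop is added,
    # exactly as the original expression does.
    extra = (-num_samples) % hop + hop * (num_samples % hop == 0)
    return side, side + extra


if __name__ == "__main__":
    # Try a candidate setting (assumed values, to be tuned by inspection).
    print(yingram_pad(44100, hop=512, side=1536))
```

The returned pair can be passed as the pad argument of torch.nn.functional.pad in place of the hard-coded tuple, after which the printed Yingram length can be compared against the spec length.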

innnky commented 1 year ago

Thanks! I'll try it.