anonymous-pits / pits

PITS: Variational Pitch Inference for End-to-end Pitch-controllable TTS without External Pitch Predictor
https://anonymous-pits.github.io/pits/
MIT License

About training PITS at a 44100 Hz sampling rate #7

Closed: innnky closed this issue 1 year ago

innnky commented 1 year ago

May I ask how I can train PITS at a sampling rate of 44100 Hz? Following VISinger2's approach, I have already modified the values of sampling_rate, filter_length, hop_length, win_length, segment_size, ying_hop, upsample_rates, and upsample_kernel_sizes, along with the corresponding settings in the Avocodo discriminator, and I have correctly matched the lengths of the yingram and the spectrogram. With that, the training code runs, but the synthesized speech does not sound right. Is there anything else that needs to be modified?
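For reference, here is a hedged sketch of what such a 44.1 kHz configuration might look like, written as a Python dict. Every value below is an illustrative assumption, not the configuration actually used in this thread; the one hard constraint relied on is that, in HiFi-GAN-style decoders, the product of upsample_rates must equal hop_length:

import math

# All values are assumptions for illustration only.
config_44k = {
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,                      # doubled from the common 256 at 22050 Hz
    "win_length": 2048,
    "segment_size": 16384,                  # keep this a multiple of hop_length
    "ying_hop": 512,                        # keep the Yingram hop in step with the STFT hop
    "upsample_rates": [8, 8, 2, 2, 2],      # product must equal hop_length
    "upsample_kernel_sizes": [16, 16, 4, 4, 4],
}
assert math.prod(config_44k["upsample_rates"]) == config_44k["hop_length"]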

anonymous-pits commented 1 year ago

We didn't try training it at 44100 Hz, so I'm not sure about the details. However, I think there may be some alignment issues with the Yingram.

I recommend printing the Yingram calculated in train.py like this:

wav = torch.nn.functional.pad(
    audio_norm.unsqueeze(0),
    # pad 768 on the left; on the right, 768 plus whatever reaches a multiple of the hop (256 here)
    (768, 768 + (-audio_norm.shape[1]) % 256 + 256 * (audio_norm.shape[1] % 256 == 0)),
    mode='reflect').squeeze(0)
ying = pitch.yingram(wav)
[attached image: plot of the resulting Yingram]
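If you want to reproduce that plot, here is a minimal sketch using matplotlib (my own, not code from the repo; it assumes pitch.yingram returns a (batch, bins, frames) tensor, which you should verify against the codebase):

import matplotlib.pyplot as plt

y = ying[0].detach().cpu().numpy()  # assumed layout: (bins, frames)
plt.imshow(y, aspect='auto', origin='lower', interpolation='none')
plt.xlabel('frame')
plt.ylabel('Yingram bin')
plt.title('Yingram')
plt.show()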

After that, change the pad values until the Yingram is aligned with the spectrogram. We ran into the same issue in the early stage of the research and fixed it by changing the pad values. I don't think our pad calculation is scalable to other sampling rates at the moment.
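One hedged way to automate that pad search, as a sketch under my own assumptions (this is not code from the repo; it assumes pitch.yingram puts frames in its last dimension, that the target spectrogram has T // hop + 1 frames as with a center-padded STFT, and it simplifies to a symmetric pad, unlike the asymmetric pad above):

import torch

def find_pad(yingram_fn, audio, hop, max_pad=4096):
    # audio: (1, T) normalized waveform, e.g. audio_norm in train.py.
    # Try symmetric reflect pads until the Yingram frame count matches the
    # expected spectrogram frame count at this hop size.
    target = audio.shape[1] // hop + 1
    for pad in range(0, max_pad, max(hop // 4, 1)):
        wav = torch.nn.functional.pad(
            audio.unsqueeze(0), (pad, pad), mode='reflect').squeeze(0)
        if yingram_fn(wav).shape[-1] == target:
            return pad
    return None

# Example (hypothetical values): find_pad(pitch.yingram, audio_norm, hop=512) for a 44100 Hz setup.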

innnky commented 1 year ago

Thanks! I'll try it.