We haven't tried training it at 44100 Hz, so I'm not sure about the details. However, I suspect there may be an alignment issue with the Yingram.
I recommend printing the Yingram computed in train.py like this:
wav = torch.nn.functional.pad(audio_norm.unsqueeze(0), (768, 768 + (-audio_norm.shape[1]) % 256 + 256 * (audio_norm.shape[1] % 256 == 0)), mode='reflect').squeeze(0)
ying = pitch.yingram(wav)
After that, change the pad value so the Yingram is aligned. We ran into the same issue in the early stage of the research and fixed it by changing the pad value. I don't think our pad calculation is scalable at the moment.
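For example, a more general version of that padding might look like the sketch below (`ying_window` and `hop_length` are assumed names for the hard-coded 768 and 256 above; check `pitch.yingram` for the actual analysis window at your sampling rate):

```python
import torch
import torch.nn.functional as F

def pad_for_yingram(audio_norm: torch.Tensor,
                    ying_window: int = 768,
                    hop_length: int = 256) -> torch.Tensor:
    """Reflect-pad a (1, T) waveform so the Yingram frames line up with the spec frames.

    This only re-expresses the hard-coded pad above with named parameters;
    the values needed for 44100 Hz depend on your hop_length and on how
    pitch.yingram defines its analysis window.
    """
    n = audio_norm.shape[1]
    # Right-side extra so the padded length is a whole number of hops;
    # like the original expression, an already-aligned length gets one extra hop.
    right_extra = (-n) % hop_length + hop_length * (n % hop_length == 0)
    wav = F.pad(audio_norm.unsqueeze(0),
                (ying_window, ying_window + right_extra),
                mode='reflect')
    return wav.squeeze(0)

# Sanity check against the spectrogram, e.g.:
# ying = pitch.yingram(pad_for_yingram(audio_norm, 768, 256))
# print(ying.shape, spec.shape)  # the frame counts should match
```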
Thanks! I'll try it.
May I ask how I can train PITS at a sampling rate of 44100 Hz? Following VISinger2's approach, I have already modified the values of sampling_rate, filter_length, hop_length, win_length, segment_size, ying_hop, upsample_rates, upsample_kernel_sizes, and the segment_size in the Avocodo discriminator, and I have correctly aligned the lengths of the yingram and the spec. With that, I was able to run the training code. However, the synthesized speech does not sound right. Is there anything else that needs to be modified?
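For reference, these are roughly the kinds of overrides I mean (a sketch only, assuming the usual 22.05 kHz VITS-style baseline of filter_length 1024, hop_length 256, and upsample_rates [8, 8, 2, 2]; the exact numbers depend on your config):

```python
# Illustrative 44.1 kHz overrides (assumptions, not the exact PITS defaults).
# The product of upsample_rates must equal hop_length, and segment_size
# should stay a multiple of hop_length.
config_44k = {
    "sampling_rate": 44100,
    "filter_length": 2048,
    "hop_length": 512,
    "win_length": 2048,
    "ying_hop": 512,
    "segment_size": 16384,                     # 32 hops of 512
    "upsample_rates": [8, 8, 4, 2],            # 8 * 8 * 4 * 2 = 512
    "upsample_kernel_sizes": [16, 16, 8, 4],
}
```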