The synthesized speech sounds strange.

anonymous-pits / pits

PITS: Variational Pitch Inference for End-to-end Pitch-controllable TTS without External Pitch Predictor

https://anonymous-pits.github.io/pits/

MIT License

The synthesized speech sounds strange. #23

Open howitry opened 1 year ago

howitry commented 1 year ago

The PITS demo sounds very good. I only changed the sample rate to 16 kHz, but the fundamental frequency contour of the synthesized speech differs considerably from that of the training corpus, even for sentences from the training set, and the rhythm of the synthesized speech sounds unnatural. I suspect the yingram parameters are the problem, but I don't know which ones need to be adjusted. Could you give some advice?

anonymous-pits commented 1 year ago

Hi howitry! Please follow this issue for your setup. I checked the alignment only at 22050 Hz, and it does not hold at other sampling rates. You should change the pad values for the yingram so that it aligns with the spectrogram.

howitry commented 1 year ago

In that issue, why does the calculation of o_pad use 1024 instead of 768? In addition, I think that padding the wav, computing the yingram, and then slicing the yingram will always give a result that differs from slicing the wav, padding wav_slice, and then computing the yingram. How small should this error be kept?
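To make the question concrete, here is a minimal sketch of the two orders of operations. It is not the PITS code: it compares raw analysis windows instead of yingram values, uses zero padding, and the helper names (`frames`, `edge_mismatches`) and the sizes are hypothetical. Any feature computed from each window on its own would differ in the same frames.

```python
import random


def frames(signal, pad, win, hop):
    """Zero-pad `signal` by `pad` samples per side and cut it into windows."""
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    count = (len(padded) - win) // hop + 1
    return [padded[i * hop : i * hop + win] for i in range(count)]


def edge_mismatches(wav, start, seg_len, pad, win, hop):
    """Indices of slice frames that differ between the two orders."""
    # Order A: pad and frame the whole utterance, then slice the frames.
    full = frames(wav, pad, win, hop)
    first = start // hop  # assumes `start` is a multiple of `hop`
    # Order B: slice the waveform first, then pad and frame the slice.
    seg = frames(wav[start : start + seg_len], pad, win, hop)
    sliced = full[first : first + len(seg)]
    return [j for j, (a, b) in enumerate(zip(sliced, seg)) if a != b]


random.seed(0)
wav = [random.uniform(-1.0, 1.0) for _ in range(8192)]
# Hypothetical sizes: 1024-sample window, hop 256, pad 384 per side.
bad = edge_mismatches(wav, start=2048, seg_len=2048, pad=384, win=1024, hop=256)
print(bad)
```

In this sketch the interior frames are identical in both orders. Only the frames whose window reaches past the slice boundary differ, because order A sees real neighbouring samples there while order B sees padding.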

anonymous-pits commented 1 year ago

Padding the wav is necessary to make the spectrogram and the yingram the same size. Since the spectrogram function adds padding internally, I added the same padding to the yingram.

howitry commented 1 year ago

OK. According to mel_processing.py, the length of the spectrogram is L//hop_size (where L is the length of the speech). However, with the pad value from https://github.com/anonymous-pits/pits/issues/7, pad value=[768,1024+ (-y.shape[-1])%256 + 256*(y.shape[1]%256==0)], used regardless of the sample rate, there seems to be no guarantee that the length of ying_hat is L//hop_size?
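One way to check this for a given setup is to count frames directly. The sketch below is not taken from the repository. It assumes the spectrogram pads (n_fft - hop) // 2 samples on each side without centering, as in VITS-style mel_processing.py, and that the yingram slides a fixed window of `win` samples with stride `hop`. The function names and the window size 1536 are hypothetical placeholders, so substitute the values from your own config.

```python
def num_frames(length, pad_left, pad_right, win, hop):
    """Frames from sliding a `win`-sample window with stride `hop`
    over a signal of `length` samples after padding (no centering)."""
    padded = length + pad_left + pad_right
    return 0 if padded < win else (padded - win) // hop + 1


def spectrogram_frames(length, n_fft=1024, hop=256):
    # Assumption: (n_fft - hop) // 2 padding per side, center=False.
    pad = (n_fft - hop) // 2
    return num_frames(length, pad, pad, n_fft, hop)


def right_pad_for(length, pad_left, win, hop):
    """Right pad that gives exactly length // hop frames for this window."""
    target = length // hop
    if target < 1:
        raise ValueError("signal is shorter than one hop")
    needed = (target - 1) * hop + win  # shortest padded length with `target` frames
    right = needed - length - pad_left
    if right < 0:
        raise ValueError("pad_left is too large for this window; reduce it")
    return right


# Example with a hypothetical 16 kHz setup: n_fft=800, hop=200.
for length in (16000, 16001, 16199):
    print(length, spectrogram_frames(length, n_fft=800, hop=200), length // 200)
```

Because the constants 768, 1024 and 256 in the formula above are tied to one window and hop size, recomputing the right pad from `win` and `hop` as in `right_pad_for` is one way to keep the two lengths equal after changing the sample rate or hop size.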