anonymous-pits / pits

PITS: Variational Pitch Inference for End-to-end Pitch-controllable TTS without External Pitch Predictor
https://anonymous-pits.github.io/pits/
MIT License

Trying to use pitch predictor with different texts. #29

Open ljh0412 opened 10 months ago

ljh0412 commented 10 months ago

I know this is somewhat awkward, since the text encoder may produce latent variables that already contain pitch information, but I'm trying to use a reference linear spectrogram and the posterior yingram encoder to generate reference pitch latents.

Is there any way to use the yingram encoder together with the text encoder?

When I tried this, the latent vectors from the yingram encoder and those from the text encoder (without the ying parts) were, unsurprisingly, mismatched.

Should I provide conditional text embedding inputs for the yingram encoder?

I hope to get some guidance so I can build some nice applications.

junjun3518 commented 10 months ago

Hi! I also tried similar approaches. You can use an external yingram latent, but you need masking for ' ' (space or silence). Because silence has its own distinct latent distribution, you should replace the corresponding latents with the ones generated from the text input. If the duration-aligned input is "AAAA    BBB    CC" (each space spanning four frames), you should use the replace mask "00001111000111100".
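
The masking described above can be sketched roughly as follows. This is a minimal illustration, not code from this repository: `build_replace_mask` and `replace_silence_latents` are hypothetical helper names, latents are plain Python lists of per-frame values rather than tensors, and each space is assumed to span four frames, which is what the quoted mask implies.

```python
def build_replace_mask(aligned_text, silence=" "):
    """Return 1 for frames aligned to silence, 0 for all other frames."""
    return [1 if ch == silence else 0 for ch in aligned_text]


def replace_silence_latents(external_latents, text_latents, mask):
    """Keep external latents, but use text-derived latents where mask == 1.

    All three sequences are time-major and must have the same length.
    """
    if not (len(external_latents) == len(text_latents) == len(mask)):
        raise ValueError("latents and mask must have the same number of frames")
    return [t if m else e for e, t, m in zip(external_latents, text_latents, mask)]


# Duration-aligned input; the spaces are written out explicitly so they
# cannot be collapsed by whitespace normalization.
aligned = "AAAA" + " " * 4 + "BBB" + " " * 4 + "CC"
mask = build_replace_mask(aligned)
mask_str = "".join(str(m) for m in mask)

# Toy stand-ins for the two latent sequences (one label per frame).
external = ["ext"] * len(aligned)
from_text = ["txt"] * len(aligned)
mixed = replace_silence_latents(external, from_text, mask)
```

In a real model the same selection would be applied per frame to the latent tensors, with the mask broadcast over the channel dimension.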

ljh0412 commented 10 months ago

Thank you for your reply. May I ask a bit more about this topic? I don't fully understand what you mean by the external yingram latent and the masking for spaces, so let me explain my problem in more detail.

As far as I understand, the PITS decoder upsamples the concatenated latent representations of the text and the yingram. If I give it a spectrogram that does not correspond to the text, in order to extract some kind of pitch prosody from that spectrogram, a dimensionality mismatch can occur because the predicted durations differ from the length of the input spectrogram. I tried squeezing the latents to fit, but as you said, the spaces did matter: it produced a bunch of harmonics where silence should have been. (It was funny, though.)

Does "external yingram" refer to a case like the one I described above? And does the masking mean that I need to obtain the latent representation of silence for the yingram embeddings and use it to replace the parts of the yingram latent that are aligned with the spaces in the text?

junjun3518 commented 10 months ago

Yeah, because of the difference in total length, it could not generate clean speech. I think you understand what I mean: the space parts should be replaced by the yingram latent from the text. I tried tiling identical values for a flattened pitch and stretching the yingram latent by sinc interpolation, but that code is not included in this repository.
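
Since that code is not in the repository, the following is only one possible reading of the two ideas, written as a plain-Python sketch. `tile_frame` and `stretch_sinc` are hypothetical names, latents are assumed to be time-major lists of frames, and the stretching uses textbook Whittaker-Shannon (sinc) interpolation over the frame index, which may differ from what was actually tried.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0.0 else math.sin(math.pi * x) / (math.pi * x)


def tile_frame(frame, target_len):
    """Repeat one latent frame target_len times (a 'flattened' contour)."""
    return [list(frame) for _ in range(target_len)]


def stretch_sinc(latents, target_len):
    """Resample a time-major latent sequence to target_len frames."""
    n = len(latents)
    if n == 0 or target_len <= 0:
        raise ValueError("need at least one source frame and one target frame")
    if n == 1 or target_len == 1:
        positions = [0.0] * target_len
    else:
        # Map the target frames evenly onto the source index range [0, n - 1].
        positions = [m * (n - 1) / (target_len - 1) for m in range(target_len)]
    channels = len(latents[0])
    return [
        [sum(latents[k][c] * sinc(t - k) for k in range(n)) for c in range(channels)]
        for t in positions
    ]


# Toy latent with 3 frames and 2 channels, stretched to 5 frames.
src = [[0.0, 1.0], [1.0, 0.5], [0.0, -1.0]]
stretched = stretch_sinc(src, 5)
flat = tile_frame(src[1], 5)
```

Here the target length would be the total number of frames implied by the predicted durations, so the stretched latent lines up with the text-derived latent before any silence masking is applied.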

ljh0412 commented 10 months ago

Now I think my thoughts are organized, and it's time to start the experiments. Thank you for your kind replies.