Closed: WhiteFu closed this issue 1 year ago
Hi! It's a very simple solution: we compute the average of all frames in an utterance that are not zero, and then we divide the entire sequence by this average.

This is done for both pitch and energy. This way we get float values centered around 1 that retain the information about the shape of the pitch curve and its variance. Those values will always be around 1, regardless of the fundamental frequency of the speaker's voice. So we transform the absolute values into relative values, which can be transferred to other speakers.

To actually produce the speech, however, the model has to somehow reconstruct the absolute values, but for the target speaker, not the source speaker. And we don't know those values for the target speaker. Luckily, the model can learn to do this internally, as long as it has the speaker embedding available as a conditioning signal. This is the case anyway, because we train the TTS as a multispeaker model, so the speaker embedding is combined with the encoder output.
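Here is a minimal NumPy sketch of that normalization step (the function name and example values are my own for illustration, not taken from the paper's codebase):

```python
import numpy as np

def normalize_curve(values: np.ndarray) -> np.ndarray:
    """Divide a pitch or energy sequence by the mean of its non-zero frames,
    so the result is centered around 1 regardless of the speaker's range.
    Zero frames (e.g. unvoiced regions for pitch) stay zero after division."""
    nonzero = values[values != 0]
    if nonzero.size == 0:
        return values  # nothing voiced to normalize against
    return values / nonzero.mean()

# Example: two speakers with very different absolute pitch ranges
low_voice = np.array([0.0, 110.0, 120.0, 0.0, 130.0])   # Hz
high_voice = np.array([0.0, 220.0, 240.0, 0.0, 260.0])  # Hz
print(normalize_curve(low_voice))   # [0.  0.917  1.  0.  1.083] (approx.)
print(normalize_curve(high_voice))  # identical relative curve
```

Note how both speakers map to the same relative curve, which is exactly what makes the values transferable across speakers.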
To answer your questions directly:
Thank you for your patient reply, very interesting work! Can we synthesize audio with the same prosody but different text?
This technique is meant for cloning exactly, unit by unit. To transfer the general prosody to a different text, we use a different technique: there we also use speaker embeddings, however not speaker embeddings trained as a speaker verification model (which is usually the case). We use speaker embeddings that are trained with the GST approach, together with the modifications that AdaSpeech 4 proposes to the GST setup. By deriving such an embedding from a reference audio with the desired prosody, we can control the general prosody, however no longer independently of the speaker. (See the sketch below.)
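For readers unfamiliar with GST-style conditioning, here is a rough PyTorch sketch of the idea; all dimensions, module names, and the way the style embedding is combined with the encoder output are illustrative assumptions, not the actual implementation from the paper or AdaSpeech 4:

```python
import torch
import torch.nn as nn

class StyleTokenLayer(nn.Module):
    """GST-style layer: attend over a bank of learned style tokens, using a
    reference-utterance embedding as the query, to get a prosody embedding."""
    def __init__(self, ref_dim=128, token_num=10, token_dim=256, heads=4):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(token_num, token_dim))
        self.attention = nn.MultiheadAttention(
            embed_dim=token_dim, num_heads=heads,
            kdim=token_dim, vdim=token_dim, batch_first=True)
        self.query_proj = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_embedding):  # (batch, ref_dim)
        query = self.query_proj(ref_embedding).unsqueeze(1)  # (batch, 1, token_dim)
        keys = torch.tanh(self.tokens).unsqueeze(0).expand(
            ref_embedding.size(0), -1, -1)                   # (batch, token_num, token_dim)
        style, _ = self.attention(query, keys, keys)
        return style.squeeze(1)                              # (batch, token_dim)

# Conditioning: the style embedding derived from a reference utterance with
# the desired prosody is combined with the text encoder output, here simply
# by broadcasting it over every encoder timestep.
gst = StyleTokenLayer()
ref = torch.randn(2, 128)               # utterance-level reference embedding
encoder_out = torch.randn(2, 50, 256)   # (batch, phones, hidden)
style = gst(ref)                        # (batch, 256)
conditioned = encoder_out + style.unsqueeze(1)  # broadcast over time
```

The key difference from a verification-style speaker embedding is that the style tokens are trained jointly with the TTS, so the embedding captures prosodic style rather than just speaker identity.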
Hi Florian. I just read your paper "Exact Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech" (https://arxiv.org/pdf/2206.12229.pdf). Great job, very interesting contributions, thanks for sharing this.
I was curious about the way you normalize pitch and energy. In the paper, it is mentioned: "a way of normalizing pitch and energy that allows for the overwriting procedure to be compatible with a zero-shot multispeaker setting by regaining the value ranges for each speaker through an utterance level speaker embedding" and "we normalize those values by dividing them by the average of the sequence that they occur in."
Does it mean 1) that the utterance-level speaker embedding is used as input to the variance adaptor, and 2) that phone-level pitch/energy features are used? Thank you in advance. Best!