DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

normalization pitch and energy #51

Closed. WhiteFu closed this issue 1 year ago.

WhiteFu commented 2 years ago

Hi Florian. I just read your paper on EXACT PROSODY CLONING IN ZERO-SHOT MULTISPEAKER TEXT-TO-SPEECH (https://arxiv.org/pdf/2206.12229.pdf). Great job. Very interesting contributions, thanks for sharing this.

I was curious about the way you normalize pitch and energy. In the paper, it is mentioned: "a way of normalizing pitch and energy that allows for the overwriting procedure to be compatible with a zero-shot multispeaker setting by regaining the value ranges for each speaker through an utterance level speaker embedding" and "we normalize those values by dividing them by the average of the sequence that they occur in."

Does it mean 1) that the utterance-level speaker embedding is used as an input to the variance adaptor? 2) that phone-level pitch/energy features are used? Thank you in advance. Best!

Flux9665 commented 2 years ago

Hi! It's a very simple solution: We calculate the average of all frames in an utterance that are not 0

https://github.com/DigitalPhonetics/IMS-Toucan/blob/ee3b7988b15f871414829d30f24b5cc2a9d097b3/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/PitchCalculator.py#L58

and then we divide the entire sequence by this average

https://github.com/DigitalPhonetics/IMS-Toucan/blob/ee3b7988b15f871414829d30f24b5cc2a9d097b3/TrainingInterfaces/Text_to_Spectrogram/FastSpeech2/PitchCalculator.py#L59

This is done for both pitch and energy. This way we get float values centered around 1 that maintain information about the pitch curve and its variance. Those values will always be around 1, regardless of the fundamental frequency of the speaker's voice. So we transform the absolute values into relative values, which can be transferred to other speakers. But to produce the speech, the model has to somehow reconstruct the absolute values, for the target speaker rather than the source speaker, and we don't know those values for the target speaker. Luckily, the model can learn to do this internally, as long as it has the speaker embedding available as a conditioning signal. This is the case anyway, because we train the TTS as a multispeaker model, so the speaker embedding is combined with the encoder output.
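As a minimal sketch of the normalization idea (not the repository's code; the function name and example values are illustrative, assuming NumPy):

```python
import numpy as np

def normalize_contour(values):
    """Turn a pitch or energy contour from absolute into relative values.

    Hypothetical re-implementation of the idea described above: divide every
    value by the mean of the non-zero (voiced / non-silent) frames, so the
    resulting curve is centered around 1 regardless of the speaker's range.
    """
    values = np.asarray(values, dtype=np.float32)
    nonzero = values[values != 0]
    if len(nonzero) == 0:
        return values  # nothing to normalize (e.g. a fully unvoiced segment)
    return values / nonzero.mean()

# Two speakers with very different absolute pitch yield nearly identical
# relative contours, which is what makes the curves transferable.
low_voice  = normalize_contour([110, 120, 0, 130, 115])
high_voice = normalize_contour([220, 240, 0, 260, 230])
```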

Answers directly to your questions:

  1. The speaker embedding is combined with the encoder output, which then becomes the input to the variance adaptor
  2. Yes, we calculate one average value per phone (as in FastPitch), and then we normalize all of those values to be relative instead of absolute, so we can transfer curves between speakers (see the sketch below).
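
A minimal sketch of that phone-level averaging followed by the same normalization (again hypothetical names, not the repository's code; it assumes frame-level values and per-phone durations given in frames):

```python
import numpy as np

def frames_to_relative_phone_values(frame_values, durations):
    """Average a frame-level contour into one value per phone (FastPitch-style),
    then divide by the mean of the non-zero phone values to make them relative."""
    phone_values = []
    start = 0
    for dur in durations:
        segment = np.asarray(frame_values[start:start + dur], dtype=np.float32)
        voiced = segment[segment != 0]
        phone_values.append(voiced.mean() if len(voiced) else 0.0)
        start += dur
    phone_values = np.asarray(phone_values, dtype=np.float32)
    nonzero = phone_values[phone_values != 0]
    return phone_values / nonzero.mean() if len(nonzero) else phone_values

# Example: 3 phones spanning 2, 3 and 2 frames respectively.
relative = frames_to_relative_phone_values([110, 120, 0, 130, 115, 125, 0], [2, 3, 2])
```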
WhiteFu commented 2 years ago

Thank you for your patient reply, very interesting work! Can we synthesize audio with the same prosody but different text?

Flux9665 commented 2 years ago

This technique is meant for cloning exactly, unit by unit. To transfer the general prosody to a different text, we use a different technique. There we also use speaker embeddings, however not speaker embeddings trained as a speaker verification model (which is usually the case). We use speaker embeddings that are trained with the GST approach, plus the modifications that AdaSpeech 4 proposes to the GST setup. By deriving an embedding from a reference audio with the desired prosody, we can control the general prosody, however no longer independently of the speaker.
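
For intuition, here is a heavily simplified sketch of a GST-style reference encoder in PyTorch. It is not the IMS-Toucan code and does not include the AdaSpeech 4 modifications; the class name, layer sizes and token count are illustrative assumptions.

```python
import torch

class GSTStyleEncoder(torch.nn.Module):
    """Minimal Global-Style-Token style encoder sketch (hypothetical, simplified)."""

    def __init__(self, mel_dim=80, ref_dim=128, num_tokens=10, token_dim=128, num_heads=4):
        super().__init__()
        # Reference encoder: summarize a mel spectrogram into a single vector.
        self.ref_rnn = torch.nn.GRU(mel_dim, ref_dim, batch_first=True)
        self.query_proj = torch.nn.Linear(ref_dim, token_dim)
        # Learnable style tokens that the reference embedding attends over.
        self.tokens = torch.nn.Parameter(torch.randn(num_tokens, token_dim))
        self.attention = torch.nn.MultiheadAttention(
            embed_dim=token_dim, num_heads=num_heads, batch_first=True)

    def forward(self, mel):  # mel: (batch, frames, mel_dim)
        _, hidden = self.ref_rnn(mel)                        # (1, batch, ref_dim)
        query = self.query_proj(hidden.transpose(0, 1))      # (batch, 1, token_dim)
        keys = self.tokens.unsqueeze(0).expand(mel.size(0), -1, -1)
        style, _ = self.attention(query, keys, keys)         # weighted sum of style tokens
        return style.squeeze(1)                              # (batch, token_dim) style embedding

if __name__ == "__main__":
    encoder = GSTStyleEncoder()
    # One reference utterance of 200 frames with 80-dim mels.
    style = encoder(torch.randn(1, 200, 80))
    print(style.shape)  # torch.Size([1, 128])
```

The resulting style embedding is what would condition the TTS on the reference audio's general prosody, rather than copying it unit by unit.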