DigitalPhonetics / IMS-Toucan

Multilingual and Controllable Text-to-Speech Toolkit of the Speech and Language Technologies Group at the University of Stuttgart.
Apache License 2.0

Multilingual ZS-Multispeaker speaker embedding injection #48

Closed · alexdemartos closed this 1 year ago

alexdemartos commented 1 year ago

Hi Florian. I just read your paper on low-resource multilingual zero-shot multi-speaker TTS (https://arxiv.org/pdf/2210.12223.pdf). Great job. Very interesting contributions; thanks for sharing this.

I was curious about the way you integrate speaker embeddings into the encoder hidden state. The paper mentions: "An important trick we found is to add layer normalization right after the embedding is injected into the hidden state." Does this mean you saw improvements in zero-shot adaptation or in the resulting audio quality by applying this layer norm after injecting the speaker embeddings? Also, is the layer norm applied right after concatenating the speaker embeddings to the encoder hs, or right after spk_emb+hs is projected down to hs size?

Thank you in advance. Best!

Flux9665 commented 1 year ago

Hi Álex! I tried out a lot of different things when it comes to the embedding integration. In the beginning, we tried projecting the embedding to the size of the hidden state and then simply adding the two together. Then we tried the conditional layer norm that AdaSpeech proposes and still uses in its newest version (https://arxiv.org/abs/2204.00436). Finally, we tried the concatenation and projection back down to the hidden size that you mention, which is what we are using right now.

We find that the layer norm helps with generalization for both the adding method and the concatenation+projection method. In the conditional layer norm it is already sort of built in, so there is no need to add it there. And yes, as you say, generalization in this case means the speaker similarity when doing zero-shot adaptation; there is not much impact on the audio quality.
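
For illustration, here is a minimal PyTorch sketch of the two non-conditional variants (project-and-add vs. concatenate-and-project, each followed by layer norm). The module names and dimensions are made up for the example and are not taken from the IMS-Toucan codebase:

```python
import torch
import torch.nn as nn


class ProjectAddIntegration(nn.Module):
    """Project the speaker embedding to the hidden size, add it to every
    encoder frame, then apply layer norm."""

    def __init__(self, hidden_dim=384, speaker_dim=64):
        super().__init__()
        self.project = nn.Linear(speaker_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, encoder_states, speaker_embedding):
        # encoder_states: [batch, time, hidden_dim]
        # speaker_embedding: [batch, speaker_dim]
        projected = self.project(speaker_embedding).unsqueeze(1)
        return self.norm(encoder_states + projected)


class ConcatProjectIntegration(nn.Module):
    """Concatenate the speaker embedding to every encoder frame, project
    back down to the hidden size, then apply layer norm."""

    def __init__(self, hidden_dim=384, speaker_dim=64):
        super().__init__()
        self.project = nn.Linear(hidden_dim + speaker_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)  # the layer norm "trick"

    def forward(self, encoder_states, speaker_embedding):
        # repeat the utterance-level embedding along the time axis
        expanded = speaker_embedding.unsqueeze(1).expand(
            -1, encoder_states.size(1), -1)
        combined = torch.cat([encoder_states, expanded], dim=-1)
        return self.norm(self.project(combined))
```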

We apply the layer norm after the projection to hs size, so it's the last thing that happens in the encoder. I still want to give the conditional layer norm another try, though. The demo samples of AdaSpeech 4 sound incredible, and that's what they are using. So far, however, I got the best results with the concat+project method, both in terms of zero-shot speaker similarity and audio quality. Adding converged a bit faster, but the final quality was not as good.
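
For completeness, here is a rough sketch of the conditional layer norm idea from AdaSpeech, where the affine parameters of the normalization are predicted from the speaker embedding. Again, this is an illustrative reconstruction of the concept, not the paper's or Toucan's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionalLayerNorm(nn.Module):
    """Layer norm whose scale and bias are predicted from the speaker
    embedding instead of being learned constants."""

    def __init__(self, hidden_dim=384, speaker_dim=64):
        super().__init__()
        self.to_scale = nn.Linear(speaker_dim, hidden_dim)
        self.to_bias = nn.Linear(speaker_dim, hidden_dim)

    def forward(self, encoder_states, speaker_embedding):
        # normalize over the feature dimension without affine parameters,
        # then rescale and shift with speaker-dependent values
        normalized = F.layer_norm(encoder_states, encoder_states.shape[-1:])
        scale = self.to_scale(speaker_embedding).unsqueeze(1)
        bias = self.to_bias(speaker_embedding).unsqueeze(1)
        return scale * normalized + bias
```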

alexdemartos commented 1 year ago

Thanks for your detailed response. I'll give it a try. I also tried conditional layer normalization as in AdaSpeech 4, with very inconsistent results, but maybe there was something wrong with my implementation.