DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

Clarification on Multispeaker Multilanguage Branch version #82

ssolito closed 1 year ago

ssolito commented 1 year ago

Hello,

I would like to confirm that I understand the following points correctly:

- For pitch estimation, you use Parselmouth.
- For duration estimation, you use DeepForcedAligner.
- For energy estimation, what do you use? The FastSpeech 2 paper states: "L2-norm of the amplitude of each short-time Fourier transform (STFT) frame is regarded as the energy and then the quantized energy is mapped to the embedding"

I would also like to know whether you use CWT and iCWT for the pitch predictor.

I apologize for these questions, but I am not an expert and I have had difficulty understanding how the architecture works. (I tried to work it out by myself for quite a few days before asking you, I swear :)

Thank you,
Sarah

Flux9665 commented 1 year ago

Hi Sarah,

Sorry for taking so long to respond; I was sick for a long time. All correct: we use Parselmouth to extract the true pitch curve. For the durations, we made our own version of the DeepForcedAligner, but it's pretty similar to the original. To extract the ground-truth energy, we use the implementation from ESPnet, which I believe is true to the FastSpeech 2 paper.
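For readers following along, here is a minimal sketch of the energy definition quoted from the FastSpeech 2 paper: the L2 norm of the STFT magnitude of each frame, followed by a simple quantization step. It is not the ESPnet code. The function names, the Hann window, and the `n_fft`, `hop_length`, and `n_bins` values are assumptions chosen for illustration.

```python
import numpy as np


def frame_energy(wave, n_fft=1024, hop_length=256):
    """Per-frame energy: L2 norm of the STFT magnitude of each frame.

    n_fft and hop_length are illustrative defaults, not values from the thread.
    """
    wave = np.asarray(wave, dtype=float)
    # Pad short signals so that at least one full frame exists.
    if len(wave) < n_fft:
        wave = np.pad(wave, (0, n_fft - len(wave)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop_length
    energies = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = wave[start:start + n_fft] * window
        magnitude = np.abs(np.fft.rfft(frame))   # amplitude spectrum of this frame
        energies[i] = np.linalg.norm(magnitude)  # L2 norm over frequency bins
    return energies


def quantize_energy(energies, n_bins=256):
    """Map each energy value to one of n_bins uniformly spaced bins (0 .. n_bins - 1)."""
    energies = np.asarray(energies, dtype=float)
    # Interior bin edges only, so the minimum lands in bin 0 and the maximum in the last bin.
    edges = np.linspace(energies.min(), energies.max(), n_bins + 1)[1:-1]
    return np.digitize(energies, edges)
```

The bin indices returned by `quantize_energy` are what an embedding table would be indexed with in the scheme the paper describes. Uniform bins over the observed range are one possible choice here, not something stated in the thread.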

We don't use CWT and iCWT; we directly predict the curve. That's a good point though, it would be interesting to try this out. I'm not really sure why they did it in the original FastSpeech 2 architecture, since predicting the curve directly works fine. The only problem is mode collapse: all utterances sound pretty much the same, with very little variance.

The basic version of the architecture in this toolkit is just FastSpeech 2, except that the encoder and decoder are built with the Conformer architecture and we represent our inputs as meaningful feature vectors rather than a lookup of an identity. We also added the postnet from the Tacotron 2 paper to the end of the pipeline, just like in the ESPnet implementation. However, we will replace this with the normalizing-flow-based postnet from PortaSpeech in the next version of the toolkit.
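To illustrate the difference between an identity lookup and feature-vector inputs, here is a small sketch. The feature names, the phone inventory, and the dimensions are all hypothetical and are not the toolkit's actual feature set. The point is only that with feature vectors, related phones share parameters through a common projection, whereas a lookup table gives every phone an unrelated row.

```python
import numpy as np

# Hypothetical articulatory features, for illustration only.
FEATURES = ["voiced", "nasal", "bilabial", "alveolar", "plosive", "vowel"]
PHONE_FEATURES = {
    "p": [0, 0, 1, 0, 1, 0],
    "b": [1, 0, 1, 0, 1, 0],
    "m": [1, 1, 1, 0, 0, 0],
    "t": [0, 0, 0, 1, 1, 0],
    "a": [1, 0, 0, 0, 0, 1],
}

rng = np.random.default_rng(0)
hidden_dim = 8  # illustrative size

# (a) Identity lookup: each phone has its own independent embedding row.
phone_to_id = {phone: i for i, phone in enumerate(PHONE_FEATURES)}
embedding_table = rng.normal(size=(len(phone_to_id), hidden_dim))


def embed_by_lookup(phones):
    """Return one table row per phone; rows share no structure."""
    return embedding_table[[phone_to_id[p] for p in phones]]


# (b) Feature vectors: every phone goes through the same linear projection.
projection = rng.normal(size=(len(FEATURES), hidden_dim))


def embed_by_features(phones):
    """Project each phone's feature vector into the hidden space."""
    feats = np.array([PHONE_FEATURES[p] for p in phones], dtype=float)
    return feats @ projection
```

In variant (b), "p" and "b" differ only in the "voiced" feature, so their embeddings differ by exactly one row of `projection`. In variant (a), nothing ties the two phones together.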