
From your questions I gather that you don't know too much about text-to-speech synthesis, am I correct?

Q1: To add emotions to the synthesized voice, you need emotional speech in your training material, and you need a good way to describe those emotions to the model. This is not straightforward, because different emotions affect the acoustic features in different ways. An example paper analysing affective (or emotional) speech: https://pdfs.semanticscholar.org/51f3/688143432bd05a1c503d1366687d70ecd8ba.pdf
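One common way to "describe" an emotion is as an extra input feature. The sketch below is only an illustration of that idea, not something Merlin provides out of the box: it appends a one-hot emotion code to every frame's linguistic feature vector. The emotion inventory `EMOTIONS`, the helper names `emotion_one_hot` and `add_emotion`, and the feature values are all hypothetical.

```python
# Hypothetical emotion inventory; a real one depends on how the corpus is labelled.
EMOTIONS = ["neutral", "happy", "sad", "angry"]


def emotion_one_hot(emotion):
    """Return a one-hot list encoding `emotion` over EMOTIONS."""
    if emotion not in EMOTIONS:
        raise ValueError("unknown emotion: %r" % (emotion,))
    return [1.0 if e == emotion else 0.0 for e in EMOTIONS]


def add_emotion(frames, emotion):
    """Append the utterance-level emotion code to every frame's feature vector."""
    code = emotion_one_hot(emotion)
    return [list(frame) + code for frame in frames]


# Two frames of made-up linguistic features for one utterance labelled "sad".
frames = [[0.0, 1.0, 0.25], [1.0, 0.0, 0.50]]
augmented = add_emotion(frames, "sad")
print(augmented[0])
```

A one-hot code treats emotions as unrelated categories. Whether that is a good enough description is exactly the open question: it says nothing about intensity or about how the emotion varies within an utterance.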

Q2: That is very hard to say without knowing what your data looks like and what you consider low quality. I have been playing around with different architectures, but I find that the resulting waveforms sound almost the same whether you use 4 TANH layers, 6 TANH layers, or 4 TANH plus 2 SLSTM layers. I think it is much more important to have an accurate description of the phonemes in your utterances, an accurate alignment of them to the audio, and good linguistic features for stress, accent, and position.
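As a rough way to catch gross alignment problems before training, something like the following sanity check could be run over each label file. It is a sketch under assumptions, not a Merlin tool: it assumes HTK-style label lines of the form `start end label` with times in 100 ns units, and the function names and the 0.05 s tolerance are made up for illustration.

```python
def parse_label_line(line):
    """Split 'start end label' into (start, end, label); times are in 100 ns units."""
    start, end, label = line.split(None, 2)
    return int(start), int(end), label.strip()


def check_alignment(lines, audio_duration_s, tolerance_s=0.05):
    """Return a list of human-readable problems found in one utterance's alignment."""
    problems = []
    segments = [parse_label_line(l) for l in lines if l.strip()]
    prev_end = 0
    for i, (start, end, label) in enumerate(segments):
        if end <= start:
            problems.append("segment %d (%s) has non-positive duration" % (i, label))
        if start != prev_end:
            problems.append(
                "segment %d (%s) does not start where the previous one ends" % (i, label)
            )
        prev_end = end
    # Convert the final end time from 100 ns units to seconds.
    total_s = prev_end * 1e-7
    if abs(total_s - audio_duration_s) > tolerance_s:
        problems.append(
            "labels cover %.3f s but audio is %.3f s" % (total_s, audio_duration_s)
        )
    return problems


# Made-up example: the third segment starts 10 ms after the second one ends.
lines = ["0 500000 sil", "500000 1200000 h", "1300000 2000000 @"]
problems = check_alignment(lines, audio_duration_s=0.2)
for p in problems:
    print(p)
```

This only finds structural errors such as gaps, overlaps, and length mismatches. It cannot tell whether a phoneme boundary sits in the right place acoustically, which still needs listening or visual inspection.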

CSTR-Edinburgh / merlin

Emotions #137