CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/
Apache License 2.0

Emotions #137

Open rmmal opened 7 years ago

rmmal commented 7 years ago

Hello everyone,

Q1: How can I add emotion to the synthesized voice? Which parameters should I change to make it sound happy, angry, or otherwise expressive?

Q2: When my synthesized voice is low quality, is that more likely because the features are bad or because the question file is not effective enough? Are there newer or better features I can use to improve the quality of my voice? Also, why is the default 6 TANH layers with 1024 units each? Do you consider this the best structure?

Thanks in advance.

dreamk73 commented 7 years ago

From your questions I gather that you don't know too much about text-to-speech synthesis. Am I correct?

Q1: To add emotions to the synthesized voice, you need emotional speech in your training material, and you need a good way to describe those emotions to the model. This is not straightforward, because different emotions affect the acoustic features in different ways. An example paper analysing affective (emotional) speech: https://pdfs.semanticscholar.org/51f3/688143432bd05a1c503d1366687d70ecd8ba.pdf
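One common way to "describe" an emotion to a network is to append a one-hot emotion code to every frame's linguistic feature vector. The sketch below is only an illustration of that idea, not something Merlin provides out of the box: the emotion inventory, the function names, and the toy feature values are all hypothetical, and it assumes each training utterance already carries an emotion label.

```python
from typing import List, Sequence

# Hypothetical emotion inventory; the real labels depend on your corpus.
EMOTIONS = ("neutral", "happy", "angry")


def emotion_one_hot(label: str, inventory: Sequence[str] = EMOTIONS) -> List[float]:
    """Return a one-hot vector marking `label` within `inventory`."""
    if label not in inventory:
        raise ValueError(f"unknown emotion label: {label!r}")
    return [1.0 if name == label else 0.0 for name in inventory]


def append_emotion(frames: List[List[float]], label: str) -> List[List[float]]:
    """Append the same emotion code to every frame of one utterance."""
    code = emotion_one_hot(label)
    return [list(frame) + code for frame in frames]


if __name__ == "__main__":
    # Two toy frames with three made-up linguistic features each.
    frames = [[0.0, 1.0, 0.5], [1.0, 0.0, 0.25]]
    for frame in append_emotion(frames, "happy"):
        print(frame)
```

At synthesis time you would pick the code for the emotion you want. Whether the result actually sounds happy or angry still depends on having enough emotional speech in the training data.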

Q2: That is very hard to say without knowing what your data looks like and what you consider low quality. I have been experimenting with different architectures and find that the resulting waveforms sound almost the same whether you use 4 TANH layers, 6 TANH layers, or 4 TANH plus 2 SLSTM layers. I think it is much more important to have an accurate phoneme transcription of your utterances, accurate alignment of those phonemes to the audio, and good linguistic features for stress, accent, and position.
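To put the architecture comparison in perspective, here is a rough sketch that counts the weights and biases in the two purely feedforward stacks mentioned above. The input and output dimensions (425 and 187) are assumptions chosen for illustration, not values from this thread, and the SLSTM variant is left out because recurrent layers have a different parameter layout.

```python
from typing import Dict, List


def feedforward_param_count(input_dim: int, hidden_sizes: List[int], output_dim: int) -> int:
    """Count weights and biases in a fully connected layer stack."""
    total = 0
    previous = input_dim
    for size in list(hidden_sizes) + [output_dim]:
        total += previous * size + size  # weight matrix plus bias vector
        previous = size
    return total


# Hypothetical dimensions, for illustration only.
INPUT_DIM = 425
OUTPUT_DIM = 187

ARCHITECTURES: Dict[str, List[int]] = {
    "4 x TANH (1024)": [1024] * 4,
    "6 x TANH (1024)": [1024] * 6,
}

if __name__ == "__main__":
    for name, sizes in ARCHITECTURES.items():
        count = feedforward_param_count(INPUT_DIM, sizes, OUTPUT_DIM)
        print(f"{name}: {count} parameters")
```

The six-layer stack has noticeably more parameters than the four-layer one, yet in my experience that extra capacity does not translate into an audible difference, which is why I would look at the labels, alignments, and linguistic features first.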