Azure-Samples / Cognitive-Speech-TTS

Microsoft Text-to-Speech API sample code in several languages, part of Cognitive Services.
https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/

The neural voice synthesized by Azure TTS is trembling, unclear, unemotional, and unnatural. I listened to the voice in WeChat on my iPhone 6. #116

Closed DerrickYang007 closed 5 years ago

DerrickYang007 commented 5 years ago

The neural voice synthesized by Azure TTS is trembling, unclear, unemotional, and unnatural. I listened to it on my iPhone 6. I also listened to a voice synthesized by IBM Watson TTS on the same iPhone 6, and it is emotional and natural. So what is the root cause? Can anybody help me check the code and the configuration? I have deleted the URLs. The code runs successfully, but the quality of the resulting voice is bad, which makes me sad. I need your help, Microsoft engineers.

-----------------------------------------code--------------------------------------------------------------

// Dependencies (assumed to match the repository's Node.js sample).
const rp = require('request-promise');
const xmlbuilder = require('xmlbuilder');

var ShortName = 'en-US-JessaNeural'

//Southeast_Asia
var MsKey = ''
var MsUri = ''
var BaseUrl = ''

// Set to true once the service answers with HTTP 200.
var RequestOk = false

// Gets an access token.
function getAccessToken(subscriptionKey) {
    let options = {
        method: 'POST',
        uri: MsUri,
        headers: {
            'Ocp-Apim-Subscription-Key': subscriptionKey
        }
    }

    return rp(options);
}

// Converts text to speech using the input from readline.
function textToSpeech(accessToken, text) {
    // Create the SSML request. The namespace URLs were deleted before posting.
    let xml_body = xmlbuilder.create('speak')
        .att('version', '1.0')
        .att('xmlns', '')
        .att('xmlns:mstts', '')
        .att('xml:lang', 'en-us')
        .ele('voice')
        .att('xml:lang', 'en-us')
        .att('name', ShortName)
        .ele('mstts:express-as')
        .att('type', 'cheerful')
        .ele('prosody')
        .att('pitch', 'default')
        .att('rate', 'slow')
        .att('volume', 'loud')
        .txt(text)
        .end();
    // Convert the XML into a string to send in the TTS request.
    let body = xml_body.toString();
    // console.log('xml_body=' + xml_body)
    let options = {
        method: 'POST',
        baseUrl: BaseUrl,
        url: 'cognitiveservices/v1',
        headers: {
            'Authorization': 'Bearer ' + accessToken,
            'cache-control': 'no-cache',
            'User-Agent': 'YOUR_RESOURCE_NAME',
            'X-Microsoft-OutputFormat': 'audio-24khz-160kbitrate-mono-mp3',
            'Content-Type': 'application/ssml+xml'
        },
        // encoding: null makes the response body a raw Buffer.
        encoding: null,
        body: body
    }

    let request_ = rp(options).on('response', (response) => {
        if (response.statusCode == 200) {
            console.log(new Date().getTime() + '--- File retrieved successfully. Your file is ready.')
            RequestOk = true;
        }
    });

    return request_;
};

DerrickYang007 commented 5 years ago

I tried many output formats, but the problem still exists. The output voice is unclear and trembling when played in WeChat on my iPhone, but it plays clearly in Windows Media Player on Windows 10. I also tried different request endpoints, for example West US and Southeast Asia.
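
One way to make the format comparison repeatable is to generate the request headers and an output file name for each candidate format up front, then synthesize the same sentence once per entry. The following is only a sketch: `buildTtsHeaders`, `extensionFor`, and `planFormatComparison` are hypothetical helpers, the header set is copied from the code above, and every format name other than the first is an assumption that should be checked against the service documentation.

```javascript
// Candidate values for the X-Microsoft-OutputFormat header.
// The first is the one used in the original request; the others are assumed
// to be supported by the service and should be verified before use.
const CANDIDATE_FORMATS = [
  'audio-24khz-160kbitrate-mono-mp3',
  'audio-16khz-128kbitrate-mono-mp3',
  'riff-24khz-16bit-mono-pcm',
  'riff-16khz-16bit-mono-pcm',
];

// Hypothetical helper: builds the request headers for one output format.
function buildTtsHeaders(accessToken, outputFormat) {
  return {
    'Authorization': 'Bearer ' + accessToken,
    'cache-control': 'no-cache',
    'User-Agent': 'YOUR_RESOURCE_NAME',
    'X-Microsoft-OutputFormat': outputFormat,
    'Content-Type': 'application/ssml+xml',
  };
}

// Hypothetical helper: picks a file extension that matches the container.
function extensionFor(outputFormat) {
  return outputFormat.startsWith('riff-') ? '.wav' : '.mp3';
}

// Builds one plan entry per format: headers plus an output file name.
function planFormatComparison(accessToken) {
  return CANDIDATE_FORMATS.map((format) => ({
    format,
    headers: buildTtsHeaders(accessToken, format),
    fileName: 'jessa-' + format + extensionFor(format),
  }));
}
```

Saving one file per format and playing each in both players would show whether the trembling follows the format or the player.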

boltomli commented 5 years ago

Thanks for the feedback on voice quality. Let me check.

DerrickYang007 commented 5 years ago

How is it going?

When I change the value of 'mstts:express-as' from 'cheerful' to 'empathy', the voice no longer trembles and the quality is better.

But I hoped the quality would be better still. The volume is also low, even though I set the 'volume' attribute to 100.0.

boltomli commented 5 years ago

The constructed SSML will look like the one below, right? In the few lines I tried, the slow rate affected the overall quality the most. Can you try removing that tag?

<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='en-US'><voice xml:lang='en-US' xml:gender='Female' name='Microsoft Server Speech Text to Speech Voice (en-US, JessaNeural)'><mstts:express-as type="cheerful" /><prosody rate="slow"><prosody volume="100%">The output voice is played unclearly, tremling on the Wechat of my iPhone. But it is played clearly on the window media player on Windows10.</prosody></prosody></voice></speak>
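
To compare the output with and without the slow rate, the SSML can be generated from one function whose prosody options are optional. This is a minimal sketch, not the sample's actual builder: `escapeXml` and `buildSsml` are hypothetical helpers, the namespace URIs are copied from the SSML above, and the nesting (express-as around prosody) follows the original poster's xmlbuilder chain.

```javascript
// Escape characters that are special in XML text and attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&apos;');
}

// Hypothetical helper: builds an SSML string. `style`, `rate`, and `volume`
// are optional; an omitted option produces no tag or attribute at all.
function buildSsml(text, { voice = 'en-US-JessaNeural', style, rate, volume } = {}) {
  let inner = escapeXml(text);

  const prosodyAttrs = [];
  if (rate !== undefined) prosodyAttrs.push(`rate="${escapeXml(rate)}"`);
  if (volume !== undefined) prosodyAttrs.push(`volume="${escapeXml(volume)}"`);
  if (prosodyAttrs.length > 0) {
    inner = `<prosody ${prosodyAttrs.join(' ')}>${inner}</prosody>`;
  }
  if (style !== undefined) {
    inner = `<mstts:express-as type="${escapeXml(style)}">${inner}</mstts:express-as>`;
  }
  return (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" ' +
    'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">' +
    `<voice name="${escapeXml(voice)}">${inner}</voice></speak>`
  );
}

// Two variants of the same sentence: one with the slow rate, one without.
const withSlowRate = buildSsml('Hello there.', { style: 'cheerful', rate: 'slow' });
const withoutRate = buildSsml('Hello there.', { style: 'cheerful' });
console.log(withSlowRate);
console.log(withoutRate);
```

Sending both strings with otherwise identical requests isolates the effect of the rate setting from the effect of the speaking style.
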
DerrickYang007 commented 5 years ago

I removed the 'slow' rate tag and changed 'empathy' back to 'cheerful', then tested the resulting voice both in WeChat on iOS and on Windows 10. When the value of mstts:express-as is 'cheerful', the voice trembles in WeChat on the iPhone. I don't know why. It also sounds like a chatbot rather than a natural human voice. Most importantly, it does not sound as beautiful as a real woman's natural voice. Yet this is a neural voice, not a standard voice. Since the neural voice is not free, I expected higher quality; it is not good enough. Maybe you can listen to the voice yourself. Does it sound beautiful, emotional, and natural, whether on iOS or on Windows 10?

DerrickYang007 commented 5 years ago

The voice in empathy mode sounds a little sad, though it sounds human, natural, and emotional. The voice in cheerful mode sounds bad.

boltomli commented 5 years ago

The actual sound should be the same no matter which platform you are on. Did you try saving the wave stream locally on Windows and then playing it back in WeChat?

The general neural Jessa sounds clear most of the time. Emotional Jessa is experimental, I believe.
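
A quick way to confirm that both devices receive the same audio is to save the response bytes unchanged and compare a digest of the file on each side. The sketch below assumes the response body is a Buffer (the request above sets `encoding: null`); `saveAudioWithDigest` and `digestOfFile` are hypothetical helpers that use only Node's built-in `fs` and `crypto` modules.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Hypothetical helper: writes the audio bytes to disk unchanged and returns
// their SHA-256 digest as a hex string.
function saveAudioWithDigest(audioBuffer, filePath) {
  fs.writeFileSync(filePath, audioBuffer);
  return crypto.createHash('sha256').update(audioBuffer).digest('hex');
}

// Hypothetical helper: SHA-256 digest of a file that already exists on disk.
function digestOfFile(filePath) {
  const bytes = fs.readFileSync(filePath);
  return crypto.createHash('sha256').update(bytes).digest('hex');
}
```

If the digest of the file that reaches the phone matches the digest of the file that plays cleanly on Windows, the bytes are identical and any remaining difference comes from the player or from re-encoding inside the app, not from the synthesis request.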

DerrickYang007 commented 5 years ago

try saving the wave stream locally on Windows and then playing it back in WeChat?

Yes, I have tried saving the wave stream locally on Windows and then playing it back in WeChat several times. I used Emotional Jessa for the test. Maybe it is because I have heard the IBM Watson neural voice Alice. Her voice sounds really beautiful. Emotional Jessa sounds human and emotional, but not beautiful. By "not beautiful" I mean that it lacks what Mandarin calls 抑扬顿挫, a rising and falling cadence.
You can compare Microsoft's Emotional Jessa neural voice with the IBM Watson neural voice Alice.

DerrickYang007 commented 5 years ago

The actual sound should be the same no matter which platform you are on. Did you try saving the wave stream locally on Windows and then playing it back in WeChat?

The general neural Jessa sounds clear most of the time. Emotional Jessa is experimental, I believe.

I hope Microsoft will offer a better emotional neural voice with a more expressive, rising and falling cadence (抑扬顿挫). Laughing...