Closed slothful0928 closed 1 year ago
Have you double-checked that the SSML you pass to the synthesizer includes the <mstts:viseme type='FacialExpression'/> XML element? Without it, you won't get viseme animations back.
From the logs, I can see that SSML being sent to the service is:
[384249]: 61917ms SPX_DBG_TRACE_VERBOSE: usp_tts_engine_adapter.cpp:173 SSML sent to TTS cognitive service: <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xmlns:emo='http://www.w3.org/2009/10/emotionml' xml:lang='en-US'><voice name='en-US-CoraNeural'> Hi there! How can I help you?</voice></speak>
Note that the viseme element is not present.
I was able to get back viseme animations as follows:
using System.Diagnostics;
using Microsoft.CognitiveServices.Speech;

var speechSynthesisVoiceName = "en-US-CoraNeural";
var speechLang = "en-US";
var response = "Hi there! How can I help you?";

var speechConfig = SpeechConfig.FromSubscription(subscriptionKey, region);
speechConfig.SpeechRecognitionLanguage = speechLang;
speechConfig.SpeechSynthesisVoiceName = speechSynthesisVoiceName;

using var synthesizer = new SpeechSynthesizer(speechConfig);
synthesizer.VisemeReceived += (s, e) =>
{
    Debug.Print($"Viseme event received. Audio offset: " +
        $"{e.AudioOffset / 10000}ms, viseme id: {e.VisemeId}, Animation: {e.Animation}");
};

var ssml = $@"<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='{speechLang}'>
    <voice name='{speechSynthesisVoiceName}'>
        <mstts:viseme type='FacialExpression'/>
        {response}
    </voice>
</speak>";
await synthesizer.SpeakSsmlAsync(ssml);
Please note that not all viseme events raised will contain animations.
The SDK documentation indicates that each viseme event will have Animation frames. https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-speech-synthesis-viseme?tabs=3dblendshapes&pivots=programming-language-csharp
From the documentation: "Each viseme event includes a series of frames in the Animation SDK property."
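Per that documentation, the Animation property is a JSON string whose payload carries a FrameIndex and a BlendShapes array of per-frame facial-expression coefficients. A minimal parsing sketch (the sample payload below is hand-made for illustration, with 3 coefficients per frame instead of the full set a real payload carries):

```python
import json

def parse_animation(payload: str):
    """Parse a blend-shape Animation payload into (frame_index, frames).

    Each row in "BlendShapes" is one frame of facial-expression
    coefficients, following the format described in the viseme docs.
    """
    data = json.loads(payload)
    return data["FrameIndex"], data["BlendShapes"]

# Hypothetical sample payload with 2 frames of 3 coefficients each.
sample = '{"FrameIndex": 0, "BlendShapes": [[0.1, 0.0, 0.2], [0.15, 0.05, 0.2]]}'
index, frames = parse_animation(sample)
print(index, len(frames))  # 0 2
```

An empty Animation string, as reported in this issue, would make json.loads fail, so a real handler should skip events where e.Animation is empty.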
In our application, we are facing the same issue, where most events return a viseme but empty blendshape data.
I'm facing the same issue using the Python SDK in a streaming use case. The visemes and the blend shapes do not arrive in the same event objects: first the visemes for a chunk of audio are returned, and then the VisemeReceived callback is triggered multiple times with viseme_id=0 and audio_offset=0, each such event carrying the animation JSON string for some number of frames. This appears to be by design, but it also doubles the time before the data is back and audio playback and rendering can start.
Is this the way it's intended to work?
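Given that split delivery, one way to cope is to buffer viseme IDs and blend-shape frames separately in the callback and only start rendering once frames arrive. A rough sketch under that assumption (the VisemeBuffer class and its handler are hypothetical, not part of the SDK; only the event fields viseme_id, audio_offset, and animation mirror the SDK's event args):

```python
import json
from dataclasses import dataclass, field

@dataclass
class VisemeBuffer:
    """Collects viseme IDs and blend-shape frames from separate events."""
    visemes: list = field(default_factory=list)   # (audio_offset, viseme_id) pairs
    frames: list = field(default_factory=list)    # blend-shape frame rows

    def on_viseme(self, viseme_id: int, audio_offset: int, animation: str):
        # Per the behavior described above, events with a non-empty
        # animation string carry the frames, while events with an empty
        # animation string carry only the viseme ID.
        if animation:
            self.frames.extend(json.loads(animation)["BlendShapes"])
        else:
            self.visemes.append((audio_offset, viseme_id))

buf = VisemeBuffer()
buf.on_viseme(19, 500000, "")                                         # viseme-only event
buf.on_viseme(0, 0, '{"FrameIndex": 0, "BlendShapes": [[0.1, 0.2]]}')  # frames-only event
print(len(buf.visemes), len(buf.frames))  # 1 1
```

In a real handler this method would be wired to synthesizer.viseme_received, and rendering would begin once buf.frames is non-empty.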
Describe the bug When requesting the event.Animation JSON string for the blend shapes, nothing is returned.
To Reproduce Steps to reproduce the behavior:
Expected behavior A JSON string containing the blend shapes should be returned.
Platform, Operating System, and Programming Language
Additional context logFile.txt