Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Retrieving Blendshapes from Speech SDK Unity #1808

Closed slothful0928 closed 1 year ago

slothful0928 commented 1 year ago

Describe the bug
When requesting the event.Animation JSON string for the blendshapes, nothing is returned.

To Reproduce
Steps to reproduce the behavior:

var ssml = @$"<speak version='1.0' xml:lang='en-US' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts'>
    <voice name='{speechSynthesisVoiceName}'>
        <mstts:viseme type='FacialExpression'/>
        {response}
    </voice>
</speak>";
//Debug.Log(ssml);
synthesizer.VisemeReceived += (s, e) =>
{
    Debug.Log($"Viseme event received. Audio offset: " +
              $"{e.AudioOffset / 10000}ms, viseme id: {e.VisemeId}.Animation{e.Animation}");

    // `Animation` is an xml string for SVG or a json string for blend shapes
    string animation = e.Animation;
    Debug.Log(animation);

};

Expected behavior
A JSON string containing the blendshapes should be returned.

Platform, Operating System, and Programming Language

Additional context
logFile.txt

ralph-msft commented 1 year ago

Have you double-checked that the SSML you pass to the synthesizer includes the <mstts:viseme type='FacialExpression'/> XML element? Without it, you won't get viseme animations back.

From the logs, I can see that the SSML being sent to the service is:

[384249]: 61917ms SPX_DBG_TRACE_VERBOSE:  usp_tts_engine_adapter.cpp:173 SSML sent to TTS cognitive service: <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xmlns:emo='http://www.w3.org/2009/10/emotionml' xml:lang='en-US'><voice name='en-US-CoraNeural'> Hi there! How can I help you?</voice></speak>

Note that the viseme element is not present.

I was able to get back viseme animations as follows:

var speechSynthesisVoiceName = "en-US-CoraNeural";
var speechLang = "en-US";
var response = "Hi there! How can I help you?";

var speechConfig = SpeechConfig.FromSubscription(subscriptionKey, region);
speechConfig.SpeechRecognitionLanguage = speechLang;
speechConfig.SpeechSynthesisVoiceName = speechSynthesisVoiceName;

using var synthesizer = new SpeechSynthesizer(speechConfig);

// Subscribe before calling SpeakSsmlAsync so no viseme events are missed.
synthesizer.VisemeReceived += (s, e) =>
{
    Debug.Print($"Viseme event received. Audio offset: " +
              $"{e.AudioOffset / 10000}ms, viseme id: {e.VisemeId}, Animation: {e.Animation}");
};

// The <mstts:viseme type='FacialExpression'/> element is what requests blend shape animation.
var ssml = $@"<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts' xml:lang='{speechLang}'>
  <voice name='{speechSynthesisVoiceName}'>
    <mstts:viseme type='FacialExpression'/>
    {response}
  </voice>
</speak>";

await synthesizer.SpeakSsmlAsync(ssml);

Please note that not all viseme events raised will contain animations.
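
A handler that only acts on the events which actually do carry animation data could look like this (a minimal sketch building on the snippet above; it uses only the event properties already shown):

synthesizer.VisemeReceived += (s, e) =>
{
    // Many events carry only a viseme ID; the blend-shape JSON arrives on a
    // subset of events, so guard before trying to use it.
    if (string.IsNullOrEmpty(e.Animation))
    {
        return;
    }

    Debug.Print($"Blend-shape JSON ({e.Animation.Length} chars) at " +
                $"{e.AudioOffset / 10000}ms.");
};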

rosedalerk commented 1 year ago

The SDK documentation indicates that each viseme event will have Animation frames. https://learn.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-speech-synthesis-viseme?tabs=3dblendshapes&pivots=programming-language-csharp

From the documentation: "Each viseme event includes a series of frames in the Animation SDK property."

In our application we are facing the same issue: most events return a viseme ID but empty blendshape data.
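
For what it's worth, a minimal parsing sketch for the payload described in those docs might look like the following. The BlendShapeChunk class and the Json.NET dependency are assumptions for illustration, not part of the SDK:

using Newtonsoft.Json;   // assumes Json.NET is available in the Unity project
using UnityEngine;

// Hypothetical container mirroring the frame layout described in the linked docs:
// a FrameIndex plus a BlendShapes array with one row of coefficients per frame.
public class BlendShapeChunk
{
    public int FrameIndex;
    public float[][] BlendShapes;
}

public static class BlendShapeParser
{
    // Parses one e.Animation payload; returns null when the event carried no animation.
    public static BlendShapeChunk Parse(string animationJson)
    {
        if (string.IsNullOrEmpty(animationJson))
        {
            return null;
        }

        var chunk = JsonConvert.DeserializeObject<BlendShapeChunk>(animationJson);
        Debug.Log($"Chunk starting at frame {chunk.FrameIndex} holds {chunk.BlendShapes.Length} frames.");
        return chunk;
    }
}

Inside the VisemeReceived handler, BlendShapeParser.Parse(e.Animation) would then yield the frames to apply to the character rig for the events that do carry data.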

mattma1970 commented 7 months ago

I'm facing the same issue using the Python SDK for a streaming use case. The visemes and the blendshapes do not arrive in the same event objects. First, the visemes for a chunk of audio are returned; then the viseme-received callback is triggered multiple times with viseme_id = 0 and audio_offset = 0, each of those events carrying the animation JSON string for some number of frames. This appears to be by design, but it also doubles the time until the data is back and audio playback and rendering can start.
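
In the C# used elsewhere in this thread, the delivery pattern described above would roughly correspond to buffering the payloads as they arrive, along these lines (a sketch only, assuming the chunked delivery is indeed intentional and reusing the synthesizer from the earlier snippet):

using System.Collections.Generic;

// Collect animation payloads separately from the plain viseme-ID events.
var animationChunks = new List<string>();

synthesizer.VisemeReceived += (s, e) =>
{
    if (!string.IsNullOrEmpty(e.Animation))
    {
        // Events carrying a JSON chunk of frames (reported above with
        // viseme id 0 and audio offset 0 in the Python case).
        animationChunks.Add(e.Animation);
    }
    else
    {
        // Plain viseme events: ID and audio offset only.
        Debug.Print($"Viseme {e.VisemeId} at {e.AudioOffset / 10000}ms (no animation payload).");
    }
};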

Is this the way it's intended to work?