Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License
2.96k stars 1.86k forks

Windows: Calling `SpeechSynthesizer.StopSpeakingAsync()` does not stop synthesis #2350

Open bpasero opened 7 months ago

bpasero commented 7 months ago

Describe the bug

A call to SpeechSynthesizer.StopSpeakingAsync() does not stop synthesis for a very long time, up to 30 seconds. The log file is here: speech.log

This issue was previously reported without action at https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/1836 and https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/2264

To Reproduce

We are building a Node.js binding for the Speech SDK; its C++ sources mimic the samples. The synthesis is implemented here: https://github.com/microsoft/node-speech/blob/967976ce0f4887a2b5b27f486e5209a51588516f/src/main.cc#L477

The call to StopSpeakingAsync here: https://github.com/microsoft/node-speech/blob/967976ce0f4887a2b5b27f486e5209a51588516f/src/main.cc#L539

To reproduce from that module:


const t = createSynthesizer({
  modelPath: '<path to TTS model>',
  modelName: 'Microsoft Server Speech Text to Speech Voice (en-US, AriaNeural)',
  modelKey: '<model key>',
}, (error, result) => {
  if (error) {
    console.error(error);
  } else {
    console.log(result);
  }
});
t.synthesize(`
Now more than ever, developers are expected to build voice-enabled applications that can reach a global audience. With the same voice persona across languages, organizations can keep their brand image more consistent. To support the growing need for a single voice to speak multiple languages, particularly in scenarios such as localization and translation, a multi-lingual neural TTS voice is brought out in public preview.

This new Jenny Multilingual voice (preview), with US English as the primary/default language, can speak 13 secondary languages, each at the fluent level: German (Germany), English (Australia), English (Canada), English (Canada), Spanish (Spain), Spanish (Mexico), French (Canada), French (France), Italian (Italy), Japanese (Japan), Korean (Korea), Portuguese (Brazil), Chinese (Mandarin, Simplified).
`);
setTimeout(() => t.stop(), 5000);

Expected behavior

Calling SpeechSynthesizer.StopSpeakingAsync immediately stops synthesis.

Version of the Cognitive Services Speech SDK

1.37.0

Platform, Operating System, and Programming Language

Windows

Additional context

This issue does not reproduce on macOS or Linux!

ralph-msft commented 7 months ago

Thanks for using the Speech SDK and filing this issue. We have been able to reproduce the issue you are seeing and have added a fix to our backlog. We will post here once we have an update.

As a temporary workaround, you may want to consider passing a null value as the AudioConfig to the SpeechSynthesizer constructor. You can then subscribe to the Synthesizing event, which is raised whenever the SDK receives new audio from the service, and pass that audio to your player of choice, giving you more control over when playback stops. Please note, however, that calling StopSpeakingAsync may still stall for ~10-15 seconds due to the underlying issue.
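A minimal JavaScript sketch of this workaround, assuming the JS package microsoft-cognitiveservices-speech-sdk (the original repro is a C++ Node binding, so the `sdk` parameter, function names, and chunk handling below are my assumptions, not code from this thread):

```javascript
// Sketch of the suggested workaround: construct the synthesizer with a null
// AudioConfig and collect audio chunks from the Synthesizing event, instead of
// letting the SDK play audio itself.
// `sdk` is assumed to be the imported microsoft-cognitiveservices-speech-sdk module.

// Pure helper: merge collected ArrayBuffer chunks into one byte array.
function concatAudioChunks(chunks) {
  const total = chunks.reduce((n, c) => n + c.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(new Uint8Array(c), offset);
    offset += c.byteLength;
  }
  return out;
}

function createPullSynthesizer(sdk, speechConfig, onChunk) {
  // Passing null as the AudioConfig tells the SDK not to play audio itself.
  const synthesizer = new sdk.SpeechSynthesizer(speechConfig, null);
  // Synthesizing fires each time a new chunk of audio arrives from the service.
  synthesizer.synthesizing = (_sender, event) => {
    onChunk(event.result.audioData); // ArrayBuffer with the latest chunk
  };
  return synthesizer;
}
```

With this shape, "stop" means stopping your own player and discarding further chunks, so it no longer depends on how quickly StopSpeakingAsync returns.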

(B-7172399)

bpasero commented 7 months ago

Thanks, good to see it can be reproduced and I am looking forward to the fix 👍

github-actions[bot] commented 6 months ago

This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.

wtto00 commented 6 months ago

Hello, I am using version 1.37.0, and I have encountered a similar issue.

stopSpeaking does not immediately terminate the playback process; it only stops the speaker from playing.

For example, if I generate 14 seconds of audio and call stopSpeaking at the 10-second mark, then let speakResult = synthesizer?.speakSsml(ssml) returns immediately with speakResult?.reason = 9 (SPXResultReason_SynthesizingAudioCompleted) instead of 1 (SPXResultReason_Canceled). Moreover, after a 4-second wait it is the callback registered with synthesizer?.addSynthesisCompletedEventHandler that fires, rather than the one registered with synthesizer?.addSynthesisCanceledEventHandler.

let ssml = "<speak version='1.0' xml:lang='en-US' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='http://www.w3.org/2001/mstts'><voice name='\(identifier)'>\(mstts)</voice></speak>"
let speakResult = try self.synthesizer?.speakSsml(ssml)
print(speakResult?.reason ?? "")
try synthesizer?.stopSpeaking()

Here is a demo repository: https://github.com/wtto00/flutter_azure_speech/tree/main/example

The Swift code is at https://github.com/wtto00/flutter_azure_speech/blob/eb419b89fcc16903cabaa8f9820559d93ed80861/ios/Classes/AzureSpeechPlugin.swift#L294

github-actions[bot] commented 5 months ago

This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.

bpasero commented 5 months ago

Please keep.

github-actions[bot] commented 5 months ago

This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.

bpasero commented 5 months ago

Please keep.

baby-bibin commented 4 months ago

Any update on this issue? I am also stuck with this.

streamlify commented 2 months ago

Wow, I can't believe this issue is still open after so long. Any update?

harrybin commented 1 month ago

Hi, I suffer from the same issue. Still reproducible. Is there any workaround?

wtto00 commented 1 month ago

A temporary solution: Use connection.close() instead of synthesizer.stopSpeaking().

harrybin commented 1 month ago

Hm... where do you get the connection object from? In my case the connection is somewhere under the hood of the SpeechSynthesizer, which I create using the config from SpeechConfig.fromSubscription.

wtto00 commented 1 month ago

connection from Connection.fromSynthesizer

Here is an example: stopSynthesize
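For reference, a minimal sketch of that approach with the JavaScript SDK (the function name and `sdk` parameter are my assumptions; the linked stopSynthesize example is the authoritative version). `Connection.fromSynthesizer` and `Connection.close` are public SDK APIs:

```javascript
// Sketch of the workaround: close the underlying service connection instead of
// calling stopSpeaking, which aborts the in-flight synthesis stream.
// `sdk` is assumed to be the imported microsoft-cognitiveservices-speech-sdk module.
function stopSynthesisHard(sdk, synthesizer) {
  // Obtain the connection object that lives "under the hood" of the synthesizer.
  const connection = sdk.Connection.fromSynthesizer(synthesizer);
  connection.close(); // drops the service connection, aborting synthesis
  synthesizer.close(); // dispose; create a fresh synthesizer for the next utterance
}
```

The trade-off: the synthesizer cannot be reused after this, so callers must recreate it before the next speak call.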

harrybin commented 1 month ago

> connection from Connection.fromSynthesizer
>
> Here is an example: stopSynthesize

Ah, thanks. Cancellation this way is a bit faster than synthesizer.close(), but the audio that is already buffered still plays for several seconds. I have now found a workaround by accessing the private audio object:

// DANGER! FRAGILE: uses private objects to work around issue:
// https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/2350
function KillAudio(synthesizer: sdk.SpeechSynthesizer) {
    // Reach through the SDK's private fields to the underlying HTMLAudioElement.
    const audio: HTMLAudioElement | undefined =
        synthesizer.privAdapter?.privSessionAudioDestination?.privDestination?.privAudio;
    if (audio) {
        audio.pause();
        audio.currentTime = 0;
    }
}

(This immediately stops the audio playback.) Then I call synthesizer.close(). But this is fragile code accessing private objects; I need to find a way to access that audio object in an official way.
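One possible official route (my suggestion, not confirmed anywhere in this thread): the JS SDK's SpeakerAudioDestination exposes public pause() and close() methods, so constructing the synthesizer with AudioConfig.fromSpeakerOutput keeps a supported handle on playback instead of reaching into privAdapter. A hedged sketch, with `sdk` assumed to be the imported microsoft-cognitiveservices-speech-sdk module:

```javascript
// Sketch: keep a public handle on the audio player by supplying our own
// SpeakerAudioDestination, so stopping playback needs no private fields.
function createStoppableSynthesizer(sdk, speechConfig) {
  const player = new sdk.SpeakerAudioDestination(); // public pause()/close() API
  const audioConfig = sdk.AudioConfig.fromSpeakerOutput(player);
  const synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig);
  const stop = () => {
    player.pause();      // halt playback immediately
    player.close();      // release the audio destination
    synthesizer.close(); // dispose the synthesizer
  };
  return { synthesizer, stop };
}
```

As with the connection workaround, the synthesizer is disposed by stop(), so a fresh one must be created for the next utterance.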

aman-vohra-007 commented 2 weeks ago

I have used the microsoft-cognitiveservices-speech-sdk for visemes, so I used a React ref for the synthesizer.

import * as sdk from "microsoft-cognitiveservices-speech-sdk"

const synthesizeSpeech = text => {
  return new Promise((resolve, reject) => {
    if (!speechSynthesizerRef.current) {
      const speechConfig = sdk.SpeechConfig.fromSubscription(
        import.meta.env.VITE_SPEECH_KEY,
        import.meta.env.VITE_SPEECH_REGION
      )
      speechSynthesizerRef.current = new sdk.SpeechSynthesizer(speechConfig)
      let speechStarted = false
      .....
    }
  })
}

And to stop the speech, I did this:

const stopSpeech = () => {
  try {
    setImageIndex(0)
    setIsAudioPlaying(false)
    if (speechSynthesizerRef.current) {
      const audio =
        speechSynthesizerRef.current.privAdapter?.privSessionAudioDestination?.privDestination?.privAudio
      if (audio) {
        audio.pause()
        audio.currentTime = 0
        speechSynthesizerRef.current.close()
        speechSynthesizerRef.current = null
      }
    }
  } catch (e) {
    console.error("Error in stopSpeech:", e)
  }
}

This helped stop the speech as well as reset the synthesizer, so if you play it again, the audio starts correctly.