Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Audio Output on iOS Safari not working #1741

Closed TobiasG95 closed 1 year ago

TobiasG95 commented 1 year ago

Describe the bug: Text-to-speech output does not work on iOS Safari, but the same code works on Chrome, macOS Safari, and Edge.

To Reproduce: I used the following simplified code to synthesize the same text as the demo page (https://azure.microsoft.com/en-us/products/cognitive-services/text-to-speech/#features):

document.getElementById("test").addEventListener("click", function () {
        let sdkApiKey = "xxxxx"; //anonymized
        let sdkRegion = "westeurope";
        let speechConfig1 = SpeechSDK.SpeechConfig.fromSubscription(sdkApiKey, sdkRegion);
        // The SDK property is camelCase; a capitalized name would be silently ignored.
        speechConfig1.speechSynthesisOutputFormat = SpeechSDK.SpeechSynthesisOutputFormat.Audio24Khz96KBitRateMonoMp3;
        let azurePlayer1 = new SpeechSDK.SpeakerAudioDestination();

        let audioConfig1 = SpeechSDK.AudioConfig.fromSpeakerOutput(azurePlayer1);

        let azureSynth = new SpeechSDK.SpeechSynthesizer(speechConfig1, audioConfig1);
        azureSynth.speakSsmlAsync(
            '<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-JennyNeural"><prosody rate="0%" pitch="0%">You can replace this text with any text you wish. You can either write in this text box or paste your own text here.\n' +
            '\n' +
            'Try different languages and voices. Change the speed and the pitch of the voice. You can even tweak the SSML (Speech Synthesis Markup Language) to control how the different sections of the text sound. Click on SSML above to give it a try!\n' +
            '\n' +
            'Enjoy using Text to Speech!</prosody></voice></speak>',
            function (result) {
                window.console.log("success", result);
            },
            function (err) {
                window.console.log("Error", err);
            });
    });

The element with the id "test" is just a simple button.

Expected behavior: I expect the text to be spoken the same way on all browsers, or at least to get some audio output on iOS.

Version of the Cognitive Services Speech SDK: To my knowledge the most recent one, and specifically the same as on the Azure demo page: https://azure.microsoft.com/scripts/Acom/Components/cognitiveServicesDemos/speechJsSdk/microsoft.cognitiveservices.speech.sdk.bundle.js?v=cba17aff7d5806570da8eaf3c40c38a87d8a00cb3fa198a24735fb5814390a7f

Platform, Operating System, and Programming Language

Additional context

Thanks for your help, and let me know if you have any ideas about what can be changed in the code to make it work. It is especially confusing because the Azure demo website works just fine.

glharper commented 1 year ago

@TobiasG95 Thanks for using Speech SDK, and writing this issue up. I was actually not able to get audio output on any iOS browser using your code sample. Using this code sample, however, I can get audio output on iOS browsers.

I will test again today and see if the difference is using speakTextAsync() vs speakSsmlAsync(), but I think the idiosyncratic iOS restrictions on autoplaying media (see this SO answer) could be the cause of this issue.

Tagging @yulin-li, in case he can add insight here.

glharper commented 1 year ago

@TobiasG95 I was able to get audio output by modifying the speakSsmlAsync call like this:

        azureSynth.speakSsmlAsync(
            '<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="en-US"><voice name="en-US-JennyNeural"><prosody rate="0%" pitch="0%">You can replace this text with any text you wish. You can either write in this text box or paste your own text here.\n' +
            '\n' +
            'Try different languages and voices. Change the speed and the pitch of the voice. You can even tweak the SSML (Speech Synthesis Markup Language) to control how the different sections of the text sound. Click on SSML above to give it a try!\n' +
            '\n' +
            'Enjoy using Text to Speech!</prosody></voice></speak>',
            function (result) {
                window.console.log("success", result);
                azureSynth.close();
                azureSynth = undefined;
            },
            function (err) {
                window.console.log("Error", err);
                azureSynth.close();
                azureSynth = undefined;
            });

Let me know if that works for you.
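The change above closes the synthesizer in both the success and the error callback. A minimal sketch of the same pattern as a reusable helper follows. The name speakSsmlAndClose is hypothetical and not part of the Speech SDK; it only assumes the object passed in exposes speakSsmlAsync(ssml, onSuccess, onError) and close(), as azureSynth does in the snippets in this thread.

```javascript
// Hypothetical helper (not part of the Speech SDK): wraps speakSsmlAsync in a
// Promise and closes the synthesizer whether synthesis succeeds or fails.
function speakSsmlAndClose(synthesizer, ssml) {
    return new Promise(function (resolve, reject) {
        synthesizer.speakSsmlAsync(
            ssml,
            function (result) {
                // Success path: release the synthesizer, then hand back the result.
                synthesizer.close();
                resolve(result);
            },
            function (err) {
                // Error path: release the synthesizer as well, then report the error.
                synthesizer.close();
                reject(err);
            });
    });
}
```

It could be called as speakSsmlAndClose(azureSynth, ssml).then(...).catch(...), with a new synthesizer created for each call, since a closed synthesizer is not reused in the snippets above.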

TobiasG95 commented 1 year ago

@glharper Thanks for your quick response.

I just tried your proposed fix and it worked! It seems that not calling azureSynth.close() was my mistake, correct? Odd that it works on Android and desktop, but not on iOS.

When not using a button and a click event, I still have some problems on iOS, but that is almost certainly because of the restrictions on autoplaying media that you mentioned.
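One way to work within that restriction is to defer synthesis until the first user gesture. The sketch below is only an illustration: runOnFirstClick is a hypothetical helper, and this thread confirms only that starting synthesis from a button click handler works on iOS, not that every gesture-based approach will.

```javascript
// Hypothetical helper: runs `action` once, on the first "click" event that
// `target` receives, so that playback is started from a user-gesture handler.
function runOnFirstClick(target, action) {
    target.addEventListener("click", function () {
        action();
    }, { once: true }); // the listener removes itself after the first click
}
```

For example, runOnFirstClick(document.getElementById("test"), startSpeech), where startSpeech is an assumed function that creates the synthesizer and calls speakSsmlAsync.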

Thank you again for your help. Azure Text to Speech is a great addition to our app.