Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

The text-to-speech feature takes too long to start. #2652

Open BigVeila opened 2 weeks ago

BigVeila commented 2 weeks ago

Hello everyone, I’m using the TTS feature, but it takes around 1.5 to 2 seconds for Azure to start playing the audio. This makes for a poor user experience, because my app relies on users actively listening to each sentence. I initialize SPXSpeechSynthesizer in a singleton ahead of time, but playback is still delayed. Here is a simplified version of my code:

func start(text: String, voiceName: String, rate: String, pitch: String, onStart: @escaping () -> Void, onCompleted: @escaping () -> Void, onError: @escaping (Error) -> Void) {

    self.onCompleted = onCompleted
    self.onError = onError
    self.onStart = onStart

    let ssml = """
    <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='\(LANGUAGUE)'>
        <voice name='\(voiceName)'>
            <prosody rate="\(rate)" pitch="\(pitch)">
                        \(text)
            </prosody>
        </voice>
    </speak>
    """

    DispatchQueue.global(qos: .background).async { [weak self] in
        do {
            try self?.synthesizer?.speakSsml(ssml)
        } catch {
            DispatchQueue.main.async {
                onError(error)
            }
        }
    }
}
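
A side note on the snippet above: `text` is interpolated into the SSML as-is, so a sentence containing `&` or `<` would produce malformed XML. Below is a minimal sketch of one way to guard against that. The helper names `escapeForSSML` and `makeSSML` are hypothetical (they are not part of the Speech SDK), and the voice name in the usage note is only an example value. This addresses SSML validity only and is not expected to change the startup latency.

```swift
/// Hypothetical helper (not part of the Speech SDK): escapes the five
/// XML special characters so arbitrary text can be embedded in SSML.
func escapeForSSML(_ text: String) -> String {
    var escaped = ""
    escaped.reserveCapacity(text.count)
    for character in text {
        switch character {
        case "&": escaped += "&amp;"
        case "<": escaped += "&lt;"
        case ">": escaped += "&gt;"
        case "\"": escaped += "&quot;"
        case "'": escaped += "&apos;"
        default: escaped.append(character)
        }
    }
    return escaped
}

/// Hypothetical helper: builds the same SSML shape as the snippet above,
/// escaping every interpolated value before it is inserted.
func makeSSML(text: String, voiceName: String, language: String,
              rate: String, pitch: String) -> String {
    return """
    <speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='\(escapeForSSML(language))'>
        <voice name='\(escapeForSSML(voiceName))'>
            <prosody rate="\(escapeForSSML(rate))" pitch="\(escapeForSSML(pitch))">\(escapeForSSML(text))</prosody>
        </voice>
    </speak>
    """
}
```

With these in place, `start(...)` could build its payload with something like `makeSSML(text: text, voiceName: "en-US-JennyNeural", language: "en-US", rate: rate, pitch: pitch)` and pass the result to `speakSsml` unchanged.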

ChuuFu commented 2 weeks ago

@BigVeila Thanks for your feedback. The delay you describe comes from how the API is being used. The recommended approach is streaming mode: synthesize and play the audio at the same time, rather than waiting for synthesis to finish before starting playback.

Here is the official guide on lowering synthesis latency for reference: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-lower-speech-synthesis-latency?pivots=programming-language-csharp

BigVeila commented 1 week ago

@ChuuFu I have already tried the methods described there, and they didn’t make any difference. Do you have a TTS sample I could look at?

ChuuFu commented 1 week ago

@BigVeila Please use SpeakSsmlAsync. Here is a sample: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/ae5ef1003a63a21f2f7c2351f54a796d1ee1dd0b/samples/python/console/speech_synthesis_sample.py#L253

BigVeila commented 1 week ago

@ChuuFu I tried following the sample you provided and also referred to the Objective-C code (I’m developing an iOS app), but it still doesn’t produce any sound. Is there any documentation that explains the differences between speech_synthesis_to_push_audio_output_stream and the other functions, such as speech_synthesis_to_pull_audio_output_stream?