Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

How to automatically end recognition after the user is silent for N seconds? #2582

Open Quilljou opened 1 week ago

Quilljou commented 1 week ago

I am using the Swift pod 'MicrosoftCognitiveServicesSpeech-iOS', '~> 1.25', for continuous speech recognition. I want to implement a feature where recognition automatically stops if the user doesn't speak for N seconds after it starts. What is the best practice for this? How can I achieve it?

    return await withCheckedContinuation { continuation in
        print(region)
        var speechConfig: SPXSpeechConfiguration?
        do {
            speechConfig = try SPXSpeechConfiguration(authorizationToken: token, region: region)
        } catch {
            print("Error \(error) happened")
        }

        let systemLanguage = Locale.current.languageCode
        speechConfig?.speechRecognitionLanguage = systemLanguage == "zh" ? "zh-CN" : "en-US"

        let audioConfig = SPXAudioConfiguration()
        guard let speechConfig = speechConfig else {
            continuation.resume(returning: "")
            return
        }

        // try! would crash if recognizer creation fails; resume with an empty result instead
        guard let reco = try? SPXSpeechRecognizer(speechConfiguration: speechConfig, audioConfiguration: audioConfig) else {
            continuation.resume(returning: "")
            return
        }

        var finalResult = ""
        var lastAudioInputTime = Date() // refreshed on every Recognizing event
        var silenceTimer: Timer? // watchdog that ends recognition after silence

        // Partial hypotheses arrive here while the user is speaking
        reco.addRecognizingEventHandler { _, event in
            print("Recognizing")
            lastAudioInputTime = Date()
            if let onRecognizing = onRecognizing {
                DispatchQueue.main.async {
                    onRecognizing(event.result.text ?? "")
                }
            }
        }

        // A final phrase was recognized; append it to the accumulated result
        reco.addRecognizedEventHandler { _, event in
            if event.result.reason == .recognizedSpeech {
                finalResult += event.result.text ?? ""
                print("Recognized: \(event.result.text ?? "")")
                DispatchQueue.main.async {
                    onNewText(event.result.text ?? "")
                }
            }
        }

        // Fired after recognition stops; deliver the accumulated text
        reco.addSessionStoppedEventHandler { _, _ in
            DispatchQueue.main.async {
                silenceTimer?.invalidate()
                print("stopped")
                continuation.resume(returning: finalResult)
            }
        }

        reco.addSessionStartedEventHandler { _, _ in
            DispatchQueue.main.async {
                // Poll once per second and stop recognition after VAD_TIME seconds
                // (an app-defined constant) without any Recognizing activity
                silenceTimer = Timer.scheduledTimer(withTimeInterval: 1.0, repeats: true) { _ in
                    let silenceDuration = Date().timeIntervalSince(lastAudioInputTime)
                    if silenceDuration >= VAD_TIME {
                        // stopContinuousRecognition blocks, so call it off the main thread
                        DispatchQueue.global().async {
                            try? reco.stopContinuousRecognition()
                        }
                        silenceTimer?.invalidate()
                    }
                }

                onRecognitionStarted()
            }
        }

        do {
            try reco.startContinuousRecognition()
        } catch {
            print("Error during continuous speech recognition: \(error)")
            continuation.resume(returning: "")
        }
    }

This is my current implementation, but sometimes it ends the recognition early.

pankopon commented 3 days ago

Hi, if your Speech SDK version is 1.25, that's a very old version from January 2023, and you should upgrade to the current release (1.40.0 as of this writing).

You can use silence timeouts to end recognition. Please see the attached Python example (which was a lot faster to put together...) that demonstrates the principle using microphone input: timeout.zip

There are two silence timeouts that can be controlled, initial and end (SPXPropertyId.speechServiceConnectionInitialSilenceTimeoutMs and SPXPropertyId.speechServiceConnectionEndSilenceTimeoutMs respectively, ref. SPXPropertyId).
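
In Swift that looks roughly like this (a sketch: 5000 ms is just an example value, set on the SPXSpeechConfiguration before creating the recognizer):

    // Example values only: both silence timeouts set to 5 seconds
    speechConfig.setPropertyTo("5000", by: .speechServiceConnectionInitialSilenceTimeoutMs)
    speechConfig.setPropertyTo("5000", by: .speechServiceConnectionEndSilenceTimeoutMs)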

For example, the sequence of events from the very beginning could look like this:

initial silence timeout
initial silence timeout
recognizing speech
recognized speech
end silence timeout
initial silence timeout
recognizing speech
...

Whenever either of these silence timeouts occurs, there is a SpeechEndDetected event that you can subscribe to with addSpeechEndDetectedEventHandler. So if you want to automatically end recognition after N seconds of silence, whether before or after any speech has been recognized, set both timeouts to the same desired value (with setPropertyTo, example) and signal the end of recognition in the SpeechEndDetected handler.

Note that you should not call stopContinuousRecognition inside an event handler which is called by the SDK; instead, use some method to notify your application thread (similar to the Python example). Also, although the timeout values are specified in milliseconds, the actual moment when the timeout occurs can deviate by 100-300 ms depending on the service, network, etc., so it's better to use full seconds and not expect millisecond accuracy.
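
Putting it together in Swift, a minimal sketch of the approach (not a drop-in: `token` and `region` come from your app, 5000 ms is an example value, and the function blocks, so run it off the main thread):

    import Foundation
    import MicrosoftCognitiveServicesSpeech

    // Sketch only: recognize until ~5 seconds of silence, then stop and
    // return the accumulated text.
    func recognizeUntilSilence(token: String, region: String) throws -> String {
        let speechConfig = try SPXSpeechConfiguration(authorizationToken: token, region: region)
        speechConfig.setPropertyTo("5000", by: .speechServiceConnectionInitialSilenceTimeoutMs)
        speechConfig.setPropertyTo("5000", by: .speechServiceConnectionEndSilenceTimeoutMs)

        let reco = try SPXSpeechRecognizer(speechConfiguration: speechConfig,
                                           audioConfiguration: SPXAudioConfiguration())
        var finalResult = ""
        let endOfSpeech = DispatchSemaphore(value: 0)

        reco.addRecognizedEventHandler { _, evt in
            if evt.result.reason == .recognizedSpeech {
                finalResult += evt.result.text ?? ""
            }
        }

        // Raised when either silence timeout elapses. Only signal here;
        // do not call stopContinuousRecognition inside an SDK callback.
        reco.addSpeechEndDetectedEventHandler { _, _ in
            endOfSpeech.signal()
        }

        try reco.startContinuousRecognition()
        endOfSpeech.wait()                    // blocks until ~5 s of silence
        try reco.stopContinuousRecognition()  // safe: we're on the app thread
        return finalResult
    }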

Quilljou commented 3 days ago

Thanks so much. And is there a way to know when the user starts speaking? Some "speech started" event like that? @pankopon

pankopon commented 3 days ago

Yes there is addSpeechStartDetectedEventHandler for the SpeechStartDetected event. This can appear a bit earlier than the first Recognizing event.
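
For example (a quick sketch, assuming `reco` is your SPXSpeechRecognizer):

    // Sketch: subscribe to SpeechStartDetected on an existing recognizer
    reco.addSpeechStartDetectedEventHandler { _, evt in
        print("Speech start detected at audio offset \(evt.offset)")
    }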

Quilljou commented 3 days ago

It seems this event does not correspond to the user starting to speak, on version 1.2.0.


pankopon commented 2 days ago

Do you mean "speech start detected" is not what you expect for "user starts speaking"? That's really the only indication of that kind available. If you mean it's not working with your Speech SDK installation, make sure you are using a current Speech SDK release; 1.40.0 is the latest as of now.

Quilljou commented 2 days ago

Yes. I have now upgraded to the 1.4.0 version, but I still face two problems. First, I don't know when the user starts speaking: the SpeechStartDetected event lags behind the user actually speaking, so I have to do VAD myself. Secondly, I set speechServiceConnectionEndSilenceTimeoutMs to 3000, but when the user stops speaking, I only get the SessionStopped event after maybe 6 seconds.

pankopon commented 2 days ago

If you want to know when speech is starting before the audio has even been sent to the service, then yes, currently you'll have to detect it in the application. SpeechStartDetected comes from the service when the audio has been processed to the extent that the presence of speech has been confirmed (for real, not just that there is something other than silence).
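
If you need an earlier, local signal, you could run a rough energy-threshold check on the microphone yourself. For illustration, a sketch using plain AVFoundation (not a Speech SDK API; the threshold and buffer size are assumptions to tune):

    import AVFoundation

    // Rough on-device "speech started" heuristic: taps the microphone and
    // fires a callback the first time the input RMS crosses a threshold.
    final class SimpleSpeechStartDetector {
        private let engine = AVAudioEngine()
        private let threshold: Float

        init(threshold: Float = 0.02) {
            self.threshold = threshold
        }

        func start(onSpeechStart: @escaping () -> Void) throws {
            let input = engine.inputNode
            let format = input.outputFormat(forBus: 0)
            var triggered = false
            input.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
                guard !triggered, let samples = buffer.floatChannelData?[0] else { return }
                let n = Int(buffer.frameLength)
                var sum: Float = 0
                for i in 0..<n { sum += samples[i] * samples[i] }
                let rms = n > 0 ? sqrt(sum / Float(n)) : 0
                if rms > self.threshold {
                    triggered = true
                    DispatchQueue.main.async { onSpeechStart() }
                }
            }
            try engine.start()
        }

        func stop() {
            engine.inputNode.removeTap(onBus: 0)
            engine.stop()
        }
    }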

Silence timeouts are only triggered by a fixed duration of silence. With EndSilenceTimeout of X seconds, the timeout occurs after ~X seconds of silence following the latest Recognized phrase; with InitialSilenceTimeout of Y seconds, it occurs after ~Y seconds of silence anywhere else. Yes, "SpeechEndDetected" as the event name can be a bit misleading in that sense, but it's really just based on what's been configured for EndSilenceTimeout after Recognized speech. (With InitialSilenceTimeout it's even more misleading since there was no SpeechStartDetected... we may adjust the naming of exposed events in the future.) "Phrase end detected" is when a Recognized phrase is reported. The condition for "the user has stopped speaking and is not likely to continue" is up to you to decide, but silence timeouts are one way to detect it.

Quilljou commented 1 day ago

Thanks for replying. That means I can't know the moment the user starts speaking out of silence, and I also can't know promptly when the user stops speaking, from the capabilities of the Speech SDK alone. I have to implement VAD myself on device.