Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

buffer filled, pausing addition of binaries until space is made #2046

Closed albseb511 closed 1 year ago

albseb511 commented 1 year ago

Describe the bug: Speech recognition and speech synthesis fail every now and then. The error "buffer filled, pausing addition of binaries until space is made" comes up as a response. I have not been able to debug this.

To Reproduce: The bug is quite random, but it comes up roughly 1 out of 5 times, especially during longer conversations.

Expected behavior: The application should not crash, but it seems to crash while it is synthesizing or recognizing audio.

Version of the Cognitive Services Speech SDK

Platform

Additional context: I am not sure if I have a similar problem.

buffer filled, pausing addition of binaries until space is made

This is the error that is coming up. This issue comes up on two cases

  1. When an audio is generated using text to speech
  2. When an audio is recognized using speech to text.

Help is appreciated

glharper commented 1 year ago

@albseb511 Thank you for using the JS Speech SDK, and for writing this issue up. Could you provide sample code for a single-user reproduction? This error should only occur during speech synthesis; it is an audio output error in the SDK caused by a sourceBuffer.appendBuffer throw. Unfortunately, the SDK eats the original error message (which could help us understand why the appendBuffer call failed) and replaces it with that "buffer filled" message.
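To make the point above concrete, here is a minimal sketch (not the SDK's actual source; FakeSourceBuffer and appendAudioChunk are invented for illustration) of how catching an appendBuffer failure and surfacing only a generic message loses the root cause:

```javascript
// Minimal sketch of the error-swallowing pattern described above.
// In a browser, SourceBuffer.appendBuffer throws (e.g. a QuotaExceededError
// DOMException) when the media buffer is full; this fake stands in for it.
class FakeSourceBuffer {
    constructor(capacity) {
        this.capacity = capacity;
        this.used = 0;
    }
    appendBuffer(chunk) {
        if (this.used + chunk.length > this.capacity) {
            throw new Error("QuotaExceededError: appendBuffer failed");
        }
        this.used += chunk.length;
    }
}

function appendAudioChunk(sourceBuffer, chunk) {
    try {
        sourceBuffer.appendBuffer(chunk);
        return { ok: true };
    } catch (originalError) {
        // The original error is dropped and a generic message is surfaced,
        // which is why the root cause is hard to debug from the app side.
        return {
            ok: false,
            message: "buffer filled, pausing addition of binaries until space is made",
        };
    }
}

const sb = new FakeSourceBuffer(4);
console.log(appendAudioChunk(sb, [1, 2, 3]).ok);      // true
console.log(appendAudioChunk(sb, [4, 5, 6]).message); // generic message, original lost
```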

albseb511 commented 1 year ago

I am using this in a React app. It's a bit too large to share the entire thing.

The following is sample code used for speech recognition:

const startSpeechRecognition = async () => {
    const audioConfig = sdk.AudioConfig.fromDefaultMicrophoneInput();

    let token = config?.t;
    let region = config?.r;
    if (!config) {
        ({ token, region } = (await getAuthTokenAzure()) ?? {});
        if (!token || !region) return console.error(`Error getting token or region`);
        // setConfig does not update `config` until the next render,
        // so use the freshly fetched values directly below.
        setConfig({ t: token, r: region });
    }

    const speechConfig = sdk.SpeechConfig.fromAuthorizationToken(token, region);
    speechConfig.speechRecognitionLanguage = 'en-US';
    recognizerRef.current = new sdk.SpeechRecognizer(speechConfig, audioConfig);

    setLoading.on();

    recognizerRef.current.recognizing = handleRecognizing;
    recognizerRef.current.recognized = handleRecognized;
    recognizerRef.current.canceled = handleCanceled;
    recognizerRef.current.sessionStopped = handleSessionStopped;

    recognizerRef.current.startContinuousRecognitionAsync(
        () => {},
        (err) => {
            console.error("startContinuousRecognitionAsync failed:", err);
            recognizerRef.current?.close();
            setLoading.off();
        }
    );
};

const handleRecognizing = (s, e) => {
    // handle response
};

const handleRecognized = (s, e) => {
    if (e.result.reason === sdk.ResultReason.RecognizedSpeech) {
       // logic
    } else if (e.result.reason === sdk.ResultReason.NoMatch) {
       // logic
    }
};

const handleCanceled = (s, e) => {
    if (e.reason === sdk.CancellationReason.Error) {
        console.error("Error in speech recognition:", e.errorDetails);
    }
    recognizerRef.current?.stopContinuousRecognitionAsync();
};

const handleSessionStopped = (s, e) => {
    recognizerRef.current?.stopContinuousRecognitionAsync();
};

albseb511 commented 1 year ago

We have seen the system breaking while we are speaking and while synthesizing as well. But let me double-check that for you, since I am guessing you are saying that audio data getting pushed into the array buffer only happens during synthesis. I will try to see if the issue comes up again while I am speaking.

In the meantime, this is the speech synthesis part.

I did try v1.3.1 as well, and the error still came up.

const startTextToSpeech = async (text: string, cancelEndCallback?: boolean): Promise<() => void> => {
        // Creates an audio instance.
        if (playerRef.current?.id) {
            playerRef.current?.pause()
            playerRef.current?.close()
            audioConfig.current?.close()
            playerRef.current = null
        }
        const player = new sdk.SpeakerAudioDestination()
        playerRef.current = player

        player.onAudioEnd = () => {
            // handle response
        }
        audioConfig.current = sdk.AudioConfig.fromSpeakerOutput(player)
        if (!config) {
            const { token, region } = (await getAuthTokenAzure()) ?? {}
            if (!token || !region) {
                console.error(`Error getting token or region`)
                return () => {}
            }
            setConfig({ t: token, r: region })
            const speechConfig = sdk.SpeechConfig.fromAuthorizationToken(token, region)
            speechConfig.speechSynthesisVoiceName = voice
            speechSythesizerRef.current = new sdk.SpeechSynthesizer(speechConfig, audioConfig.current)
        } else {
            const speechConfig = sdk.SpeechConfig.fromAuthorizationToken(config?.t, config?.r)
            speechConfig.speechSynthesisVoiceName = voice
            speechSythesizerRef.current = new sdk.SpeechSynthesizer(speechConfig, audioConfig.current)
        }

        // Receives a text from console input and synthesizes it to speaker.
        try {
            speechSythesizerRef.current.speakTextAsync(
                text,
                (result) => {
                    if (result) {
                        // debug statement with description
                        speechSythesizerRef.current?.close()
                        audioConfig.current?.close()
                        speechSythesizerRef.current = null
                        return result.audioData
                    }
                },
                (error) => {
                    console.log(error)
                    speechSythesizerRef.current?.close()
                    audioConfig.current?.close()
                    speechSythesizerRef.current = null
                }
            )
            speechSythesizerRef.current.synthesisStarted = () => {
                // debug statement
            }
        } catch (err) {
            toast({
                title: 'Error',
                description: 'Error with text to speech',
                status: 'error',
                duration: 5000,
                isClosable: true,
            })
        }
        return () => {
            console.log(`closing player`)
            playerRef.current?.pause()
            playerRef.current?.close()
            audioConfig.current?.close()
            playerRef.current = null
            speechSythesizerRef.current?.close()
            speechSythesizerRef.current = null
            event.current?.close()
        }
    }

glharper commented 1 year ago

@albseb511 Thanks for including sample code. Before you assign to any "Ref.current" variable, please make sure the existing Ref.current has been closed and nulled. This could easily be a memory leak (or just non-optimally disposing of resources) where an existing recognizer/synthesizer instance loses its assignment to a Ref.current without being closed. Example of what should be added:

        if (speechSythesizerRef.current) {
            speechSythesizerRef.current.close();
            speechSythesizerRef.current = null;
        }
        if (!config) {
            const { token, region } = (await getAuthTokenAzure()) ?? {}
            if (!token || !region) return console.log(`Error getting token or region`)
            setConfig({ t: token, r: region })
            const speechConfig = sdk.SpeechConfig.fromAuthorizationToken(token, region)
            speechConfig.speechSynthesisVoiceName = voice
            speechSythesizerRef.current = new sdk.SpeechSynthesizer(speechConfig, audioConfig.current)
        } else {
            const speechConfig = sdk.SpeechConfig.fromAuthorizationToken(config?.t, config?.r)
            speechConfig.speechSynthesisVoiceName = voice
            speechSythesizerRef.current = new sdk.SpeechSynthesizer(speechConfig, audioConfig.current)
        }

Could you add that (and for the speechRecognizer as well) and see if that helps?
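The close-before-reassign pattern recommended above can be factored into a small helper. This is a hypothetical sketch (replaceRef is not an SDK API; any object with a close() method works) showing the idea with a React-style ref:

```javascript
// Hypothetical helper (not part of the Speech SDK): close and null the
// existing recognizer/synthesizer held by a ref before assigning a new one,
// so the old instance is never leaked.
function replaceRef(ref, createNew) {
    if (ref.current) {
        ref.current.close(); // release the old instance first
        ref.current = null;
    }
    ref.current = createNew();
    return ref.current;
}

// Usage with a mock closeable object in place of a real synthesizer:
const ref = { current: null };
let closed = 0;
const make = () => ({ close: () => { closed += 1; } });
replaceRef(ref, make); // first assignment, nothing to close
replaceRef(ref, make); // closes the first instance before replacing it
console.log(closed); // 1
```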

albseb511 commented 1 year ago

I think we do manage this, but let me check again whether this is an issue, since we handle cleanup in almost every place.

And we have noticed that it breaks on the first use, so there wouldn't have been other instances at that point.

albseb511 commented 1 year ago

I think this was missed at the beginning, like you mentioned. We do clean up, but only on failure, on change, and in a few other cases; it makes sense to do it at the start as well. We will test it with live users and get back to you.

albseb511 commented 1 year ago

@glharper @yulin-li Thanks so much, I think most of the issues are sorted. I should have asked this earlier; this seems to have been a miss for a while. We have still observed 1-2 cases breaking, but I am assuming that is due to some other reason.

A bit off topic, I had two more follow-up questions:

  1. I have sometimes observed latency in responses from the service, sometimes even up to 30 seconds. Is that something that can happen? Are there rate limits at the subscription level? (I have not observed this recently, but it did happen a few times; I will monitor it.)
  2. Are there concurrent request limits on the service? Would it be able to handle 1000 concurrent users for speech synthesis and recognition? I was looking at the Docker containers, and the documentation mentioned that an instance with 12 GB of memory and 4 cores can manage 6 concurrent users.

If you can also point me to the right resources, that would be great.
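Until the quota question is settled, a common client-side mitigation for concurrency limits is to cap the number of in-flight requests. This is a hypothetical sketch (createLimiter is invented here, not an SDK feature) of a simple promise-based concurrency limiter:

```javascript
// Hypothetical client-side limiter (not an SDK feature): cap how many
// synthesis/recognition requests run at once to stay under service quotas.
function createLimiter(maxConcurrent) {
    let active = 0;
    const queue = [];
    const next = () => {
        if (active >= maxConcurrent || queue.length === 0) return;
        active += 1;
        const { task, resolve, reject } = queue.shift();
        task()
            .then(resolve, reject)
            .finally(() => { active -= 1; next(); });
    };
    // Wrap a promise-returning task; it starts only when a slot is free.
    return (task) => new Promise((resolve, reject) => {
        queue.push({ task, resolve, reject });
        next();
    });
}

// Usage: wrap each speech call so at most 5 run concurrently, e.g.
// const limit = createLimiter(5);
// limit(() => synthesizeText("hello"));  // synthesizeText is hypothetical
```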

glharper commented 1 year ago

@albseb511 Here is the quota and rate limit info for Azure subscriptions. It does appear that 1000 concurrent users may run into issues on an S0 subscription.