Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

SpeechRecognizer receives Canceled event with error "Due to service inactivity, the client buffer exceeded maximum size. Resetting the buffer" #2210

Closed jelizavetazaharova closed 9 months ago

jelizavetazaharova commented 10 months ago

Describe the bug SpeechRecognizer receives a Canceled event with the ErrorCode = ServiceTimeout and ErrorDetails = "Due to service inactivity, the client buffer exceeded maximum size. Resetting the buffer. SessionId: f5ffa7a174444e25a0baf52a639d545b" and stops processing the audio file

To Reproduce Steps to reproduce the behavior:

  1. Upload an audio file, which contains a silence fragment longer than 10 seconds or a music fragment
  2. Receive a Canceled event with the ServiceTimeout error code

Here is a code sample we are using:

using var audioConfig = AudioConfig.FromWavFileInput(filePath);
var speechRecognizer = new SpeechRecognizer(_speechConfig, audioConfig);

speechRecognizer.Recognized += (object? sender, SpeechRecognitionEventArgs speechEvent) =>
{
       _logger.LogDebug("RECOGNIZED {sessionID} text of {Duration} duration,  {Length} length,", speechEvent.SessionId, speechEvent.Result.Duration, speechEvent.Result.Text.Length);
};
speechRecognizer.SessionStopped += (s, e) =>
{
       _logger.LogDebug("SESSION-END {sessionID}", e.SessionId);
};
speechRecognizer.Canceled += (s, e) =>
{
       _logger.LogDebug("SESSION-CANCELED {sessionID}: {Reason} - {errorCode}", e.SessionId, e.Reason, e.ErrorCode);
};

await speechRecognizer.StartContinuousRecognitionAsync();
...
await speechRecognizer.StopContinuousRecognitionAsync();

Expected behavior An audio file should be fully processed and the recognition should stop only after processing the file fully
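For reference, a minimal sketch of one way to keep continuous recognition running until the session ends, rather than stopping at an arbitrary point. It assumes the common pattern of completing a TaskCompletionSource from the SessionStopped and Canceled handlers. The class name FileTranscription and the method RunAsync are hypothetical and not part of the SDK, and Console.WriteLine stands in for the logger used above.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

// Hypothetical helper for illustration; not part of the Speech SDK.
public static class FileTranscription
{
    public static async Task RunAsync(SpeechConfig speechConfig, string filePath)
    {
        using var audioConfig = AudioConfig.FromWavFileInput(filePath);
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        // Completed when the session stops or recognition is canceled.
        var done = new TaskCompletionSource<bool>(
            TaskCreationOptions.RunContinuationsAsynchronously);

        recognizer.Recognized += (s, e) =>
            Console.WriteLine($"RECOGNIZED {e.SessionId}: {e.Result.Text}");

        recognizer.Canceled += (s, e) =>
        {
            // ErrorDetails carries the full service message, e.g. the
            // "client buffer exceeded maximum size" text from this report.
            Console.WriteLine(
                $"CANCELED {e.SessionId}: {e.Reason} - {e.ErrorCode} - {e.ErrorDetails}");
            // Canceled is also raised with Reason == EndOfStream when the
            // file has been fully read, so only errors count as failure here.
            done.TrySetResult(e.Reason != CancellationReason.Error);
        };

        recognizer.SessionStopped += (s, e) => done.TrySetResult(true);

        await recognizer.StartContinuousRecognitionAsync();
        await done.Task; // wait for end of stream, session stop, or an error
        await recognizer.StopContinuousRecognitionAsync();
    }
}
```

Logging e.ErrorDetails alongside e.ErrorCode makes it easier to tell a ServiceTimeout caused by the client buffer apart from other timeouts.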

Version of the Cognitive Services Speech SDK Version 1.33.0

Platform, Operating System, and Programming Language

Additional context For additional information attaching a log file logfile.txt

rhurey commented 10 months ago

Thanks for reporting this issue.

I looked at the telemetry for that session and the service didn't segment the 2nd phrase like I'd have expected it to.

I tried a couple of different audio files I created to get a repro but wasn't successful.

Do you have an audio file you can share? It will help our service team investigate the problem.

rhurey commented 10 months ago

Actually, looking at #2212, it occurs to me you may be hitting the same problem, but instead of having the audio segment, you're hitting the default maximum buffer size, which is precariously close to the maximum phrase length.

Looking at the configuration for the request I do see the segmentation timeout is set at 5s, which is definitely a factor here.

jelizavetazaharova commented 10 months ago

Thank you @rhurey for your quick response. Here is the failing part of the file: Audio.zip

After the first sentence, there is a silence fragment and then there are a few more sentences.

  1. We receive the first transcribed sentence on the Recognized event with the following transcript: Earbuds, come on, go, go, go, go.
  2. Then we receive a Canceled event with the ServiceTimeout error
  3. And then one more Recognized event is received with the following recognition result: I think I got it. Good. Yeah. I'm so sorry about. Yeah. I should not be doing interviews right after getting home from an international trip. I'm just like a complete mess. So, yeah. How are you doing? Not bad. It's actually kind of slow this week. So it worked out. OK, OK, good. I am very glad. So, yeah, I mean, I kind of I messaged you like, in the e-mail, I already kind of explained that this is a short profile for decibel. And so generally what I do with these profiles, because they are so short, is I only ask maximum two or three questions just because I think it's more effective for everyone's time. And the main thing that I wanted to ask you about with the new Stygian Crown record was there was a comment that the band made about the theme of the record. And I thought that it was pretty interesting and I wanted to dig into it further. And it was about how the theme of the album is about the origin of monsters and how they're conceptualized.
  4. The recognition of the file stops, so the rest of the file is not processed.

As for the segmentation timeout, we set the SegmentationSilenceTimeoutMs property to 5000. The documentation states that a higher value allows longer pauses in speech.

rhurey commented 9 months ago

Thanks for the follow-up.

Turning up the segmentation timeout does allow for longer pauses in speech that will be recognized as a single phrase. There are a number of scenarios where it can be beneficial to allow for a longer than typical pause between words when recognizing a single phrase.

There are tradeoffs, though. The biggest being that the Speech Service will only return final recognized results after enough silence has happened, which can result in increased latency.

There are also other interconnected pieces that can increase friction.

One of these is the SDK's resiliency buffer, which can hold just over a minute of audio, and that is what's happening here. The service segmented the phrase at ~55 seconds, but that was just too late to keep the buffer from filling.

jelizavetazaharova commented 9 months ago

@rhurey Referring to what you said, what would be a possible solution for us? Should we set segmentation silence timeout to a smaller value? If so, will it work for us if we want to allow longer pauses? Can this issue happen again if the smaller silence timeout value is set and the phrase is segmented at ~59 seconds?

rhurey commented 9 months ago

Good questions...

5s is a long time for someone to pause in the middle of what is a single sentence. Larger values are likely more useful if you're doing single phrase recognition as part of a command system. It looks like your scenario is more transcription based where something much closer to the default of 500ms will produce acceptable results.

Having a lower segmentation timeout will reduce the odds of filling the client buffer in a couple of different ways. First, the odds of phrases being segmented go up as the value gets closer to the default. Second, as the phrase length gets longer, the Speech Service becomes more aggressive at detecting the end of a phrase, and starting from a lower initial number will definitely increase the chances of the phrase end being found before the client buffer overflows.
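As a minimal sketch of that change in C#, assuming the timeout is set through SpeechConfig.SetProperty with PropertyId.Speech_SegmentationSilenceTimeoutMs (the key and region strings are placeholders, and 500 is the value discussed above, not a tuned recommendation):

```csharp
using Microsoft.CognitiveServices.Speech;

// Placeholder credentials: substitute your own subscription key and region.
var speechConfig = SpeechConfig.FromSubscription("YOUR_KEY", "YOUR_REGION");

// The value is in milliseconds and is passed as a string.
// 5000 was the value that led to the ServiceTimeout in this report;
// 500 is the value suggested above for transcription-style audio.
speechConfig.SetProperty(PropertyId.Speech_SegmentationSilenceTimeoutMs, "500");
```

The same speechConfig instance is then passed to the SpeechRecognizer constructor, so the setting applies to every recognizer created from it.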

anshika24khathuriya commented 9 months ago

SESSION STARTED: SessionEventArgs(session_id=0de3b2a37dbb444ab3b9bfcc25042e03)
CANCELED SpeechRecognitionCanceledEventArgs(session_id=0de3b2a37dbb444ab3b9bfcc25042e03, result=SpeechRecognitionResult(result_id=5f7c56eb546448c4a24c09136284ce1f, text="", reason=ResultReason.Canceled))


Please look at the above error and help. The same code works fine on my local machine but gives an error on an Ubuntu server.

rhurey commented 9 months ago

@anshika24khathuriya can you open a new issue? It helps keep problems separate and makes tracking progress easier.

anshika24khathuriya commented 9 months ago

> @anshika24khathuriya can you open a new issue? It helps keep problems separate and makes tracking progress easier.

The issue is that it gives wrong scores: for every file it gives 0 0 0 0 scores.

ralph-msft commented 9 months ago

@anshika24khathuriya Please file a separate issue so we can better assist you.

Closing this out to keep our issue list current.

jelizavetazaharova commented 9 months ago

@rhurey Sorry for the late reply. We tried the solution you suggested, and decreasing the silence timeout to 500 ms resolved the ServiceTimeout issue for us. Thanks a lot for your help!