Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK

C# Microsoft.CognitiveServices.Speech.SpeechRecognizer takes 1+ second to dispose #937

Closed Saturn49 closed 3 years ago

Saturn49 commented 3 years ago

I'm trying to create a C# app that needs to recognize single words very quickly. I've used RecognizeOnceAsync() on an in-memory stream, and while the recognition works fine, the Microsoft.CognitiveServices.Speech.SpeechRecognizer can only be used once and takes 1100ms or more to dispose.

My code is fairly straightforward:

                var current = DateTime.Now;
                var prev = DateTime.Now;
                using (var audioInputStream = AudioInputStream.CreatePushStream()) 
                {
                    current = DateTime.Now;
                    System.Console.WriteLine("Constructed stream:" + (current - prev).TotalMilliseconds);
                    prev = current;
                    using (var audioConfig = AudioConfig.FromStreamInput(audioInputStream))
                    {
                        current = DateTime.Now;
                        System.Console.WriteLine("Constructed config:" + (current - prev).TotalMilliseconds);
                        prev = current;
                        using (var recognizer = new Microsoft.CognitiveServices.Speech.SpeechRecognizer(msSpeechConfig, audioConfig))
                        {
                            current = DateTime.Now;
                            System.Console.WriteLine("Constructed recognizer:" + (current - prev).TotalMilliseconds);
                            prev = current;
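                            // buffer_ is a MemoryStream field holding the captured audio;
                            // swap in a fresh one, write this one to the push stream, and
                            // close the stream so the service sees end-of-audio.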
                            var local_buffer = buffer_;
                            buffer_ = new MemoryStream();
                            local_buffer.Position = 0;
                            audioInputStream.Write(local_buffer.ToArray());
                            audioInputStream.Close();

                            var result = await recognizer.RecognizeOnceAsync();
                            this.SpeechResult(result.Text);  // Do something with the result
                            current = DateTime.Now;
                            System.Console.WriteLine("Got Result:" + (current - prev).TotalMilliseconds);
                            prev = current;
                        }
                        current = DateTime.Now;
                        System.Console.WriteLine("Destructed recognizer:" + (current - prev).TotalMilliseconds);
                        prev = current;
                    }
                    current = DateTime.Now;
                    System.Console.WriteLine("Destructed config:" + (current - prev).TotalMilliseconds);
                    prev = current;
                }
                current = DateTime.Now;
                System.Console.WriteLine("Destructed Stream:" + (current - prev).TotalMilliseconds);
                prev = current;

The result is surprising:

Constructed stream:0
Constructed config:0
Constructed recognizer:0.9996
Got Result:609.4551
Destructed recognizer:1122.9987
Destructed config:0
Destructed Stream:0
Constructed stream:0
Constructed config:0
Constructed recognizer:0.9995
Got Result:530.5271
Destructed recognizer:1110.0095
Destructed config:0
Destructed Stream:0

Is there any way to reuse a Recognizer? Any idea what's taking 1+ seconds to dispose of one?

trrwilson commented 3 years ago

Thanks for reporting this issue!

We recently identified and fixed something that's quite likely related to this long Dispose() you found. What target framework are you using (e.g. .NET Core, .NET 4.7.2)? I ask because the identified issue appears to be specific to .NET Core -- or at least it doesn't repro on plain "legacy" .NET environments, which is why it was missed.

If this is the same problem, it should be fixed in the upcoming 1.15 release, and versions from before the regression (such as 1.10) shouldn't hit it either.

If the problem persists with 1.15+ or 1.10, we'd like to get more information to further investigate. If you could collect a log via the steps located here, it'd be a huge help: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-use-logging
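
For reference, enabling that log is a single property on the config, set before creating any recognizer (the key/region/path values below are placeholders):

            // Turn on Speech SDK file logging; the file path is just an example.
            var config = SpeechConfig.FromSubscription(subscriptionKey, serviceRegion);
            config.SetProperty(PropertyId.Speech_LogFilename, "ms_speech_sdk.log");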

Thanks again!

Saturn49 commented 3 years ago

I'm using a WinForms app targeting .NET Framework 4.7.2.

My workaround was to abandon the Microsoft-provided library and call the REST API directly.

I could probably revert to the SDK and gather logs if that would be helpful.

Saturn49 commented 3 years ago

ms_speech_sdk.log Log attached. It should include two instances of the recognizer destructor, taking 1135ms and 1118ms.

trrwilson commented 3 years ago

Thanks, the log looks great and pretty clearly confirms the same issue we worked on:

(728292): 9879ms SPX_DBG_TRACE_SCOPE_ENTER:  thread_service.cpp:44 CSpxThreadService::Term
(728292): 10966ms SPX_DBG_TRACE_SCOPE_EXIT:  thread_service.cpp:44 CSpxThreadService::Term

That's the phantom 1000-1100ms being eaten up. It's also great information that it reproduces with WinForms independently of .NET Core -- we'll investigate that on our side, but I'm optimistic that both 1.15 (very, very imminent) and 1.10 will avoid the delay.

trrwilson commented 3 years ago

Actually, I just spun up a Windows Forms app for the repro and confirmed both the problem (1.14 has ~1100ms per dispose) and the fix (1.15 will not do this) in that environment, also with 4.7.2. So I'm not just optimistic, but confident it's the same problem. Whew!

Saturn49 commented 3 years ago

Thanks for the quick response, Travis! I'm glad your imminent release fixes this issue.

However, I suspect I'll stick with my raw REST implementation anyway, as it is still significantly faster to get a response (~200ms vs 500+ms) for my particular use case.

trrwilson commented 3 years ago

Depending on your scenario, something like the batch transcription (REST) APIs might make a lot of sense, too: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/batch-transcription

How many files/streams are you looping through? An advantage of the above (or simulating it yourself, though that's not recommended) is being able to parallelize easily (and not have to sequentially wait for things like connections to establish). Depending on your use case and priorities, we can probably recommend a good approach.

Startup with the full client SDK is a little longer (owing to the establishment of a WebSocket connection for fast back-and-forth), but the tradeoff is much quicker responses for real-time audio, including intermediate results. REST/batch is for sure the right approach for "offline" transcription of a bunch of saved audio, though.

Saturn49 commented 3 years ago

My application is simple - math flash cards for kids. So it's single-word recognition (numbers only) that needs to be accurate but as fast as possible. I had originally tried continuous dictation, but I need answers finalized ASAP (and sending silence seemed wasteful and costly). The Speech-to-text REST API for short audio is what I've settled on: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-speech-to-text#speech-to-text-rest-api-for-short-audio
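
The core of that call is just an authenticated POST of the audio bytes; a minimal sketch (the key, region, and file name are placeholders; the endpoint shape and headers are per the doc above):

            // Minimal sketch of the short-audio REST call described above.
            using (var http = new HttpClient())
            {
                http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", subscriptionKey);
                var url = $"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=en-US";
                var content = new ByteArrayContent(File.ReadAllBytes("answer.wav"));
                content.Headers.TryAddWithoutValidation("Content-Type", "audio/wav; codecs=audio/pcm; samplerate=16000");
                var response = await http.PostAsync(url, content);
                // The response JSON includes RecognitionStatus and DisplayText.
                var json = await response.Content.ReadAsStringAsync();
            }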

I would love something that kept an established connection, but I need a way to send really short (~1 sec) audio every few seconds. I couldn't find a way to reuse a Recognizer though.

trrwilson commented 3 years ago

Got it -- thanks for the great detail! And cool app.

I do think we can derive a way for you to reuse the recognizer and I suspect that'll give you even better behavior, given you get to keep a "hot" connection. And full disclaimer, I'm not trying to sell one vs. the other--we do both!--nor sway you from anything if it's working well.

For reuse, the easiest approach, if it's not prohibitive to make work, is to use microphone audio input in the config (FromDefaultMicrophoneInput or omitted AudioConfig) and just call RecognizeOnceAsync repeatedly. E.g. something like this:

            using (var recognizer = new SpeechRecognizer(config))
            {
                for (int i = 0; i < 5; i++)
                {
                    Print($"Listening...");
                    var result = await recognizer.RecognizeOnceAsync();
                    Print($"Result: {result.Text}");
                }
            }

should give something like:

[39:00.091] Listening...
[39:02.774] Result: 1.
[39:02.808] Listening...
[39:05.093] Result: 2.
[39:05.120] Listening...
[39:07.325] Result: 3.
[39:07.365] Listening...
[39:09.716] Result: Poor.
[39:09.757] Listening...
[39:11.805] Result: 5.

Side note: the "4" misreco can be helped out by using PhraseListGrammar as outlined here: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-speech-to-text?tabs=script%2Cbrowser%2Cwindowsinstall&pivots=programming-language-csharp#improve-recognition-accuracy
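
For the digits-only scenario, that could be as simple as the following sketch (it assumes the recognizer from the snippet above):

            // Sketch: bias recognition toward the expected vocabulary (number words here).
            var phraseList = PhraseListGrammar.FromRecognizer(recognizer);
            foreach (var number in new[] { "one", "two", "three", "four", "five" })
            {
                phraseList.AddPhrase(number);
            }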

If you don't have an easy way to just let the SDK handle that aspect of it and you need to write a buffer from another source as your audio, you can still achieve the same effect in a somewhat more involved way. Writing an audio stream adapter that proxies the audio from multiple sources can let you use the same config/recognizer across multiple "source writes." This shows up in the samples in a few places; one I'm familiar with is from this not-yet-merged PR into our dialog sample: https://github.com/Azure-Samples/Cognitive-Services-Voice-Assistant/blob/user/travisw/keywordRecognizer/clients/csharp-uwp/UWPVoiceAssistantSample/AudioInput/PullAudioInputSink.cs

There's more going on there than what's strictly needed, but the general idea is: implement a pull-style audio callback that buffers whatever you push into it, serve that data from its Read() calls, and reset it between utterances -- the recognizer keeps reading from one stream while you swap audio sources behind it.

Using a hastily trimmed/modified version of the file linked above, as an example:

            var audioAdapter = new PullAudioInputSink();

            // var speechConfig = SpeechConfig.FromSubscription(...);
            var audioConfig = AudioConfig.FromStreamInput(audioAdapter);

            using (var recognizer = new SpeechRecognizer(speechConfig, audioConfig))
            {
                recognizer.SessionStarted += (_, e) => Print($"Session started");
                recognizer.Recognizing += (_, e) => Print($"Recognizing: {e.Result.Text}");
                for (int i = 0; i < 5; i++)
                {
                    Print($"Listening...");
                    audioAdapter.Reset();
                    audioAdapter.PushData(File.ReadAllBytes("123.raw"));
                    var result = await recognizer.RecognizeOnceAsync();
                    await recognizer.StopContinuousRecognitionAsync();
                    Print($"Result: {result.Text}");
                }
            }

Does about what you'd expect:

[30:10.954] Listening...
[30:10.975] Session started
[30:12.345] Recognizing: 1
[30:12.436] Recognizing: 12
[30:12.548] Recognizing: 123
[30:12.674] Result: 123
[30:12.675] Listening...
[30:12.676] Session started
[30:14.560] Result:
[30:14.561] Listening...
[30:14.562] Session started
[30:15.689] Recognizing: 1
[30:15.780] Recognizing: 12
[30:15.873] Recognizing: 123
[30:16.058] Result: 123
[30:16.058] Listening...
[30:16.060] Session started
[30:17.310] Recognizing: 1
[30:17.420] Recognizing: 12
[30:17.511] Recognizing: 123
[30:17.621] Result: 123
[30:17.621] Listening...
[30:17.624] Session started
[30:18.635] Recognizing: 1
[30:18.744] Recognizing: 12
[30:18.839] Recognizing: 123
[30:18.931] Result: 123

Part of the benefit of using the WebSocket-based approach vs. REST is the presence of those nice intermediate results. Especially if you're able to do real-time input, being able to give feedback as the user is speaking can be pretty nice!
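
In case that file moves around, a bare-bones version of the idea could look something like this (a hypothetical sketch, not the real PullAudioInputSink -- the class name is made up and it skips details like WAV-header handling):

            // Hypothetical minimal adapter: buffers pushed audio and feeds PCM
            // silence once drained, so one recognizer can serve many utterances.
            public class BufferedAudioSink : PullAudioInputStreamCallback
            {
                private readonly object sync = new object();
                private MemoryStream buffer = new MemoryStream();

                // Queue raw PCM bytes for the recognizer to consume.
                public void PushData(byte[] data)
                {
                    lock (this.sync)
                    {
                        var readPosition = this.buffer.Position;
                        this.buffer.Seek(0, SeekOrigin.End);
                        this.buffer.Write(data, 0, data.Length);
                        this.buffer.Position = readPosition;
                    }
                }

                // Discard unread audio before starting the next utterance.
                public void Reset()
                {
                    lock (this.sync)
                    {
                        this.buffer = new MemoryStream();
                    }
                }

                // Called by the SDK. Returning 0 would end the stream (and the
                // session), so once the buffer is drained we return silence and
                // let the service finalize on end-of-speech instead.
                public override int Read(byte[] dataBuffer, uint size)
                {
                    lock (this.sync)
                    {
                        var read = this.buffer.Read(dataBuffer, 0, (int)size);
                        if (read > 0)
                        {
                            return read;
                        }
                    }
                    Array.Clear(dataBuffer, 0, (int)size);
                    return (int)size;
                }
            }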

Saturn49 commented 3 years ago

Thanks very much for the reply and sample code. It is good to know that I can use RecognizeOnceAsync repeatedly. It is unfortunate that it has to be fed in real time. In my first attempt I used the "Recognize from in-memory stream" sample code, which didn't close the audioInputStream (https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-speech-to-text?tabs=script%2Cbrowser%2Cwindowsinstall&pivots=programming-language-csharp), and it kept timing out since the service was apparently looking for a silence that didn't exist.
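
For anyone else who hits that: with a push stream, Close() is what signals end-of-audio, e.g. (audioBytes is a placeholder for the pre-recorded utterance):

            audioInputStream.Write(audioBytes);
            audioInputStream.Close();  // signals end-of-audio; no trailing silence needed
            var result = await recognizer.RecognizeOnceAsync();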

Anyway, I may switch back to something like this (with the aforementioned PhraseListGrammar) to improve accuracy, but honestly the REST solution is working fine - from detecting the end-of-word silence (my own code) to getting a response from the API is around 300ms, which makes for a fairly timely response (correct or incorrect). I hard-coded a few substitutions for words it was getting wrong with my son ("ken" instead of "10"), but the accuracy was quite acceptable even without that.
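
For illustration, that substitution layer is just a tiny lookup over the recognized text (a sketch -- the names here are made up, and "ken" -> "10" is the one entry mentioned above):

            // Hypothetical sketch of the hard-coded substitutions described above.
            private static readonly Dictionary<string, string> Substitutions =
                new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
                {
                    ["ken"] = "10",
                };

            private static string NormalizeAnswer(string recognized)
            {
                var trimmed = recognized.Trim().TrimEnd('.', '?', '!');
                return Substitutions.TryGetValue(trimmed, out var corrected) ? corrected : trimmed;
            }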

trrwilson commented 3 years ago

That's great that it's working well! The "ken"/"10" style of substitutions can be quite aggravating but it sounds like you found a good workable approach to handle it. PhraseListGrammar is overkill if that does a good enough job, after all.

Please let us know if you hit anything else--I'll close this issue to keep things tidy, but stay in touch!