Closed Saturn49 closed 3 years ago
Thanks for reporting this issue!
We recently identified and fixed something that's quite likely related to the long Dispose() you found. What target framework are you using (e.g. .NET Core, .NET Framework 4.7.2)? I ask because the identified issue appears to be specific to .NET Core -- or at least it doesn't repro on plain "legacy" .NET Framework environments, which is why it was missed.
If this is the same problem:
If the problem persists with 1.15+ or 1.10, we'd like to get more information to further investigate. If you could collect a log via the steps located here, it'd be a huge help: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/how-to-use-logging
Thanks again!
I'm using a WinForms app targeting .NET Framework 4.7.2.
My workaround was to abandon the Microsoft-provided library and go straight to the REST api directly.
I could probably revert to the SDK and gather logs if that would be helpful.
ms_speech_sdk.log -- log attached. It should contain instances of the recognizer destructor taking 1135ms and 1118ms.
Thanks, the log looks great and pretty clearly confirms the same issue we worked on:
```
(728292): 9879ms SPX_DBG_TRACE_SCOPE_ENTER: thread_service.cpp:44 CSpxThreadService::Term
(728292): 10966ms SPX_DBG_TRACE_SCOPE_EXIT: thread_service.cpp:44 CSpxThreadService::Term
```
That's the phantom 1000-1100ms that gets eaten up. It's also great information that it's reproducing with WinForms independently of .NET Core -- we'll investigate that on our side, but I'm optimistic that neither 1.15 (very, very imminent) nor 1.10 will hit the delay.
Actually, I just spun up a Windows Forms app for the repro and confirmed both the problem (1.14 has ~1100ms per dispose) and the fix (1.15 will not do this) in that environment, also with 4.7.2. So I'm not just optimistic but confident it's the same problem. Whew!
Thanks for the quick response Travis! I'm glad your imminent release fixes this issue.
However, I suspect I'll stick with my raw REST implementation anyway, as it is still significantly faster to get a response (~200ms vs 500+ms) for my particular use case.
Depending on your scenario, something like the batch transcription (REST) APIs might make a lot of sense, too: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/batch-transcription
How many files/streams are you looping through? An advantage of the above (or simulating it yourself, though that's not recommended) is being able to parallelize easily (and not have to sequentially wait for things like connections to establish). Depending on your use case and priorities, we can probably recommend a good approach.
Startup with the full client SDK is a little longer (owing to establishment of a WebSocket connection for fast back-and-forth), but the tradeoff is getting much quicker responses for real-time audio such as intermediate responses. REST/batch is for sure the right approach for "offline" transcription of a bunch of saved audio, though.
My application is simple - math flash cards for kids. So it is single-word recognition (numbers only) with accurate but as fast of a response as possible. I had originally tried continuous dictation but I need answers finalized ASAP (and sending silence seemed wasteful and costly). The Speech-to-text REST API for short audio is what I've settled on: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/rest-speech-to-text#speech-to-text-rest-api-for-short-audio
I would love something that kept an established connection, but I need a way to send really short (~1 sec) audio every few seconds. I couldn't find a way to reuse a Recognizer though.
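For reference, the short-audio REST flow described above boils down to one authenticated POST per utterance. A minimal sketch, assuming the endpoint shape from the linked docs -- the region, key, file name, and `BuildEndpoint` helper here are placeholders, not anything from this thread:

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;

class ShortAudioRestSketch
{
    // Builds the short-audio endpoint for a given region and language
    // (shape taken from the REST docs linked above).
    public static string BuildEndpoint(string region, string language) =>
        $"https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language={language}";

    static async Task Main()
    {
        using var http = new HttpClient();
        http.DefaultRequestHeaders.Add("Ocp-Apim-Subscription-Key", "<your-key>");

        // ~1 second of 16kHz mono PCM in a WAV container keeps this quick.
        var content = new ByteArrayContent(File.ReadAllBytes("answer.wav"));
        content.Headers.TryAddWithoutValidation(
            "Content-Type", "audio/wav; codecs=audio/pcm; samplerate=16000");

        // One round trip per utterance; the JSON response carries DisplayText.
        var response = await http.PostAsync(BuildEndpoint("westus2", "en-US"), content);
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}
```

No connection is kept warm between calls, which is the tradeoff against the WebSocket-based recognizer discussed below.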
Got it -- thanks for the great detail! And cool app.
I do think we can derive a way for you to reuse the recognizer and I suspect that'll give you even better behavior, given you get to keep a "hot" connection. And full disclaimer, I'm not trying to sell one vs. the other--we do both!--nor sway you from anything if it's working well.
For reuse, the easiest approach, if it's not prohibitive to make work, is to use microphone audio input in the config (FromDefaultMicrophoneInput or omitted AudioConfig) and just call RecognizeOnceAsync repeatedly. E.g. something like this:
```csharp
using (var recognizer = new SpeechRecognizer(config))
{
    for (int i = 0; i < 5; i++)
    {
        Print($"Listening...");
        var result = await recognizer.RecognizeOnceAsync();
        Print($"Result: {result.Text}");
    }
}
```
should give something like:
```
[39:00.091] Listening...
[39:02.774] Result: 1.
[39:02.808] Listening...
[39:05.093] Result: 2.
[39:05.120] Listening...
[39:07.325] Result: 3.
[39:07.365] Listening...
[39:09.716] Result: Poor.
[39:09.757] Listening...
[39:11.805] Result: 5.
```
Side note: the "4" misreco can be helped out by using PhraseListGrammar as outlined here: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-speech-to-text?tabs=script%2Cbrowser%2Cwindowsinstall&pivots=programming-language-csharp#improve-recognition-accuracy
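For a numbers-only app like this, the phrase list can just be the expected answers. A minimal sketch using the SDK's PhraseListGrammar, assuming a `recognizer` like the one in the snippet above -- the helper name and answer range here are made up for illustration:

```csharp
using System.Collections.Generic;
using System.Linq;
using Microsoft.CognitiveServices.Speech;

static class PhraseListSketch
{
    // Pure helper: the answer strings to bias toward, e.g. "0".."12".
    public static IEnumerable<string> AnswerPhrases(int maxAnswer) =>
        Enumerable.Range(0, maxAnswer + 1).Select(n => n.ToString());

    // Attaches the phrases to an existing recognizer so "4" wins over "Poor".
    public static void AddAnswerPhrases(SpeechRecognizer recognizer, int maxAnswer)
    {
        var phraseList = PhraseListGrammar.FromRecognizer(recognizer);
        foreach (var phrase in AnswerPhrases(maxAnswer))
        {
            phraseList.AddPhrase(phrase);
        }
    }
}
```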
If you don't have an easy way to just let the SDK handle that aspect of it and you need to write a buffer from another source as your audio, you can still achieve the same effect in a somewhat more involved way. Writing an audio stream adapter that proxies the audio from multiple sources lets you use the same config/recognizer across multiple "source writes." This shows up in the samples in a few places, with one I'm familiar with coming from this not-yet-merged PR into our dialog sample: https://github.com/Azure-Samples/Cognitive-Services-Voice-Assistant/blob/user/travisw/keywordRecognizer/clients/csharp-uwp/UWPVoiceAssistantSample/AudioInput/PullAudioInputSink.cs
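The linked file is UWP-specific and fairly involved, but the buffering core such an adapter delegates to can be sketched in plain C#. This is a hypothetical minimal re-implementation, not the linked file: the class and method names are made up, and the SDK subclass (a PullAudioInputStreamCallback whose Read override would call ReadChunk) is omitted so the logic stands alone:

```csharp
using System.Collections.Concurrent;

// Minimal buffering core a pull-audio adapter could delegate to.
// Reset()/PushData() mirror the usage in the example further down.
public class AudioBufferSink
{
    private ConcurrentQueue<byte> _bytes = new ConcurrentQueue<byte>();

    // Discard any unread audio before pushing a fresh utterance.
    public void Reset() => _bytes = new ConcurrentQueue<byte>();

    // Queue audio bytes from whatever source you have.
    public void PushData(byte[] data)
    {
        foreach (var b in data) _bytes.Enqueue(b);
    }

    // Fills 'buffer' with up to buffer.Length queued bytes and returns
    // how many were written.
    public int ReadChunk(byte[] buffer)
    {
        int written = 0;
        while (written < buffer.Length && _bytes.TryDequeue(out var b))
        {
            buffer[written++] = b;
        }
        return written;
    }
}
```

One caveat: in a real adapter the read side should block until data arrives rather than return 0 on an empty queue, since returning 0 from the SDK callback signals end of stream.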
There's more going on there than what's strictly needed, but the general idea carries over. Using a hastily trimmed/modified version of the file linked above as an example:
```csharp
var audioAdapter = new PullAudioInputSink();
// var speechConfig = SpeechConfig.FromSubscription(...);
var audioConfig = AudioConfig.FromStreamInput(audioAdapter);

using (var recognizer = new SpeechRecognizer(speechConfig, audioConfig))
{
    recognizer.SessionStarted += (_, e) => Print($"Session started");
    recognizer.Recognizing += (_, e) => Print($"Recognizing: {e.Result.Text}");

    for (int i = 0; i < 5; i++)
    {
        Print($"Listening...");
        audioAdapter.Reset();
        audioAdapter.PushData(File.ReadAllBytes("123.raw"));
        var result = await recognizer.RecognizeOnceAsync();
        await recognizer.StopContinuousRecognitionAsync();
        Print($"Result: {result.Text}");
    }
}
```
This does about what you'd expect:
```
[30:10.954] Listening...
[30:10.975] Session started
[30:12.345] Recognizing: 1
[30:12.436] Recognizing: 12
[30:12.548] Recognizing: 123
[30:12.674] Result: 123
[30:12.675] Listening...
[30:12.676] Session started
[30:14.560] Result:
[30:14.561] Listening...
[30:14.562] Session started
[30:15.689] Recognizing: 1
[30:15.780] Recognizing: 12
[30:15.873] Recognizing: 123
[30:16.058] Result: 123
[30:16.058] Listening...
[30:16.060] Session started
[30:17.310] Recognizing: 1
[30:17.420] Recognizing: 12
[30:17.511] Recognizing: 123
[30:17.621] Result: 123
[30:17.621] Listening...
[30:17.624] Session started
[30:18.635] Recognizing: 1
[30:18.744] Recognizing: 12
[30:18.839] Recognizing: 123
[30:18.931] Result: 123
```
Part of the benefit of using the WebSocket-based approach vs. REST is the presence of those nice intermediate results. Especially if you're able to do real-time input, being able to give feedback as the user is speaking can be pretty nice!
Thanks very much for the reply and sample code. It's good to know that I can use RecognizeOnceAsync repeatedly, though it's unfortunate that the audio has to be fed in real time. In my first attempt I used the sample code that didn't close the audioInputStream: https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/get-started-speech-to-text?tabs=script%2Cbrowser%2Cwindowsinstall&pivots=programming-language-csharp ("Recognize from in-memory stream"), and it kept timing out since the service was apparently waiting for a silence that didn't exist.
Anyway, I may switch back to something like this (with the aforementioned PhraseListGrammar) to improve accuracy, but honestly the REST solution is working fine: from detecting the end-of-word silence (my own code) to getting a response from the API is around 300ms, which makes for a fairly timely response (correct or incorrect). I hard-coded a few substitutions for words it was getting wrong with my son ("ken" instead of "10"), but the accuracy was quite acceptable even without that.
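The hard-coded substitution idea above amounts to a small lookup table plus a little cleanup. A minimal sketch -- only the "ken"/"10" pair comes from this thread (plus "Poor"/"4" from the sample output earlier); everything else, including the class and method names, is invented for illustration:

```csharp
using System;
using System.Collections.Generic;

public static class AnswerNormalizer
{
    // Known misrecognitions mapped to the intended number. "ken" -> "10"
    // is the one mentioned above; "poor" -> "4" matches the earlier
    // sample output; the mapping is otherwise hypothetical.
    private static readonly Dictionary<string, string> Substitutions =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            ["ken"] = "10",
            ["poor"] = "4",
        };

    // Strips the trailing punctuation the service adds ("4.") and applies
    // the substitution table, so "Poor." normalizes to "4".
    public static string Normalize(string recognized)
    {
        var cleaned = recognized.Trim().TrimEnd('.', '!', '?');
        return Substitutions.TryGetValue(cleaned, out var corrected) ? corrected : cleaned;
    }
}
```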
That's great that it's working well! The "ken"/"10" style of substitutions can be quite aggravating, but it sounds like you found a good workable approach to handle it. PhraseListGrammar is overkill if that does a good enough job, after all.
Please let us know if you hit anything else--I'll close this issue to keep things tidy, but stay in touch!
I'm trying to create a C# app that needs to recognize single words very quickly. I've used RecognizeOnceAsync() on an in-memory stream and while the recognition works fine, the Microsoft.CognitiveServices.Speech.SpeechRecognizer can only be used once and takes 1100ms or more to destruct.
My code is fairly straightforward:
The result is surprising:
Is there any way to reuse a Recognizer? Any idea what's taking 1+ seconds to dispose of one?