Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

sessionStarted is not triggered sometimes on Android platform #752

Closed — jzhw0130 closed this issue 3 years ago

jzhw0130 commented 4 years ago

Describe the bug Create a SpeechRecognizer, call startContinuousRecognitionAsync, speak something, wait for the recognized result, then call stopContinuousRecognitionAsync. Repeat the start → stop process. After N repetitions, calling start no longer triggers sessionStarted, and the other listeners stop firing as well. After that, the speech service is no longer available and I have to kill the app.

To Reproduce Steps to reproduce the behavior:

  1. speechRecognizer = new SpeechRecognizer()
  2. Add listeners: speechRecognizer.sessionStarted.addEventListener() ...
  3. speechRecognizer.startContinuousRecognitionAsync()
  4. wait for recognized result
  5. speechRecognizer.stopContinuousRecognitionAsync()
  6. repeat step 3 - step 5

Expected behavior Repeating step 3 through step 5, the speech service keeps working and no failure occurs.

Version of the Cognitive Services Speech SDK 'com.microsoft.cognitiveservices.speech:client-sdk:1.13.0'

Platform, Operating System, and Programming Language

jzhw0130 commented 4 years ago

My first question is: in which cases is sessionStarted not triggered after calling startContinuousRecognitionAsync()?

glharper commented 4 years ago

Hi, thanks for logging this issue. To clarify, are you waiting for the return from speechRecognizer.stopContinuousRecognitionAsync().get()? I did not notice a ".get()" part in step 5 above. Also, are you waiting to receive a canceled event before repeating steps 3-5?

jzhw0130 commented 4 years ago

@glharper Thanks for your help. I don't wait for the return from speechRecognizer.stopContinuousRecognitionAsync().get(), since .get() blocks the current thread, right? I do listen for the canceled event via speechRecognizer.canceled.addEventListener(), but only use it when no recognized result is returned.
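It's true that .get() blocks the calling thread, but on Android the usual pattern is to await it on a worker thread rather than skip it. A minimal stdlib-only sketch of that pattern — the SDK's Future-returning stop call is simulated here with a delayed CompletableFuture, and stopAsync is a hypothetical stand-in, not the real API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class AwaitOffUiThread {
    // Hypothetical stand-in for recognizer.stopContinuousRecognitionAsync():
    // returns a Future that completes a little later, like the real SDK call.
    static Future<Void> stopAsync(ScheduledExecutorService sched) {
        CompletableFuture<Void> f = new CompletableFuture<>();
        sched.schedule(() -> f.complete(null), 20, TimeUnit.MILLISECONDS);
        return f;
    }

    public static boolean awaitStop() throws Exception {
        ScheduledExecutorService sched = Executors.newSingleThreadScheduledExecutor();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        // .get() blocks, so call it from a worker thread, never the UI thread.
        Future<Boolean> done = worker.submit(() -> {
            stopAsync(sched).get(2, TimeUnit.SECONDS); // safe to block here
            return true;
        });
        boolean ok = done.get(3, TimeUnit.SECONDS);
        sched.shutdown();
        worker.shutdown();
        return ok;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(awaitStop());
    }
}
```

The UI thread stays free while the worker waits for the stop call to complete, so the app still knows exactly when it is safe to restart.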

glharper commented 4 years ago

@jzhw0130 sessionStopped (not canceled) is the event to wait for before restarting recognition. Also, make sure the stopContinuousRecognitionAsync call has completed (awaited in whatever manner you choose) before restarting; not doing so can put the recognizer in an undefined state. Are you able to reproduce after taking these measures?
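The ordering described above can be sketched with plain java.util.concurrent primitives. MockRecognizer below is a hypothetical stand-in for the SDK recognizer (not the real API) that simulates the delayed sessionStopped event; the key point is gating each restart on a latch that the event counts down:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Stand-in for the Speech SDK recognizer: fires a "sessionStopped"
// callback some time after stop is requested, as the real SDK does.
class MockRecognizer {
    private final ExecutorService pool = Executors.newSingleThreadExecutor();
    volatile Runnable sessionStopped;

    void startContinuousRecognition() { /* session running */ }

    void stopContinuousRecognition() {
        // The service may lag; simulate a delayed sessionStopped event.
        pool.submit(() -> {
            try { Thread.sleep(50); } catch (InterruptedException ignored) { }
            Runnable cb = sessionStopped;
            if (cb != null) cb.run();
        });
    }

    void shutdown() { pool.shutdown(); }
}

public class RestartGate {
    public static int runRounds(int rounds) throws InterruptedException {
        MockRecognizer rec = new MockRecognizer();
        int completed = 0;
        for (int i = 0; i < rounds; i++) {
            CountDownLatch stopped = new CountDownLatch(1);
            rec.sessionStopped = stopped::countDown; // listener per round
            rec.startContinuousRecognition();
            // ... speak / receive recognized results ...
            rec.stopContinuousRecognition();
            // Key point: block the restart until sessionStopped fires.
            if (!stopped.await(5, TimeUnit.SECONDS)) break;
            completed++;
        }
        rec.shutdown();
        return completed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runRounds(3));
    }
}
```

With the real SDK the latch would be counted down inside the recognizer's sessionStopped event listener, and the stop call itself awaited as described above.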

jzhw0130 commented 4 years ago

@glharper Thanks for your support. I will try your suggestion. I still want to know in which cases sessionStarted is not triggered after calling startContinuousRecognitionAsync().

jzhw0130 commented 4 years ago

@glharper BTW, is there a way to query the recognizer's status directly, other than listening for events?

jzhw0130 commented 4 years ago

@glharper As you suggested, I encountered 2 weird issues:

  1. It takes about 1 minute to get sessionStopped after calling stopContinuousRecognitionAsync()!
  2. I get a cached recognized result from the previous round.

Refer to the logs:

// ROUND 1
2020-08-02 13:53:19.622 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: MicrophoneStream startRecording()
2020-08-02 13:53:19.638 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: startContinuousRecognitionAsync()
2020-08-02 13:53:19.652 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> callback sessionStarted triggered
2020-08-02 13:53:19.815 16972-17058/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 19.194327739455797
...
2020-08-02 13:53:49.618 16972-17058/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 24.44538076510554
2020-08-02 13:53:49.664 16972-17067/com.mosaix.sdkdemo V/MosaixSDK: MicrophoneStream stopRecording()
2020-08-02 13:53:49.664 16972-17067/com.mosaix.sdkdemo I/MosaixSDK: stopContinuousRecognitionAsync()
2020-08-02 13:54:29.196 16972-17046/com.mosaix.sdkdemo W/MosaixSDK: ---> callback recognizing triggered. !!!!!
2020-08-02 13:54:37.299 16972-17046/com.mosaix.sdkdemo W/MosaixSDK: ---> callback recognizing triggered. !!!!!
2020-08-02 13:54:37.819 16972-17046/com.mosaix.sdkdemo W/MosaixSDK: ---> callback recognized triggered. !!!!!
2020-08-02 13:54:48.133 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> SessionStoppedHandler. !!!!!

// ROUND 2
2020-08-02 13:56:17.248 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: MicrophoneStream startRecording()
2020-08-02 13:56:17.274 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: startContinuousRecognitionAsync()
2020-08-02 13:56:17.290 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> callback sessionStarted triggered
2020-08-02 13:56:17.459 16972-17669/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 25.26270201624833 // Speak "how come"
…
2020-08-02 13:56:19.459 16972-17669/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 34.33886589857751
2020-08-02 13:56:19.538 16972-17046/com.mosaix.sdkdemo W/MosaixSDK: ---> callback ResultReason.NoMatch triggered
2020-08-02 13:56:19.539 16972-17046/com.mosaix.sdkdemo I/MosaixSDK: MicrophoneStream stopRecording()
2020-08-02 13:56:19.589 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: stopContinuousRecognitionAsync().
2020-08-02 13:56:20.905 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> SessionStoppedHandler.

// ROUND 3
2020-08-02 13:56:28.279 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: MicrophoneStream startRecording()
2020-08-02 13:56:28.302 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: startContinuousRecognitionAsync()
2020-08-02 13:56:28.314 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> callback sessionStarted triggered
2020-08-02 13:56:28.481 16972-18064/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 27.82586154758063 // Never speak
…
2020-08-02 13:56:29.785 16972-18064/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 34.708136415663276
2020-08-02 13:56:29.838 16972-17046/com.mosaix.sdkdemo I/MosaixSDK: callback recognizing triggered: how // But received recognized result from the past round
2020-08-02 13:56:29.979 16972-17046/com.mosaix.sdkdemo I/MosaixSDK: callback recognizing triggered: how come
2020-08-02 13:56:29.982 16972-17046/com.mosaix.sdkdemo I/MosaixSDK: callback recognized triggered: How come
2020-08-02 13:56:30.015 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: stopContinuousRecognitionAsync()
2020-08-02 13:56:30.015 16972-17046/com.mosaix.sdkdemo I/MosaixSDK: MicrophoneStream stopRecording()
2020-08-02 13:56:30.971 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> SessionStoppedHandler

glharper commented 4 years ago

@jzhw0130 Thanks for your work reproducing this. Have you been able to reproduce the original issue (sessionStarted not called) in your test? Answers to your questions/issues:

jzhw0130 commented 4 years ago

@glharper Thanks for your quick response.

Since there's a long delay between stopContinuousRecognitionAsync() and the SessionStoppedHandler being triggered, calling startContinuousRecognitionAsync() before the SessionStoppedHandler fires reproduces the original issue (sessionStarted not called).

glharper commented 4 years ago

@jhakulin I've added a bug with a quickly started session triggering recognized events from a previous session. Is there also a performance issue here with sessionStopped not being sent for a minute after stopContinuousRecognition has been called?

rhurey commented 4 years ago

So internally, what happens during StopContinuousRecognitionAsync is that the SDK tells the Speech Service it's done sending audio, and then waits either for the Speech Service to return the results for all the audio that's been sent, or for 10 seconds to pass.

Normally this delay is small as the Speech Service isn't more than a few hundred ms behind the SDK in audio recognition, especially for a microphone source where audio is always available.

Sometimes that's not the case, and things get less predictable.

From the log, @jzhw0130, do you happen to be sending audio in through a Push or a Pull stream? The rules around preserving unsent audio between start/stop calls are different for files and streams vs. microphones. I see your log has entries like "MicrophoneStream startRecording", and I'm wondering if the app is handling the microphone directly and pushing the audio into the SDK?

I've got a few theories that I think at least explain the behaviors you're seeing, but they depend on how the audio is being delivered to the SDK.

jzhw0130 commented 4 years ago

@rhurey Thanks for your help. Yep, I create the AudioRecord instance, and set it as the stream input of speechRecognizer.

Refer to code:

public class MicrophoneStream extends PullAudioInputStreamCallback {
    // AudioRecord pulls raw PCM from the device microphone.
    private final AudioRecord recorder;
    ...
    public MicrophoneStream() {
        // af is the AudioFormat configured elsewhere (sample rate, channel mask, PCM encoding).
        this.recorder = new AudioRecord.Builder()
                .setAudioSource(MediaRecorder.AudioSource.VOICE_RECOGNITION)
                .setAudioFormat(af)
                .build();
    }

    @Override
    public int read(byte[] bytes) {
        // The SDK calls this to pull audio; AudioRecord.read returns the
        // number of bytes read, or a negative error code.
        long ret = this.recorder.read(bytes, 0, bytes.length);
        return (int) ret;
    }
    ...
}

microphoneStream = new MicrophoneStream();
speechRecognizer = new SpeechRecognizer(config, preferredLanguage, AudioConfig.fromStreamInput(microphoneStream));
jzhw0130 commented 4 years ago

@rhurey BTW, I reproduced the issue and captured the Azure audio log; please help check, thanks. I tested twice: it worked the first time but not the second.

2020-08-10 09_57_36 -azure-speech-audio.txt

rhurey commented 4 years ago

Log was useful...

So in the log we have 2 phrases and a lot of silence.

The first phrase everything looks ok, the duration of transmitted audio was ~5 seconds, and when stop was called the service had recognized ~3.1s of speech and acknowledged the full 5s of audio. (Mostly, about 640b (20ms) remained in the transmit buffer.)

When the 2nd session started up, that remaining 640b was sent and audio was pulled in. There's about 1.5s of silence and 2.5s of speech. At this point latency with the service is running ~1s, while on the high side not horrible. About 20s later the service reports that it's been receiving silence for 20s. Latency is running ~650ms behind real time. ~25s later we get a message that the service has seen 5s of silence. At this point things are running ~22s behind real time. The 5s silence notifications go on until the 2 minute mark when the SDK reconnects. Over the next 30 seconds recognition catches back up and the service winds up ~330ms behind.

What's happening is that the recognition in the speech service is (for some reason) running behind occasionally. When this occurs the client SDK will endeavor to make sure all speech is recognized by replaying buffered speech, and has some rules around how that works.

When a call is made to stop recognition, the SDK will stop sending audio and wait for the Speech Service to catch up and report it's done with recognition.

There are three (rough) criteria for the Speech Service to report being done with recognition:

  1. The Speech Service hits a marker in the audio stream placed by the SDK to indicate quitting. This marker is placed when there's no more audio (because end of file was reached) or a call to stop recognizing was made.
  2. The maximum silence before or after speech has been reached (typically defaulting to around 20s).
  3. The recognition requested was for only a single phrase.

If #2 happens and continuous speech recognition has been requested, the SDK will signal the Speech Service to continue recognition anyway.
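Those silence limits in criterion 2 are tunable from the client. A hedged sketch, assuming the SpeechServiceConnection_*SilenceTimeoutMs entries below exist in the PropertyId enum of the client-sdk version in use (values are in milliseconds; "<key>" and "<region>" are placeholders):

```java
// Sketch only: tune the service-side silence timeouts via SpeechConfig.
// The PropertyId names are assumptions — verify them against the
// com.microsoft.cognitiveservices.speech version you depend on.
SpeechConfig config = SpeechConfig.fromSubscription("<key>", "<region>");
config.setProperty(PropertyId.SpeechServiceConnection_InitialSilenceTimeoutMs, "5000");
config.setProperty(PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs, "1000");
```

A shorter end-silence timeout can make the service report "done" sooner after speech ends, at the cost of cutting off slow speakers.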

With that information as a foundation...

// ROUND 1
2020-08-02 13:53:19.622 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: MicrophoneStream startRecording()
2020-08-02 13:53:19.638 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: startContinuousRecognitionAsync()
2020-08-02 13:53:19.652 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> callback sessionStarted triggered
2020-08-02 13:53:19.815 16972-17058/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 19.194327739455797
...
2020-08-02 13:53:49.618 16972-17058/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 24.44538076510554
2020-08-02 13:53:49.664 16972-17067/com.mosaix.sdkdemo V/MosaixSDK: MicrophoneStream stopRecording()
2020-08-02 13:53:49.664 16972-17067/com.mosaix.sdkdemo I/MosaixSDK: stopContinuousRecognitionAsync()
2020-08-02 13:54:29.196 16972-17046/com.mosaix.sdkdemo W/MosaixSDK: ---> callback recognizing triggered. !!!!!
2020-08-02 13:54:37.299 16972-17046/com.mosaix.sdkdemo W/MosaixSDK: ---> callback recognizing triggered. !!!!!
2020-08-02 13:54:37.819 16972-17046/com.mosaix.sdkdemo W/MosaixSDK: ---> callback recognized triggered. !!!!!
2020-08-02 13:54:48.133 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> SessionStoppedHandler. !!!!!

Here, the odds are the Speech Service got behind in the first recognition; there's ~30 seconds of audio it recorded, but it's not clear how much of it was needed to complete the recognition.

// ROUND 2
2020-08-02 13:56:17.248 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: MicrophoneStream startRecording()
2020-08-02 13:56:17.274 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: startContinuousRecognitionAsync()
2020-08-02 13:56:17.290 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> callback sessionStarted triggered
2020-08-02 13:56:17.459 16972-17669/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 25.26270201624833 // Speak "how come"
…
2020-08-02 13:56:19.459 16972-17669/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 34.33886589857751
2020-08-02 13:56:19.538 16972-17046/com.mosaix.sdkdemo W/MosaixSDK: ---> callback ResultReason.NoMatch triggered
2020-08-02 13:56:19.539 16972-17046/com.mosaix.sdkdemo I/MosaixSDK: MicrophoneStream stopRecording()
2020-08-02 13:56:19.589 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: stopContinuousRecognitionAsync().
2020-08-02 13:56:20.905 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> SessionStoppedHandler.

Here, I suspect with the InitialSilenceTimeout value of 5s, the NoMatch is in regards to cached audio from the previous recognition being recognized.

// ROUND 3
2020-08-02 13:56:28.279 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: MicrophoneStream startRecording()
2020-08-02 13:56:28.302 16972-16972/com.mosaix.sdkdemo I/MosaixSDK: startContinuousRecognitionAsync()
2020-08-02 13:56:28.314 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> callback sessionStarted triggered
2020-08-02 13:56:28.481 16972-18064/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 27.82586154758063 // Never speak
…
2020-08-02 13:56:29.785 16972-18064/com.mosaix.sdkdemo V/MosaixSDK: azure audio level: 34.708136415663276
2020-08-02 13:56:29.838 16972-17046/com.mosaix.sdkdemo I/MosaixSDK: callback recognizing triggered: how // But received recognized result from the past round
2020-08-02 13:56:29.979 16972-17046/com.mosaix.sdkdemo I/MosaixSDK: callback recognizing triggered: how come
2020-08-02 13:56:29.982 16972-17046/com.mosaix.sdkdemo I/MosaixSDK: callback recognized triggered: How come
2020-08-02 13:56:30.015 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: stopContinuousRecognitionAsync()
2020-08-02 13:56:30.015 16972-17046/com.mosaix.sdkdemo I/MosaixSDK: MicrophoneStream stopRecording()
2020-08-02 13:56:30.971 16972-17046/com.mosaix.sdkdemo V/MosaixSDK: ---> SessionStoppedHandler

And now we see the last of the unrecognized audio get recognized.

I'll speak with some folks from our service team and see if we can determine what happened to the session from the log you sent on the service side. The session ID is the important part from the client to trace things through the service.

jzhw0130 commented 4 years ago

@rhurey Thank you for your detailed explanation. Let's wait for the investigation from your service team.

rhurey commented 4 years ago

Got some information to add.

There's a ~23 second gap in audio arrival at the service that isn't accounted for in the logs where the platform under the SDK was given audio packets at a consistent rate. This gap accounts for the large fall behind real-time transcription.

Is there any connectivity tracing available from the device? I noticed you have another issue with problems connecting...

jzhw0130 commented 4 years ago

@rhurey

They are the same issue at different levels: the current issue is at the APPLICATION level, the other is at the SDK level.

When the issue occurred, I tried playing YouTube videos and it worked, so the network seems OK.

rhurey commented 4 years ago

They all feel slightly related based on SDK behaviors.

Looking at things, what we see in the log / session ID for this issue is likely caused by two factors:

  1. The service got behind due to a 20 second gap in audio arrival.
  2. When stopping, the SDK waits 10 seconds for the service and then returns, but can still receive events.

Looking at issue #696, calling Close() will internally call Stop() to put an internal state machine into a known state before destroying resources, which can be subject to the same delay.

Issue #700 is likely also fallout from the same behaviors.

So.... What to do....

@glecaros picked up a task to improve the stop algorithm, to both return faster when the service isn't responding quickly and do a better job of terminating the underlying connection to prevent zombie events.

I opened another task to look at a way to get the volume information out of the SDK so you wouldn't need to use a stream to hand the audio in; I suspect that may help with the audio caching.

As for the gaps in audio arrival, we're a bit stuck here. We'd need to look at more sessions with problems to see if a pattern can be found.

jzhw0130 commented 4 years ago

@rhurey Thanks for your help.

BTW, we still need to keep the stream-audio feature, since our TV application supports two audio inputs: one from the remote control (mic), the other from a mobile application installed on the phone (audio input stream).

pankopon commented 3 years ago

Closing this as we now have a work item for this issue on the backlog to be prioritized and planned. We'll update with information about the release schedule when known. Currently there is a tentative target for some improvement on this and related issues in the Speech SDK 1.16 release next year.