Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Async speech recognition from file with Java SDK is segmented after every minute #2212

Closed sc-nm closed 7 months ago

sc-nm commented 8 months ago

Hi,

Describe the bug
I need to transcribe audio that is longer than 1 minute with the Java SDK, but it seems that the result is always segmented at every full minute. I can reproduce the issue with the Java example code for async speech recognition from file: https://github.com/Azure-Samples/cognitive-services-speech-sdk/blob/master/samples/java/jre/console/src/com/microsoft/cognitiveservices/speech/samples/console/Main.java (menu 5). I've set the segmentation silence timeout to the maximum (5000 ms) with

config.setProperty(PropertyId.Speech_SegmentationSilenceTimeoutMs, String.valueOf(5000L));

My test audio (attached) does not contain any parts with silence longer than a few hundred milliseconds, but the result is still segmented, at exactly every full minute. I have not found anything in the SDK documentation on how to change or disable this behavior, so I assume this is a bug.

To Reproduce

Expected behavior
One single recognized event with the full recognized text (no segmentation).

Version of the Cognitive Services Speech SDK
1.34.0

Platform, Operating System, and Programming Language

Additional context
speech.log attached: speech.log.zip

rhurey commented 8 months ago

Thanks for the audio and the session ID; they made figuring this out easy.

When it comes to phrases, there are a few different things that come into play for determining when the Speech Service will segment the audio and return a phrase. As you've seen, you can control the quiet time between speech using the segmentation timeout to lengthen (or shorten) how quickly the service will segment phrases.

But segmentation time isn't the only factor. There are also parameters around phrase length, which aren't exposed for the SDK to control. The models used for speech recognition are at their most accurate and lowest latency for phrases of a typical length for speech. As a result, there is a maximum phrase length. Right now that's set at 60 seconds, as you've noticed.

We'd intended the segmentation timeout to help developers make the tradeoff between scenarios that call for a lower-latency final result and scenarios where end users were pausing during speech and segmentation was happening too early. (Or, conversely, where users weren't pausing very much and more aggressive segmentation was needed.)

I'd love to hear any feedback on where the existing APIs aren't meeting your needs or have higher friction than is preferable.

sc-nm commented 8 months ago

Hi,

thanks for the answer and the clarification. As far as I understand now, phrases are just not meant to be that long. Usually only as long as a sentence, maybe two, until the voice stops for a few hundred milliseconds. The model can never really "know" whether a speaker's pause is actually the end of a sentence or phrase, so there must be some kind of limit, and that is the segmentation silence timeout. And that also highly depends on the person and how they speak.

I guess I have two options now:

- keep using the SDK with continuous recognition and concatenate the recognized phrases into the full transcript myself, or
- use Batch Transcription for longer audio files.

If there are any other options for properly transcribing longer audio files, please let me know.

Best regards

rhurey commented 8 months ago

That's an accurate summary.

With the client-side SDK you'll need to concatenate multiple phrases together. The SDK was designed around real-time interactive scenarios, and its API surface favors those. It can transcribe longer audio files, but there isn't an API that takes a file in and returns the complete transcription in one call.
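As a rough sketch (not production code; key, region, and file name are placeholders), concatenating the phrases from continuous recognition could look something like this:

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Semaphore;

import com.microsoft.cognitiveservices.speech.PropertyId;
import com.microsoft.cognitiveservices.speech.ResultReason;
import com.microsoft.cognitiveservices.speech.SpeechConfig;
import com.microsoft.cognitiveservices.speech.SpeechRecognizer;
import com.microsoft.cognitiveservices.speech.audio.AudioConfig;

public class LongFileTranscription {
    public static void main(String[] args) throws InterruptedException, ExecutionException {
        // Placeholders: replace with your Speech resource key, region, and audio file.
        SpeechConfig config = SpeechConfig.fromSubscription("YourSubscriptionKey", "YourServiceRegion");
        config.setSpeechRecognitionLanguage("en-US");
        config.setProperty(PropertyId.Speech_SegmentationSilenceTimeoutMs, String.valueOf(5000L));

        AudioConfig audioInput = AudioConfig.fromWavFileInput("long-audio.wav");
        SpeechRecognizer recognizer = new SpeechRecognizer(config, audioInput);

        StringBuilder transcript = new StringBuilder();
        Semaphore stopped = new Semaphore(0);

        // Each final phrase arrives as its own "recognized" event; append them in order.
        recognizer.recognized.addEventListener((s, e) -> {
            if (e.getResult().getReason() == ResultReason.RecognizedSpeech) {
                if (transcript.length() > 0) {
                    transcript.append(' ');
                }
                transcript.append(e.getResult().getText());
            }
        });

        // The session stops when the end of the file is reached (or on error).
        recognizer.sessionStopped.addEventListener((s, e) -> stopped.release());
        recognizer.canceled.addEventListener((s, e) -> stopped.release());

        recognizer.startContinuousRecognitionAsync().get();
        stopped.acquire();
        recognizer.stopContinuousRecognitionAsync().get();

        System.out.println("Full transcript: " + transcript);

        recognizer.close();
        audioInput.close();
        config.close();
    }
}
```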

If you have files of audio to transcribe, Batch Transcription is an option.
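If it helps, submitting a batch job is a single REST call. Here's a rough sketch against the v3.1 transcriptions endpoint (key, region, and audio URL are placeholders); after the call, you poll the returned "self" URL until the job reports "Succeeded" and then fetch its result files:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class BatchTranscriptionSubmit {
    public static void main(String[] args) throws Exception {
        // Placeholders: Speech resource key/region and a URL the service can read
        // (for example a blob URL with a SAS token).
        String key = "YourSubscriptionKey";
        String region = "YourServiceRegion";
        String audioUrl = "https://example.com/my-long-audio.wav";

        String body = "{"
            + "\"displayName\": \"long-audio-test\","
            + "\"locale\": \"en-US\","
            + "\"contentUrls\": [\"" + audioUrl + "\"]"
            + "}";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("https://" + region
                + ".api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"))
            .header("Ocp-Apim-Subscription-Key", key)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());

        // A 201 response contains a "self" URL for the new transcription; poll it
        // until the status is "Succeeded", then GET {self}/files to download results.
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```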