Some question about speech , return nothing if audio begin with a long silence

RobinWitch commented 1 year ago

Package Name: azure.cognitiveservices.speech
Package Version: 1.28.0
Operating System: Ubuntu
Python Version: 3.10.11

Describe the bug: If the audio begins with about 10-20 seconds of silence, the session disconnects automatically and recognition stops.

The log is below:

SESSION STARTED: SessionEventArgs(session_id=049aabfc8ced4f11b216973d7d54f844)
CANCELED SpeechRecognitionCanceledEventArgs(session_id=049aabfc8ced4f11b216973d7d54f844, result=SpeechRecognitionResult(result_id=476e5db191d54de4a8eeaba4bcb798a4, text="", reason=ResultReason.Canceled))
CLOSING on SpeechRecognitionCanceledEventArgs(session_id=049aabfc8ced4f11b216973d7d54f844, result=SpeechRecognitionResult(result_id=476e5db191d54de4a8eeaba4bcb798a4, text="", reason=ResultReason.Canceled))
SESSION STOPPED SessionEventArgs(session_id=049aabfc8ced4f11b216973d7d54f844)
CLOSING on SessionEventArgs(session_id=049aabfc8ced4f11b216973d7d54f844)
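
The CANCELED line in this log does not say why the session was canceled. Below is a minimal sketch of a handler that prints the reason. It assumes the canceled event exposes `cancellation_details` with `reason` and `error_details` attributes, as the Speech SDK's canceled event arguments do. `describe_cancellation` is a hypothetical helper name, and the stand-in event in the demo is made up for illustration only; it is not real service output.

```python
from types import SimpleNamespace


def describe_cancellation(evt):
    """Build a one-line summary of a canceled event.

    Assumption: `evt.cancellation_details` has `reason` and `error_details`.
    """
    details = getattr(evt, "cancellation_details", None)
    if details is None:
        return "CANCELED (no cancellation details available)"
    summary = "CANCELED reason={}".format(details.reason)
    error_details = getattr(details, "error_details", "")
    if error_details:
        summary += " error_details={}".format(error_details)
    return summary


# Hypothetical wiring into the recognizer from the reproduction script:
# speech_recognizer.canceled.connect(lambda evt: print(describe_cancellation(evt)))

if __name__ == "__main__":
    # Stand-in event object, used only to exercise the helper without the SDK.
    fake_evt = SimpleNamespace(
        cancellation_details=SimpleNamespace(reason="Error", error_details="example")
    )
    print(describe_cancellation(fake_evt))
```

With the reason and error details in the log, it should be easier to tell whether the cancellation comes from a timeout, a connection problem, or the end of the audio stream.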

To Reproduce: Here is my code:

import azure.cognitiveservices.speech as speechsdk
import time
import os
os.environ["HTTP_PROXY"] = "192.168.0.93:7890"
os.environ["HTTPS_PROXY"] = "192.168.0.93:7890"
AZURE_API_KEY = "YOUR_SPEECH_KEY"  # placeholder; never post a real subscription key
AZURE_REGION = "eastus"

def from_file(file):
    speech_config = speechsdk.SpeechConfig(subscription=AZURE_API_KEY, region=AZURE_REGION)
    speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_EndSilenceTimeoutMs, "400000")
    speech_config.set_property(speechsdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, "400000")
    speech_config.set_property(speechsdk.PropertyId.SpeechServiceConnection_InitialSilenceTimeoutMs, "400000")
    speech_config.set_property(speechsdk.PropertyId.Conversation_Initial_Silence_Timeout, "400000")
    audio_config = speechsdk.AudioConfig(filename=file)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    # The pronunciation assessment service has a longer default end silence timeout (5 seconds) than normal STT
    # as the pronunciation assessment is widely used in education scenario where kids have longer break in reading.
    # You can adjust the end silence timeout based on your real scenario.

    done = False
    text=""
    def session_recognized(evt):
        nonlocal text
        # with self.lock:
        text+=evt.result.text

    def stop_cb(evt):
        print('CLOSING on {}'.format(evt))
        speech_recognizer.stop_continuous_recognition()
        nonlocal done
        done = True

    speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
    speech_recognizer.recognized.connect(session_recognized)
    speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
    speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
    speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))

    speech_recognizer.session_stopped.connect(stop_cb)
    speech_recognizer.canceled.connect(stop_cb)

    speech_recognizer.start_continuous_recognition()
    while not done:
        time.sleep(.5)

    #result = speech_recognizer.recognize_once_async().get()
    print(text)
    return text

wav_path ="xxx.wav"
text = from_file(wav_path)
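
To confirm how much silence a file really starts with before sending it to the service, here is a minimal sketch using only the Python standard library. It assumes a 16-bit PCM WAV file. The function name `leading_silence_seconds`, the amplitude threshold of 500, and the chunk size are illustrative assumptions, not values taken from the Speech SDK.

```python
import array
import os
import sys
import tempfile
import wave


def leading_silence_seconds(path, threshold=500, chunk_frames=1600):
    """Return the seconds elapsed before the first sample louder than `threshold`.

    Assumption: 16-bit PCM WAV. If no sample exceeds the threshold,
    the full duration of the file is returned.
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("only 16-bit PCM is handled in this sketch")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames_seen = 0
        while True:
            data = wav.readframes(chunk_frames)
            if not data:
                break
            samples = array.array("h", data)
            if sys.byteorder == "big":
                samples.byteswap()  # WAV samples are little-endian
            for i, sample in enumerate(samples):
                if abs(sample) > threshold:
                    # Convert interleaved sample index to a frame index.
                    return (frames_seen + i // channels) / rate
            frames_seen += len(samples) // channels
        return frames_seen / rate


def write_demo_wav(path, silence_s=1, rate=16000):
    """Write a mono test file: `silence_s` seconds of zeros, then one second of a loud square wave."""
    samples = array.array("h", [0] * (silence_s * rate))
    samples.extend([8000, -8000] * (rate // 2))
    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(samples.tobytes())


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
    write_demo_wav(demo_path)
    print(leading_silence_seconds(demo_path))
```

Running `leading_silence_seconds` on the attached file would show whether the leading silence is in the 10-20 second range described in this report.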

Expected behavior: The audio is recognized successfully and converted to text.

Additional context: The WAV file is attached here: 001_Neutral_0.zip

kashifkhan commented 1 year ago

Thank you for the feedback, @RobinWitch. We will investigate and get back to you as soon as possible.

kristapratico commented 1 year ago

@RobinWitch, the Speech SDK does not reside in this repo. Please move your issue to https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues, where the Speech maintainers will be able to help. Thanks!

Azure / azure-sdk-for-python

Some question about speech , return nothing if audio begin with a long silence #32088