Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

How to print detailed output of azure speech SDK using python #763

Closed vishnureddy45 closed 4 years ago

vishnureddy45 commented 4 years ago
import time
import azure.cognitiveservices.speech as speechsdk

filename = "sample.wav"
speech_key, service_region = "a34634565t3", "eastus"
speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
audio_input = speechsdk.audio.AudioConfig(filename=filename)
print(audio_input)
speech_config.speech_recognition_language = "en-US"
speech_config.request_word_level_timestamps()
speech_config.enable_dictation()
speech_config.output_format = speechsdk.OutputFormat.Detailed  # same as OutputFormat(1)

speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_input)

#result = speech_recognizer.recognize_once()
all_results = []

def handle_final_result(evt):
    all_results.append(evt.result.text)

done = False

def stop_cb(evt):
    # print('CLOSING on {}'.format(evt))
    speech_recognizer.stop_continuous_recognition()
    global done  # module-level flag, so "global" rather than "nonlocal"
    done = True

#Appends the recognized text to the all_results variable. 
speech_recognizer.recognized.connect(handle_final_result) 

# Connect callbacks to the events fired by the speech recognizer & display the info/status
# Ref: https://docs.microsoft.com/en-us/python/api/azure-cognitiveservices-speech/azure.cognitiveservices.speech.eventsignal?view=azure-python
# speech_recognizer.recognizing.connect(lambda evt: print('RECOGNIZING: {}'.format(evt)))
# speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt)))
# speech_recognizer.session_started.connect(lambda evt: print('SESSION STARTED: {}'.format(evt)))
# speech_recognizer.session_stopped.connect(lambda evt: print('SESSION STOPPED {}'.format(evt)))
# speech_recognizer.canceled.connect(lambda evt: print('CANCELED {}'.format(evt)))

# Stop continuous recognition on either session stopped or canceled events
speech_recognizer.session_stopped.connect(stop_cb)
speech_recognizer.canceled.connect(stop_cb)

speech_recognizer.start_continuous_recognition()

while not done:
    time.sleep(.5)

print("Printing all results:")
print(all_results)

====================================================

I could see only three parameters in the result: result_id, text, and reason.

Input: <azure.cognitiveservices.speech.audio.AudioConfig object at 0x000001B257153288> SESSION STARTED: SessionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=802d2bbbf1cc4445a5c1afda3702dcba, text="may", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=db0429c417184818bb20eaf7ca1280c1, text="vinay", reason=ResultReason.RecognizingSpeech)) RECOGNIZED: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=f21afccf980b43e3897b96706f161249, text="Vinay", reason=ResultReason.RecognizedSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=758451455d254e0db597491a02cfaf1d, text="animal", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=04b8282427684c85ab08dc03c966da68, text="animal", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=df78927b6708434e93cc4a2387cf74ee, text="huh", reason=ResultReason.RecognizingSpeech)) RECOGNIZED: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=68ec8ebd72d742709a748ebae8d2b4df, text="", reason=ResultReason.RecognizedSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=124670e2724e452faddab0004dc58d87, text="media", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, 
result=SpeechRecognitionResult(result_id=d86c5a2dfc664afabad63394693b07ba, text="huh", reason=ResultReason.RecognizingSpeech)) RECOGNIZED: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=36d76b447bae462eb28711c728f8acb6, text="", reason=ResultReason.RecognizedSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=7a712713fb804d408d8575c0cc765e86, text="up", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=b83a9baf196343fcbdbe091067e685d8, text="uh", reason=ResultReason.RecognizingSpeech)) RECOGNIZED: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=1adbf320896640b2a0200c679ec68fbf, text="Uh Vinay.", reason=ResultReason.RecognizedSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=15b47e4e3f2e40eaad2e3714a6023a20, text="i'm good", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=bc59de7f7b184982a5052e678485dda6, text="update last", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=a8be3ce31f074a4d85afc06c0b5b150a, text="i'm good", reason=ResultReason.RecognizingSpeech)) RECOGNIZED: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=3d3e3a698ea64606b58dd7997e89018a, text="A good laugh.", reason=ResultReason.RecognizedSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, 
result=SpeechRecognitionResult(result_id=bb027425cdfd4d3cb44ec566027bc978, text="1", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=4bab7fda1fba48e7894e99fc47d73604, text="148", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=e9971cec4f7e440096562455516e88e6, text="1487", reason=ResultReason.RecognizingSpeech)) RECOGNIZED: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=6b436fa7aa614da4962b3d3bfb233fdd, text="1487", reason=ResultReason.RecognizedSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=67a0c51d50774775b33739fd36213208, text="OK", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=19a38e99ee7641a495d258601ddf537c, text="OK OK", reason=ResultReason.RecognizingSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=778a2db74deb407180d3633baff0d41b, text="OK", reason=ResultReason.RecognizingSpeech)) RECOGNIZED: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=badcbc430a2a4227b2f90ce5a8225769, text="OK, OK, I think that's enough.", reason=ResultReason.RecognizedSpeech)) RECOGNIZING: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=4b04fed09ce84ae99d6384d523ae89b8, text="huh", reason=ResultReason.RecognizingSpeech)) RECOGNIZED: SpeechRecognitionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, 
result=SpeechRecognitionResult(result_id=7001bed201474d3599c17065cc58be4f, text="", reason=ResultReason.RecognizedSpeech)) CANCELED SpeechRecognitionCanceledEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=ae217f42d03e4f8a853674a743cea3c6, text="", reason=ResultReason.Canceled)) CLOSING on SpeechRecognitionCanceledEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368, result=SpeechRecognitionResult(result_id=ae217f42d03e4f8a853674a743cea3c6, text="", reason=ResultReason.Canceled)) SESSION STOPPED SessionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368) CLOSING on SessionEventArgs(session_id=dc5ba40280a544ea82042c37c7b16368) Printing all results: ['Vinay', '', '', 'Uh Vinay.', 'A good laugh.', '1487', "OK, OK, I think that's enough.", '']

We are expecting more details from the wav file, such as speaker identification and the start and end time of each utterance. Please advise on this case.

glecaros commented 4 years ago

Hi @vishnureddy45,

You should be able to access durations and offsets directly from the result (see here).
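As a sketch of what that looks like (the event handler mirrors the one in your snippet; note the SDK reports `offset` and `duration` in 100-nanosecond ticks, so they need converting to seconds):

```python
# The Speech SDK reports result.offset and result.duration in 100-ns ticks.
TICKS_PER_SECOND = 10_000_000

all_results = []

def handle_final_result(evt):
    # evt.result carries offset/duration alongside result_id, text, and reason.
    start = evt.result.offset / TICKS_PER_SECOND
    end = (evt.result.offset + evt.result.duration) / TICKS_PER_SECOND
    all_results.append({"text": evt.result.text,
                        "start_s": start,
                        "end_s": end})

# Connect it as in the original snippet:
# speech_recognizer.recognized.connect(handle_final_result)
```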

As for speaker identification, you may want to look at Conversation Transcription or Speaker Recognition (although at this time there are no Python bindings for the APIs that consume those services).

glecaros commented 4 years ago

Additionally, you can use request_word_level_timestamps to get more detailed duration/offset results. You can access these results through the result's properties (SpeechServiceResponse_JsonResult).
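For instance, parsing word-level timings out of that JSON might look like the sketch below. The JSON string here is a trimmed, hand-written example of the detailed-format payload (not output from your file); in the SDK you would obtain the real string via `evt.result.properties.get(speechsdk.PropertyId.SpeechServiceResponse_JsonResult)`.

```python
import json

# Trimmed example of the detailed (OutputFormat.Detailed) payload returned
# when request_word_level_timestamps() is set. Offsets/durations are in
# 100-ns ticks, the same unit as result.offset/result.duration.
sample_json = """{
  "DisplayText": "Vinay",
  "Offset": 5000000,
  "Duration": 12000000,
  "NBest": [{
    "Confidence": 0.95,
    "Display": "Vinay",
    "Words": [
      {"Word": "vinay", "Offset": 5000000, "Duration": 12000000}
    ]
  }]
}"""

def word_timings(json_result: str):
    """Yield (word, start_s, end_s) tuples from a detailed recognition result."""
    payload = json.loads(json_result)
    best = payload["NBest"][0]  # hypotheses are ordered best-first
    for w in best.get("Words", []):
        start = w["Offset"] / 10_000_000               # ticks -> seconds
        end = (w["Offset"] + w["Duration"]) / 10_000_000
        yield w["Word"], start, end

for word, start, end in word_timings(sample_json):
    print(f"{word}: {start:.2f}s - {end:.2f}s")
```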