@HKAlwala Thanks for the report. Could you please summarize the main problem? Please provide the input, the expected output, and the actual output you see.
Input: ? Expected output: ? Actual output: ?
Hi,
Input: This process transcribes an audio file to text in real time. As the speaker speaks, the feed is written to a wav file. A Python program reads the wav file in frames and sends them to the API immediately, with almost no delay.
Expected output: The sent frames should be transcribed and the text returned immediately, word by word as the user speaks. That is, the response should come back as words or phrases matching the frames sent. As a feature, I want to show the text continuously on screen as the user speaks, as words or short sentences.
Actual output: The transcribed text is returned as one big paragraph, after a considerable gap. I believe this is because I use this API (`transcriber.start_transcribing_async()`).
My understanding is that using `start_continuous_recognition_async()` should give the expected behaviour, but it fails at runtime with a `NotImplementedError`.
@HKAlwala Thanks for the summary! The problem is probably related to segmentation. An improvement to segmentation will be released on the service side in the near future. However, I can see your code already touches the `Speech_SegmentationSilenceTimeoutMs` property; please double check that your code is correct there, because that can impact the outcome. For how to adjust the segmentation silence timeout, see the following document: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-recognize-speech?pivots=programming-language-csharp#change-how-silence-is-handled
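For reference, setting that property in Python looks roughly like this (a minimal sketch; the subscription placeholders and the 500 ms value are examples only, not recommendations):

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="YourSubscriptionKey", region="YourServiceRegion")

# Segmentation silence timeout in milliseconds. Lower values make the service
# close segments sooner, so final results arrive in shorter chunks; a value
# that is too low can split sentences mid-phrase.
speech_config.set_property(
    speechsdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, "500")
```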
Also, to answer your other question: these are the correct APIs to use for continuous speech transcription: `transcriber.start_transcribing_async()` and `transcriber.stop_transcribing_async()`.
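In outline the flow looks like this (a sketch that assumes `transcriber` is a `speechsdk.transcription.ConversationTranscriber` wired to your push stream; the print handlers are placeholders):

```python
# `transcribing` fires with partial hypotheses while the speaker is still
# talking; `transcribed` fires with the final text of each segment.
transcriber.transcribing.connect(lambda evt: print('PARTIAL:', evt.result.text))
transcriber.transcribed.connect(lambda evt: print('FINAL:', evt.result.text))

transcriber.start_transcribing_async().get()
# ... keep pushing audio frames to the stream ...
transcriber.stop_transcribing_async().get()
```

If the goal is just to update the on-screen text as the user speaks, the `transcribing` event may already give you that, independently of how final segments are cut.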
Closing the issue as answered; the improvement to the current segmentation is being worked on by the service team, with an ETA of early 2024.
Hi
I have written Python code to convert speech to text using Azure Cognitive Services (SDK version 1.33.0). My requirement is to get speaker diarization and word-level timestamps. I want to show the transcribed text on screen as it is received, to give a real-time transcription feel. I am sending frames from the audio file (a wav file) through a PushStream and expecting the response text in real time. The problem is that I get whole sentences as the response instead of individual words or phrases, so users have to wait a while before any text appears on screen.
Here is my code:

```python
import json
import logging
from datetime import datetime

import azure.cognitiveservices.speech as speechsdk

logger = logging.getLogger(__name__)


def conversation_transcriber_recognition_canceled_cb(evt: speechsdk.SessionEventArgs):
    print('Canceled event' + str(evt.result))
    logger.info('Canceled event' + str(evt.result))


def conversation_transcriber_session_stopped_cb(evt: speechsdk.SessionEventArgs):
    print('SessionStopped event')
    logger.info('SessionStopped event')


def conversation_transcriber_transcribed_cb(evt: speechsdk.SpeechRecognitionEventArgs):
    print('TRANSCRIBED:')
    logger.info("Fetching the 'TRANSCRIBED content'...")
    try:
        paraDict = dict()
        results = json.loads(evt.result.json)
        displayText = results['DisplayText']
        print("displayText-->" + displayText)
        speakerName = results['SpeakerId']
        paraDict['SpeakerName'] = speakerName
        paraDict['Text'] = displayText
        fileFormat = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
        logger.info(fileFormat + " - ")
        logger.info(paraDict)
        processWords(paraDict=paraDict)
        # write results JSON to a file for later processing.
    except Exception as err:  # exception handling was elided in the post
        logger.error(err)


def conversation_transcriber_session_started_cb(evt: speechsdk.SessionEventArgs):
    print('SessionStarted event')
    logger.info('SessionStarted event')


def push_stream_writer(stream):
    # The number of bytes to push per buffer
    ...  # body elided in the post


def conversation_transcription():
    """transcribes a conversation"""
    # Creates speech configuration with subscription information
    ...  # body elided in the post
```
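The bodies of `push_stream_writer` and `conversation_transcription` are elided above. For context, here is a rough sketch of the wiring I have in mind; the chunk size, file name, and threading details are illustrative, not the exact original code:

```python
import threading
import time


def push_stream_writer(stream):
    # The number of bytes to push per buffer: for 16 kHz, 16-bit, mono PCM,
    # 3200 bytes is roughly 100 ms of audio (illustrative value).
    n_bytes = 3200
    with open('input.wav', 'rb') as wav_fh:
        wav_fh.seek(44)  # skip the RIFF/WAVE header
        while True:
            frames = wav_fh.read(n_bytes)
            if not frames:
                break
            stream.write(frames)
            time.sleep(0.1)  # pace the pushes roughly in real time
    stream.close()


def conversation_transcription():
    """transcribes a conversation"""
    # Creates speech configuration with subscription information
    speech_config = speechsdk.SpeechConfig(
        subscription="YourSubscriptionKey", region="YourServiceRegion")

    push_stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
    transcriber = speechsdk.transcription.ConversationTranscriber(
        speech_config=speech_config, audio_config=audio_config)

    transcriber.transcribed.connect(conversation_transcriber_transcribed_cb)
    transcriber.session_started.connect(conversation_transcriber_session_started_cb)
    transcriber.session_stopped.connect(conversation_transcriber_session_stopped_cb)
    transcriber.canceled.connect(conversation_transcriber_recognition_canceled_cb)

    transcriber.start_transcribing_async().get()
    push_thread = threading.Thread(target=push_stream_writer, args=(push_stream,))
    push_thread.start()
    push_thread.join()
    transcriber.stop_transcribing_async().get()
```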
From the code you can see that I used the class `speechsdk.transcription.ConversationTranscriber` with the method `start_transcribing_async()`. I felt this could be the problem and the reason the transcribed text comes back as large sentences. I later changed the method to `start_continuous_recognition_async()`, but I get a `NotImplementedError` at runtime. Is this not implemented for this class?
For the `SpeechRecognizer` class this works, and I get the transcribed text back immediately in the form of words, which feels like real-time transcription.
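For reference, the `SpeechRecognizer` flow I tested looks roughly like this (a sketch; it assumes the same `speech_config` and push-stream `audio_config` as above):

```python
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config, audio_config=audio_config)

# `recognizing` fires repeatedly with growing partial text as the user
# speaks; `recognized` fires once per utterance with the final result.
speech_recognizer.recognizing.connect(lambda evt: print('PARTIAL:', evt.result.text))
speech_recognizer.recognized.connect(lambda evt: print('FINAL:', evt.result.text))

speech_recognizer.start_continuous_recognition_async().get()
# ... feed audio frames ...
speech_recognizer.stop_continuous_recognition_async().get()
```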
Will using `start_continuous_recognition_async()` solve the problem, assuming you fix this?