Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Reduce latency while transcribing the speech to text - Cognitive Services Speech to text #2165

Closed HKAlwala closed 12 months ago

HKAlwala commented 12 months ago

Hi

I have written Python code to convert speech to text using Azure Cognitive Services (Speech SDK 1.33.0). My requirement is to get speaker diarization and word-level timestamps. I want to show the transcribed text on screen as it is received, to give a real-time transcription feel. I am sending frames from an audio file (WAV) through a PushStream so that the response text arrives in real time. The problem is that I get whole sentences back instead of each word or phrase, so users have to wait for some time before any text appears on screen.

Here is my code:

```python
import json
import logging
import threading
import time
import traceback
import wave
from datetime import datetime

import azure.cognitiveservices.speech as speechsdk

# subscription_key, region, audio_file, logger, processWords and
# resultsJSONFileObj are defined elsewhere in the application.


def conversation_transcriber_recognition_canceled_cb(evt: speechsdk.SessionEventArgs):
    print('Canceled event' + str(evt.result))
    logger.info('Canceled event' + str(evt.result))


def conversation_transcriber_session_stopped_cb(evt: speechsdk.SessionEventArgs):
    print('SessionStopped event')
    logger.info('SessionStopped event')


def conversation_transcriber_transcribed_cb(evt: speechsdk.SpeechRecognitionEventArgs):
    print('TRANSCRIBED:')
    logger.info("Fetching the 'TRANSCRIBED content'...")
    try:
        paraDict = dict()
        results = json.loads(evt.result.json)
        displayText = results['DisplayText']
        print("displayText-->" + displayText)
        speakerName = results['SpeakerId']
        paraDict['SpeakerName'] = speakerName
        paraDict['Text'] = displayText
        fileFormat = datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
        logger.info(fileFormat + " - ")
        logger.info(paraDict)
        processWords(paraDict=paraDict)

        # write results JSON to a file for later processing
        resultsJSONFileObj.write(json.dumps(results) + ",\n")
        resultsJSONFileObj.flush()
    except Exception:
        print(traceback.format_exc())
        logging.error(traceback.format_exc())


def conversation_transcriber_session_started_cb(evt: speechsdk.SessionEventArgs):
    print('SessionStarted event')
    logger.info('SessionStarted event')


def push_stream_writer(stream):
    # The number of bytes to push per buffer
    n_bytes = 3200 * 4
    # start pushing data until all data has been read from the file
    try:
        wav_fh = wave.open(audio_file)
        while True:
            frames = wav_fh.readframes(n_bytes // 2)
            # print('read {} bytes'.format(len(frames)))
            if len(frames) == 0:
                logger.info('waiting for the frames.... length = 0')
                time.sleep(2)
                continue
            if not frames:
                break
            stream.write(frames)
            time.sleep(.1)
    finally:
        wav_fh.close()
        stream.close()  # must be done to signal the end of stream


def conversation_transcription():
    """transcribes a conversation"""
    # Creates speech configuration with subscription information
    speech_config = speechsdk.SpeechConfig(
        subscription=subscription_key, region=region)
    speech_config.enable_dictation()
    speech_config.output_format = speechsdk.OutputFormat(1)  # 1 == Detailed
    speech_config.request_word_level_timestamps()

    channels = 1
    bits_per_sample = 16
    samples_per_second = 16000

    # Create audio configuration using the push stream
    wave_format = speechsdk.audio.AudioStreamFormat(
        samples_per_second, bits_per_sample, channels)
    stream = speechsdk.audio.PushAudioInputStream(stream_format=wave_format)
    audio_config = speechsdk.audio.AudioConfig(stream=stream)

    # Set conversation ending detection timeout (4 hours in seconds)
    conversation_ending_detection_timeout = 4 * 60 * 60
    # speech_config.set_service_property("conversationEndSilenceTimeoutMs", str(
    #     conversation_ending_detection_timeout * 1000), speechsdk.ServicePropertyChannel.UriQueryParameter)
    # OR
    speech_config.set_service_property(str(speechsdk.PropertyId.Speech_SegmentationSilenceTimeoutMs), str(
        conversation_ending_detection_timeout * 1000), speechsdk.ServicePropertyChannel.UriQueryParameter)
    valueHere = speechsdk.PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs
    logger.info("speechsdk.PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs-->" + str(valueHere))
    speech_config.set_service_property(str(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFirstByteLatencyMs), str(500), speechsdk.ServicePropertyChannel.UriQueryParameter)
    speech_config.set_service_property(str(speechsdk.PropertyId.SpeechServiceResponse_SynthesisFinishLatencyMs), str(500), speechsdk.ServicePropertyChannel.UriQueryParameter)
    transcriber = speechsdk.transcription.ConversationTranscriber(
        speech_config, audio_config)

    # start push stream writer thread
    push_stream_writer_thread = threading.Thread(
        target=push_stream_writer, args=[stream])
    push_stream_writer_thread.start()
    time.sleep(.1)

    done = False

    def stop_cb(evt: speechsdk.SessionEventArgs):
        """callback that signals to stop continuous transcription upon receiving an event `evt`"""
        print('CLOSING {}'.format(evt))
        nonlocal done
        done = True

    # Subscribe to the events fired by the conversation transcriber
    # transcriber.transcribing.connect(conversation_transcriber_transcribing_cb)
    transcriber.transcribed.connect(conversation_transcriber_transcribed_cb)
    transcriber.session_started.connect(
        conversation_transcriber_session_started_cb)
    transcriber.session_stopped.connect(
        conversation_transcriber_session_stopped_cb)
    transcriber.canceled.connect(
        conversation_transcriber_recognition_canceled_cb)
    # stop continuous transcription on either session stopped or canceled events
    transcriber.session_stopped.connect(stop_cb)
    transcriber.canceled.connect(stop_cb)

    transcriber.start_transcribing_async()

    # Waits for completion.
    while not done:
        time.sleep(.1)
    transcriber.stop_transcribing_async()
```

From the code you can see that I use the class `speechsdk.transcription.ConversationTranscriber` and the method `start_transcribing_async()`. I suspect this is why the response comes back as large sentences. I later changed the method to `start_continuous_recognition_async()`, but that fails with a `NotImplementedError` at runtime. Is this method not implemented for this class?

For the `SpeechRecognizer` class this works, and I get the transcribed text back immediately, word by word, which feels like real-time transcription.

Would using `start_continuous_recognition_async()` solve the problem, assuming you fix this?
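
For comparison, a minimal sketch of the `SpeechRecognizer` flow described above (the key, region, and audio file name are placeholders; the original code feeds a push stream instead of a file). The `recognizing` event delivers partial hypotheses, which is what makes the output appear word by word:

```python
import time
import azure.cognitiveservices.speech as speechsdk

# Placeholders for illustration only.
speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
audio_config = speechsdk.audio.AudioConfig(filename="your_audio.wav")

recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

# Partial hypotheses arrive continuously while the audio is being processed.
recognizer.recognizing.connect(lambda evt: print('RECOGNIZING:', evt.result.text))
# Final result for each recognized phrase.
recognizer.recognized.connect(lambda evt: print('RECOGNIZED:', evt.result.text))

recognizer.start_continuous_recognition_async().get()
time.sleep(30)  # let audio flow; in practice, stop on session_stopped / canceled
recognizer.stop_continuous_recognition_async().get()
```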

jhakulin commented 12 months ago

@HKAlwala Thanks for the report, could you please summarize the main problem? Please provide the input, the expected output, and the actual output you see.

Input: ? Expected output: ? Actual output: ?

HKAlwala commented 12 months ago

Hi,

Input: This process transcribes an audio file to text in real time. As the speaker speaks, the audio feed is written to a WAV file. The Python program reads the WAV file frame by frame and sends the frames to the API immediately, with almost no delay.

Expected output: The frames that are sent should be transcribed and the transcribed text returned immediately, word by word, as the person speaks. That is, the response should come back in words or phrases corresponding to the frames sent. The feature I am after is to show the text continuously, as words or short phrases, while the user speaks.

Actual output: The transcribed text is returned after a considerable gap, as one big paragraph. I believe this is because I use `transcriber.start_transcribing_async()`.

My understanding is that using `transcriber.start_continuous_recognition_async()` should give the expected behaviour, but it fails with a `NotImplementedError` at runtime.

jhakulin commented 12 months ago

@HKAlwala Thanks for the summary! The problem is probably related to segmentation. An improvement to segmentation will be released on the service side in the near future. However, I can see your code already touches the Speech_SegmentationSilenceTimeoutMs property; please double-check that your code is correct there, because that can affect the outcome. For how to adjust the segmentation silence timeout, see the following document: https://learn.microsoft.com/en-us/azure/ai-services/speech-service/how-to-recognize-speech?pivots=programming-language-csharp#change-how-silence-is-handled
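
For illustration, a minimal sketch of setting that property on the Python `SpeechConfig` (the key and region are placeholders, and 500 ms is only an example value; the linked document describes the accepted range). The value is the number of milliseconds of trailing silence that closes a phrase, so the multi-hour value computed in the code above is likely not what was intended:

```python
import azure.cognitiveservices.speech as speechsdk

# Placeholder credentials for illustration only.
speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")

# Segmentation silence timeout in milliseconds: how much trailing silence
# closes a phrase. Smaller values return final (transcribed) results sooner.
# Note: set_property takes the PropertyId itself, not str(PropertyId...).
speech_config.set_property(
    speechsdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, "500")
```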

Also, for your other question: these are the correct APIs to use for continuous conversation transcription: `transcriber.start_transcribing_async()` and `transcriber.stop_transcribing_async()`.
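
A minimal sketch of that call pair, with the `transcribing` event (commented out in the original code) connected so that intermediate hypotheses are surfaced while the audio is still being processed; the key, region, and audio file name are placeholders:

```python
import time
import azure.cognitiveservices.speech as speechsdk

# Placeholders for illustration only; a push stream could be used instead of a file.
speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourRegion")
audio_config = speechsdk.audio.AudioConfig(filename="your_audio.wav")

transcriber = speechsdk.transcription.ConversationTranscriber(
    speech_config=speech_config, audio_config=audio_config)

# Intermediate hypotheses: fired repeatedly while a phrase is still being spoken.
transcriber.transcribing.connect(
    lambda evt: print('TRANSCRIBING:', evt.result.text))
# Final result for each phrase, including the speaker id.
transcriber.transcribed.connect(
    lambda evt: print('TRANSCRIBED:', evt.result.speaker_id, evt.result.text))

transcriber.start_transcribing_async().get()
time.sleep(30)  # for a real application, wait for session_stopped / canceled instead
transcriber.stop_transcribing_async().get()
```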

jhakulin commented 12 months ago

Closing the issue as answered; the improvement to the current segmentation behaviour is being worked on by the service team, with an ETA of early 2024.