Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Continuous PCM audio streaming to Azure speech-to-text #2069

Closed shrimad-mishra closed 11 months ago

shrimad-mishra commented 1 year ago

Hi, I am trying to build a voice bot using Azure speech-to-text, but I am stuck: I am receiving a PCM audio stream and want to get its transcription in real time, but I am not able to find anything related to this. Please help ASAP.

This does not help here: https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/2064

Attaching a file for your reference: https://drive.google.com/file/d/1l11mOJR2RJX0UdoAiZe_ADvpzhHxx1-u/view?usp=sharing

Can you provide code and update the docs with the code itself?

dargilco commented 1 year ago

Hi, this is Darren from the Speech SDK team. Please provide the following:

  1. Your programming language and platform/operating system.
  2. Is the code you are writing running on a client (PC, mobile device) or on a web service?
  3. A better description of your scenario: Where does the audio come from? What format is the audio (sample rate, bits/sample)? Is the source a live stream, for which you need real-time speech-to-text? Is the input audio given in PCM (uncompressed) memory buffers? What is the duration of each audio buffer?

From the little information you provided (I may be wrong), it sounds like you need to use the SDK and push PCM audio buffers into it in real time, whenever one is available. See the section titled "Recognize speech from an in-memory stream". The alternative to recognizing from pushed audio buffers is the "pull" model instead of "push" (see the section "How to use the audio input stream").
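
For illustration, a minimal push-model sketch in Python (not a complete sample; get_next_chunk() is a hypothetical placeholder for your real audio source) could look like this:

    import azure.cognitiveservices.speech as speechsdk

    speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")

    # With no explicit format, the SDK assumes 16 kHz, 16-bit, mono PCM.
    push_stream = speechsdk.audio.PushAudioInputStream()
    audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt.result.text)))
    speech_recognizer.start_continuous_recognition()

    # Feed buffers to the SDK as they become available.
    while True:
        chunk = get_next_chunk()  # hypothetical live audio source
        if not chunk:
            break
        push_stream.write(chunk)

    push_stream.close()  # signals end of stream
    speech_recognizer.stop_continuous_recognition()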

shrimad-mishra commented 1 year ago

Sure @dargilco

I am using Python on Ubuntu 20.04, and I am writing code for a web service. The audio comes in chunks; the waveform is mono PCM with a sample rate of 8000 and 128 bits per sample, and it is a live stream. The transcription should be in real time, and each chunk is around 0.019 seconds.

dargilco commented 1 year ago

Thank you @shrimad-mishra for the additional details.

The following examples may help you:

scenario: captioning

sample: speech_recognition_with_push_stream

sample: speech_recognition_with_pull_stream

Be sure to set the audio format to 8 kHz, since the default is 16 kHz (see the audio_stream_format variable in the captioning scenario).

You will need to choose between the two ways of providing audio to the SDK, the "push" and "pull" models, as described in the links I shared.
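
Concretely (a sketch, not the sample's exact code, and assuming 16-bit samples), creating the 8 kHz stream format for a push stream would look like:

    # 8 kHz, 16-bit, mono PCM; without this the SDK assumes 16 kHz.
    audio_format = speechsdk.audio.AudioStreamFormat(samples_per_second=8000,
                                                     bits_per_sample=16,
                                                     channels=1)
    push_stream = speechsdk.audio.PushAudioInputStream(stream_format=audio_format)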

shrimad-mishra commented 1 year ago

It should be a real-time speech-to-text scenario; correct me if I am wrong.

And can you provide more context? I have already tried the code you mentioned, but Azure returns blank text every time.

dargilco commented 1 year ago

Follow the instructions here: https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/python/console

And run the two samples speech_recognition_with_push_stream and speech_recognition_with_pull_stream

Use the provided WAV files and make sure that works for you (you get recognition results). After that, replace them with your own WAV file.

Then try to incorporate similar code in your application with the live input audio stream. Make sure you are feeding the SDK correct audio buffers and setting the input audio format correctly. Dump the input audio buffers to a file, load the raw PCM audio in an audio editing tool, and verify that you can play it back and it sounds good. In your application you will likely need to use PushAudioInputStream, since you have a network source periodically giving you audio buffers.

If after all that it still does not work, please share your source code and SDK Log of a run that shows the issue.
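
As a sketch of the dump-to-file step above (assuming the push-stream setup from earlier in the thread; get_next_chunk() is again a hypothetical audio source), you could tee every incoming buffer into a raw file before pushing it:

    # Debugging tee: keep a copy of the exact bytes fed to the SDK. Import the
    # resulting file in an audio editor as raw PCM (8 kHz, mono) to verify it.
    with open('input_dump.raw', 'wb') as dump_file:
        while True:
            chunk = get_next_chunk()  # hypothetical live audio source
            if not chunk:
                break
            dump_file.write(chunk)
            push_stream.write(chunk)
    push_stream.close()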

shrimad-mishra commented 1 year ago

    speech_recognizer.recognized.connect(lambda evt: self.print_transcript(evt))

I am trying to attach my custom async callback function, but it is not getting called.

And one more thing: how can I check that the user has stopped speaking? Because recognized is called every time in continuous recognition.

@dargilco

dargilco commented 1 year ago

@BrianMouncer as per our discussion this morning, please help @shrimad-mishra . Thanks!

github-actions[bot] commented 1 year ago

This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.

pankopon commented 11 months ago

@shrimad-mishra In case your questions are still valid:

For a custom async callback function, e.g.

    def recognized_cb(evt):
        # Called by the SDK for each final recognition result.
        try:
            result = evt.result
            if result.reason == speechsdk.ResultReason.RecognizedSpeech:
                print('RECOGNIZED: {}'.format(result.text))
            elif result.reason == speechsdk.ResultReason.NoMatch:
                print('NO MATCH: {}'.format(result.no_match_details.reason))
        except Exception as e:
            print(e)

    speech_recognizer.recognized.connect(recognized_cb)
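
A note on the async part: connect() expects a plain callable that the SDK invokes on one of its own background threads, so passing an async def produces a coroutine object that is never awaited, and its body never runs. One possible workaround (a sketch, assuming your web service runs an asyncio event loop) is to bridge the callback into that loop:

    import asyncio

    async def print_transcript(evt):
        # Hypothetical coroutine doing the real work in the web service.
        print('RECOGNIZED: {}'.format(evt.result.text))

    loop = asyncio.get_running_loop()  # capture the loop while inside it

    def recognized_cb(evt):
        # Fired on an SDK thread; hand the event over to asyncio safely.
        asyncio.run_coroutine_threadsafe(print_transcript(evt), loop)

    speech_recognizer.recognized.connect(recognized_cb)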

> how can I check that the user has stopped speaking

Why do you need to do this? Continuous recognition runs until explicitly stopped and can recognize multiple phrases during that time. If you want the usage to be like

  1. Prompt the user for input
  2. Run some action depending on the recognized speech
  3. Go to 1

then instead of continuous recognition you could call speech_recognizer.recognize_once() for step 1 (see the sketch at the end of this comment). Alternatively, with continuous recognition you can check for a silence timeout after recognized speech. For example:

    import threading

    recognition_done = threading.Event()

    def recognized_cb(evt):
        try:
            result = evt.result
            if result.reason == speechsdk.ResultReason.RecognizedSpeech:
                print('RECOGNIZED: {}'.format(result.text))
            elif result.reason == speechsdk.ResultReason.NoMatch:
                print('NOMATCH: {}'.format(result.no_match_details.reason))
                if result.no_match_details.reason == speechsdk.NoMatchReason.InitialSilenceTimeout:
                    print('Closing because of silence timeout')
                    recognition_done.set()
        except Exception as e:
            print(e)

and start/stop recognition like

    speech_recognizer.start_continuous_recognition_async().get()
    recognition_done.wait()
    speech_recognizer.stop_continuous_recognition_async().get()
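
For the prompt/act/repeat flow in the list above, a minimal single-shot loop might look like this (handle_command is a hypothetical application function that returns False to end the session):

    # Hypothetical prompt/act/repeat loop using single-shot recognition.
    while True:
        print('Say something...')
        result = speech_recognizer.recognize_once()
        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            if not handle_command(result.text):
                break
        elif result.reason == speechsdk.ResultReason.NoMatch:
            print('NOMATCH: {}'.format(result.no_match_details.reason))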

pankopon commented 11 months ago

Closed as solution examples were provided. Please open a new issue if more support is needed.