Closed shrimad-mishra closed 11 months ago
Hi, this is Darren from the Speech SDK team. Please provide the following:
1) Your programming language and platform/operating system.
2) Is the code you are writing running on a client (PC, mobile device) or a web service?
3) A better description of your scenario. Where does the audio come from? What format is the audio (sample rate, bits/sample)? Is the source a live stream, for which you need real-time speech-to-text? Is the input audio given in PCM (uncompressed) memory buffers? What is the duration of each audio buffer?
From the little information you provided (I may be wrong) it sounds like you need to use the SDK and push PCM audio buffers into it in real time, whenever one is available. See the section titled "Recognize speech from an in-memory stream". The alternative approach to recognizing from audio buffers is the "pull" model instead of "push" (see the section "How to use the audio input stream").
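As a rough illustration of the "push" model only (a minimal sketch, not a complete solution; the key/region placeholders and my_pcm_chunks() are hypothetical stand-ins for your own credentials and audio source):

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")

# "Push" model: create a stream, hand it to the recognizer, then write PCM
# buffers into it as they arrive.
push_stream = speechsdk.audio.PushAudioInputStream()
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)
speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

speech_recognizer.recognized.connect(lambda evt: print('RECOGNIZED: {}'.format(evt.result.text)))
speech_recognizer.start_continuous_recognition()

for chunk in my_pcm_chunks():   # hypothetical generator yielding PCM byte buffers
    push_stream.write(chunk)

push_stream.close()             # signals end of stream so recognition can finish
speech_recognizer.stop_continuous_recognition()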
Sure @dargilco
I am using Python on Ubuntu 20.04 and I am writing code for a web service. The audio is coming in chunks and the waveform is mono PCM with a sample rate of 8000 and 128 bits per sample, and it is a live stream. The transcription should be in real time and each chunk is around 0.019 seconds.
Thank you @shrimad-mishra for the additional details.
The following examples may help you:
scenario: captioning
sample: speech_recognition_with_push_stream
sample: speech_recognition_with_pull_stream
Be sure to set the audio format to 8 kHz, since the default is 16 kHz (see the audio_stream_format variable in the captioning scenario).
You will need to select between two ways of providing audio to the SDK, the "push" and "pull" models, as described in the links I shared.
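For reference, a minimal sketch of setting an 8 kHz input format (16-bit mono PCM is assumed here; adjust bits_per_sample to match your actual stream):

import azure.cognitiveservices.speech as speechsdk

# The SDK default is 16 kHz, 16-bit, mono PCM. For an 8 kHz source, state the format explicitly.
audio_stream_format = speechsdk.audio.AudioStreamFormat(
    samples_per_second=8000, bits_per_sample=16, channels=1)

# Pass the format when creating the stream, for either model:
push_stream = speechsdk.audio.PushAudioInputStream(stream_format=audio_stream_format)
# or: pull_stream = speechsdk.audio.PullAudioInputStream(pull_callback, stream_format=audio_stream_format)
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)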
It should be a real-time speech-to-text scenario, correct me if I am wrong.
And can you provide more context, because I have already tried the mentioned code but Azure returns blank text every time.
Follow the instructions here: https://github.com/Azure-Samples/cognitive-services-speech-sdk/tree/master/samples/python/console
And run the two samples speech_recognition_with_push_stream and speech_recognition_with_pull_stream
Use the provided WAV files and make sure that works for you (you get recognition results). After that, replace them with your own WAV file.
Then try to incorporate similar code in your application with the live input audio stream. Make sure you are feeding the SDK correct audio buffers, and setting the input audio format correctly. Dump the input audio buffers to a file, use some audio editing tool to load the raw PCM audio and verify that you can play it and it sounds good. In your application, you would likely need to use PushAudioInputStream if you have a network source periodically giving you audio buffers.
If after all that it still does not work, please share your source code and SDK Log of a run that shows the issue.
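A small sketch of how you might capture both the audio dump and the SDK log (assuming the push-stream setup above; my_pcm_chunks() is again a hypothetical stand-in for your audio source):

import azure.cognitiveservices.speech as speechsdk

# Write an SDK log file for the run you want to share.
speech_config = speechsdk.SpeechConfig(subscription="YourSubscriptionKey", region="YourServiceRegion")
speech_config.set_property(speechsdk.PropertyId.Speech_LogFilename, "speech_sdk.log")

# Dump the incoming buffers while also feeding them to the push stream, so you
# can load input_dump.raw in an audio editor (raw PCM, 8000 Hz, 16-bit, mono)
# and confirm the audio actually sounds right.
push_stream = speechsdk.audio.PushAudioInputStream()
with open("input_dump.raw", "wb") as dump:
    for chunk in my_pcm_chunks():   # hypothetical source of PCM byte buffers
        dump.write(chunk)
        push_stream.write(chunk)
push_stream.close()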
speech_recognizer.recognized.connect(lambda evt: self.print_transcript(evt))
I am trying to attach my custom async callback function but it is not getting called.
And one more thing: how do I check that the user has stopped speaking? Because recognized is called every time in continuous recognition.
@dargilco
@BrianMouncer as per our discussion this morning, please help @shrimad-mishra . Thanks!
This item has been open without activity for 19 days. Provide a comment on status and remove "update needed" label.
@shrimad-mishra In case your questions are still valid:
For the custom async callback function question, a callback can be connected like this:

import azure.cognitiveservices.speech as speechsdk

def recognized_cb(evt):
    try:
        result = evt.result
        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            print('RECOGNIZED: {}'.format(result.text))
        elif result.reason == speechsdk.ResultReason.NoMatch:
            print('NO MATCH: {}'.format(result.no_match_details.reason))
    except Exception as e:
        print(e)

speech_recognizer.recognized.connect(recognized_cb)
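One possible reason an async def handler appears to never run: as far as I know the SDK invokes event handlers as plain callables from its own worker thread, so a coroutine function passed directly creates a coroutine object that is never awaited. If that matches your setup, one way to bridge into asyncio is to schedule the coroutine from a plain callback (a sketch only; it assumes your web service already runs an asyncio event loop, and handle_transcript is a hypothetical coroutine):

import asyncio

loop = asyncio.get_event_loop()

async def handle_transcript(text):   # hypothetical coroutine doing your async work
    ...

def recognized_cb(evt):
    # The SDK fires this from its own thread; schedule the coroutine on your loop.
    asyncio.run_coroutine_threadsafe(handle_transcript(evt.result.text), loop)

speech_recognizer.recognized.connect(recognized_cb)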
Regarding "how to check that the user has stopped speaking":
Why do you need to do this? Continuous recognition runs until explicitly stopped and can recognize multiple phrases during that time. If you want a turn-based usage (step 1: the user speaks a phrase, step 2: the bot responds), then instead of continuous recognition you could call speech_recognizer.recognize_once() for step 1.
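A minimal turn-based sketch along those lines (it assumes speech_recognizer is already configured as above; make_bot_reply() is a hypothetical placeholder for your bot logic):

while True:
    result = speech_recognizer.recognize_once()   # returns after one phrase plus trailing silence
    if result.reason == speechsdk.ResultReason.RecognizedSpeech:
        print(make_bot_reply(result.text))        # step 2: respond to the user
    elif result.reason == speechsdk.ResultReason.NoMatch:
        break                                     # e.g. stop when nothing was said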
Alternatively, with continuous recognition you can check for a silence timeout after recognized speech. For example:
import threading

recognition_done = threading.Event()

def recognized_cb(evt):
    try:
        result = evt.result
        if result.reason == speechsdk.ResultReason.RecognizedSpeech:
            print('RECOGNIZED: {}'.format(result.text))
        elif result.reason == speechsdk.ResultReason.NoMatch:
            print('NOMATCH: {}'.format(result.no_match_details.reason))
            if result.no_match_details.reason == speechsdk.NoMatchReason.InitialSilenceTimeout:
                print('Closing because of silence timeout')
                recognition_done.set()
    except Exception as e:
        print(e)
and start/stop recognition like this:
speech_recognizer.start_continuous_recognition_async().get()
recognition_done.wait()
speech_recognizer.stop_continuous_recognition_async().get()
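If you instead want to tune how much trailing silence ends a recognized phrase during continuous recognition, the segmentation silence timeout property may help (value in milliseconds; please verify the property name against your SDK version):

speech_config.set_property(
    speechsdk.PropertyId.Speech_SegmentationSilenceTimeoutMs, "1000")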
Closed as solution examples were provided. Please open a new issue if more support is needed.
Hi, I am trying to build a voice bot using Azure speech-to-text, but I am stuck at a point where I am getting a PCM audio stream and want to get its transcription in real time, but I am not able to find anything related to that. Please help ASAP.
This does not help: https://github.com/Azure-Samples/cognitive-services-speech-sdk/issues/2064
Attaching a file for your reference:- https://drive.google.com/file/d/1l11mOJR2RJX0UdoAiZe_ADvpzhHxx1-u/view?usp=sharing
Can you provide code and update the docs with the code itself?