How to monkey-patch the Recognizer.listen method?

jhoelzl commented 8 years ago

Hello,

For my project, I want to make some custom modifications to the Recognizer.listen method.

Following this tutorial, I implemented the following code:

import speech_recognition as sr

class customRecognizer(sr.Recognizer):
    def listen(self, source, timeout = None):
        ....
        # My modified listen method
        ....

sr.Recognizer = customRecognizer

However, I have problems with the AudioSource instance check:

assert isinstance(source, AudioSource), "Source must be an audio source"

and I always get this error:

 'Unexpected error:', <type 'exceptions.NameError'>

Any suggestions? Thanks!

Uberi commented 8 years ago

Hi @jhoelzl,

I'm not sure what your full code looks like (if you paste it here, I could take a look), but you should ensure that the source parameter is always an AudioSource.

You can even make a custom AudioSource class by using something like the following:

class CustomSource(sr.AudioSource):
    def __init__(self):
        print("abcd")

Passing that into the listen function as the source argument will then work correctly.
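
A possible cause of the NameError above, assuming the modified listen body was copied verbatim from the library source: inside the library, AudioSource is a module-level name, but in a separate script it is only reachable as sr.AudioSource. The sketch below reproduces this with a hypothetical stand-in module (sr here is not the real library), so it runs without speech_recognition installed.

```python
import types

# Hypothetical stand-in for the speech_recognition module; only the name lookup matters here.
sr = types.ModuleType("sr_stand_in")
sr.AudioSource = type("AudioSource", (object,), {})


def check_unqualified(source):
    # Copied as-is from library code: the bare name AudioSource is not defined in this script.
    assert isinstance(source, AudioSource), "Source must be an audio source"


def check_qualified(source):
    # Qualifying the name with the module makes the lookup succeed.
    assert isinstance(source, sr.AudioSource), "Source must be an audio source"


source = sr.AudioSource()
check_qualified(source)  # passes silently

try:
    check_unqualified(source)
    raised_name_error = False
except NameError:
    raised_name_error = True
```

In a real script, the same applies to any other library name used inside a copied method body.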

jhoelzl commented 8 years ago

Hello @Uberi, thanks for the explanation. I tried this:

import speech_recognition as sr

class CustomSource(sr.AudioSource):
    def __init__(self):
        print("abcd")

class customRecognizer(sr.Recognizer):
    def listen(self, source, timeout = None):
        ....
        # for testing exactly the same code as written in original listen() function
        ....

sr.Recognizer = customRecognizer

# obtain audio from the microphone
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something!")
    audio = r.listen(source)

How exactly should I pass my CustomSource to the listen() function?

Uberi commented 8 years ago

Ah, I see, in that case you'll want to subclass sr.Microphone rather than sr.AudioSource:

import speech_recognition as sr

class CustomSource(sr.Microphone):
    def __init__(self):
        print("about to initialize microphone")
        super(self).__init__() # this will call the sr.Microphone initializer
        print("done initializing microphone")

class CustomRecognizer(sr.Recognizer):
    def listen(self, source, timeout = None):
        print("starting listening")
        super(self).listen(source, timeout) # call `sr.Recognizer.listen` (this line isn't necessary, I just put it here to demonstrate)
        print("done listening")

sr.Recognizer = CustomRecognizer
sr.Microphone = CustomSource

# obtain audio from the microphone
r = sr.Recognizer()
with sr.Microphone() as source:
    print("Say something!")
    audio = r.listen(source)

There is probably a simpler way to do what you're looking for though - what kind of listening functionality is needed here that isn't already present?
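
If only listen needs to change, one alternative to swapping out the whole class is to assign a new function to the method on the existing class. This is a minimal sketch using a hypothetical stand-in Recognizer class rather than the real library; custom_listen and original_listen are illustrative names.

```python
class Recognizer(object):
    """Hypothetical stand-in for sr.Recognizer."""

    def listen(self, source, timeout=None):
        return "audio from " + source


# Keep a reference to the original so the replacement can still delegate to it.
original_listen = Recognizer.listen


def custom_listen(self, source, timeout=None):
    # Custom logic would go here; this sketch only tags the result.
    audio = original_listen(self, source, timeout)
    return audio + " (custom)"


existing = Recognizer()            # created before the patch
Recognizer.listen = custom_listen  # monkey-patch the method on the class

patched_result = existing.listen("microphone")
```

Because the method is looked up on the class at call time, instances created before the patch pick up the new behavior too. Rebinding sr.Recognizer, by contrast, only affects code that looks the name up on the module afterwards.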

jhoelzl commented 8 years ago

Thanks. I just want to modify some code in the library's listen() method to improve the decisions of the voice activity detector.

Hmm, I get this error when I copy your code into a script called "test_custom.py" and run it:

  File "test_custom.py", line 20, in <module>
    with sr.Microphone() as source:
  File "test_custom.py", line 6, in __init__
    super(self).__init__() # this will call the sr.Microphone initializer
TypeError: must be type, not CustomSource

Uberi commented 8 years ago

Seems like you're using Python 2 - you'll need some modifications to make super work.
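
For reference, super(self) passes an instance where a class is expected, which raises TypeError on Python 3 as well; the Python 3 spelling is super().__init__(). A form that works on both Python 2 and 3 names the class explicitly. A minimal sketch with a hypothetical stand-in for sr.Microphone:

```python
class Microphone(object):
    """Hypothetical stand-in for sr.Microphone."""

    def __init__(self):
        self.initialized = True


class BrokenSource(Microphone):
    def __init__(self):
        super(self).__init__()  # wrong: the first argument must be a class


class CustomSource(Microphone):
    def __init__(self):
        super(CustomSource, self).__init__()  # works on Python 2 and 3


try:
    BrokenSource()
    broken_raised = False
except TypeError:
    broken_raised = True

source = CustomSource()
```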

jhoelzl commented 8 years ago

Thanks, yes, I was using Python 2.7.6.

jhoelzl commented 8 years ago

Maybe you can help me with one more issue:

This is my code:

import speech_recognition as sr

class CustomSource(sr.Microphone):
    def __init__(self, device_index = None, sample_rate = 16000, chunk_size = 1024):
        print("about to initialize microphone")
        super(self.__class__, self).__init__()
        print("done initializing microphone")

class CustomRecognizer(sr.Recognizer):
    def listen(self, source, timeout = None):
        print("starting listening")
        super(self.__class__, self).listen(source, timeout)
        print("done listening")

sr.Recognizer = CustomRecognizer
sr.Microphone = CustomSource

# obtain audio from the microphone
r = sr.Recognizer()
with sr.Microphone(chunk_size = 512) as source:
    print("Say something!")
    audio = r.listen(source)

# Google ASR
try:
    print("Google Speech Recognition thinks you said " + r.recognize_google(audio))
except sr.UnknownValueError:
    print("Google Speech Recognition could not understand audio")
except sr.RequestError as e:
    print("Could not request results from Google Speech Recognition service; {0}".format(e))

I get this error in recognize_google():

    assert isinstance(audio_data, AudioData), "audio_data must be audio data"
AssertionError: audio_data must be audio data

Uberi commented 8 years ago

Hi @jhoelzl,

Your listen method needs to actually return the listened audio. You probably want to store the result of the super() method call, then return that at the end.
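
The effect of the missing return can be seen in isolation. A method without a return statement returns None, and None then fails the isinstance(audio_data, AudioData) check. A minimal sketch with a hypothetical stand-in Recognizer:

```python
class Recognizer(object):
    """Hypothetical stand-in for sr.Recognizer."""

    def listen(self, source, timeout=None):
        return "audio data"


class SilentRecognizer(Recognizer):
    def listen(self, source, timeout=None):
        super(SilentRecognizer, self).listen(source, timeout)  # result is discarded


class CustomRecognizer(Recognizer):
    def listen(self, source, timeout=None):
        result = super(CustomRecognizer, self).listen(source, timeout)
        return result  # hand the audio back to the caller


lost = SilentRecognizer().listen("mic")
kept = CustomRecognizer().listen("mic")
```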

Aakashdeveloper commented 7 years ago

Did this solve your error? I am having the same problem.

jhoelzl commented 7 years ago

No, unfortunately not. I still use the original library code and modify it directly, without CustomRecognizer or CustomSource.

Uberi commented 7 years ago

Hi @jhoelzl, @Aakashdeveloper

The following code works fine for me in Python 2, after implementing the changes I mentioned above:

import speech_recognition as sr

class CustomSource(sr.Microphone):
    def __init__(self, device_index = None, sample_rate = 16000, chunk_size = 1024):
        print("about to initialize microphone")
        result = super(CustomSource, self).__init__(device_index, sample_rate, chunk_size)  # name the class explicitly and forward the arguments
        print("done initializing microphone")
        return result

class CustomRecognizer(sr.Recognizer):
    def listen(self, source, timeout = None):
        print("starting listening")
        result = super(CustomRecognizer, self).listen(source, timeout)
        print("done listening")
        return result

sr.Recognizer = CustomRecognizer
sr.Microphone = CustomSource

r = sr.Recognizer()
with sr.Microphone(chunk_size = 512) as source:
    audio = r.listen(source)
print("Google Speech Recognition thinks you said " + r.recognize_google(audio))
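
Two details of the super(self.__class__, self) pattern used earlier in this thread may be worth noting. First, it recurses without end as soon as the class is subclassed again, because self.__class__ is then the subclass. Second, an __init__ override that calls the parent initializer without arguments silently drops values such as chunk_size. A sketch with a hypothetical stand-in Microphone (not the real class):

```python
class Microphone(object):
    """Hypothetical stand-in for sr.Microphone."""

    def __init__(self, device_index=None, sample_rate=16000, chunk_size=1024):
        self.CHUNK = chunk_size


class DroppingSource(Microphone):
    def __init__(self, device_index=None, sample_rate=16000, chunk_size=1024):
        super(self.__class__, self).__init__()  # arguments are not forwarded


class ForwardingSource(Microphone):
    def __init__(self, device_index=None, sample_rate=16000, chunk_size=1024):
        # Name the class explicitly and pass the arguments on.
        super(ForwardingSource, self).__init__(device_index, sample_rate, chunk_size)


class SubSource(DroppingSource):
    pass


dropped_chunk = DroppingSource(chunk_size=512).CHUNK      # parent default is used
forwarded_chunk = ForwardingSource(chunk_size=512).CHUNK  # requested value is used

try:
    SubSource(chunk_size=512)
    recursed = False
except RuntimeError:  # RecursionError on Python 3 is a subclass of RuntimeError
    recursed = True
```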

jhoelzl commented 7 years ago

Thanks, I will try again when I have time.

jhoelzl commented 7 years ago

Hi @Uberi ,

I need to copy all the code from listen() into my custom listen() function, since I modify a lot in this method. So I need something like this:

import speech_recognition as sr

import math
import audioop
import collections

class CustomAudioData(sr.AudioData):
    def __init__(self, frame_data, sample_rate, sample_width):
        print("about to initialize audiodata")
        result = super(self.__class__, self).__init__()
        print("done initializing audiodata")
        return result

class CustomSource(sr.Microphone):
    def __init__(self, device_index = None, sample_rate = 16000, chunk_size = 1024):
        print("about to initialize microphone")
        result = super(self.__class__, self).__init__()
        print("done initializing microphone")
        return result

class CustomRecognizer(sr.Recognizer):
    # Special custom changes in this method
    def listen(self, source, timeout=None, phrase_time_limit=None):
        """
        Records a single phrase from ``source`` (an ``AudioSource`` instance) into an ``AudioData`` instance, which it returns.
        This is done by waiting until the audio has an energy above ``recognizer_instance.energy_threshold`` (the user has started speaking), and then recording until it encounters ``recognizer_instance.pause_threshold`` seconds of non-speaking or there is no more audio input. The ending silence is not included.
        The ``timeout`` parameter is the maximum number of seconds that this will wait for a phrase to start before giving up and throwing an ``speech_recognition.WaitTimeoutError`` exception. If ``timeout`` is ``None``, there will be no wait timeout.
        The ``phrase_time_limit`` parameter is the maximum number of seconds that this will allow a phrase to continue before stopping and returning the part of the phrase processed before the time limit was reached. The resulting audio will be the phrase cut off at the time limit. If ``phrase_time_limit`` is ``None``, there will be no phrase time limit.
        This operation will always complete within ``timeout + phrase_time_limit`` seconds if both are numbers, either by returning the audio data, or by raising an exception.
        """
        assert isinstance(source, CustomSource), "Source must be an audio source"
        assert source.stream is not None, "Audio source must be entered before listening, see documentation for ``AudioSource``; are you using ``source`` outside of a ``with`` statement?"
        assert self.pause_threshold >= self.non_speaking_duration >= 0

        seconds_per_buffer = (source.CHUNK + 0.0) / source.SAMPLE_RATE
        pause_buffer_count = int(math.ceil(
            self.pause_threshold / seconds_per_buffer))  # number of buffers of non-speaking audio during a phrase, before the phrase should be considered complete
        phrase_buffer_count = int(math.ceil(
            self.phrase_threshold / seconds_per_buffer))  # minimum number of buffers of speaking audio before we consider the speaking audio a phrase
        non_speaking_buffer_count = int(math.ceil(
            self.non_speaking_duration / seconds_per_buffer))  # maximum number of buffers of non-speaking audio to retain before and after a phrase

        # read audio input for phrases until there is a phrase that is long enough
        elapsed_time = 0  # number of seconds of audio read
        buffer = b""  # an empty buffer means that the stream has ended and there is no data left to read
        while True:
            frames = collections.deque()

            # store audio input until the phrase starts
            while True:
                # handle waiting too long for phrase by raising an exception
                elapsed_time += seconds_per_buffer
                if timeout and elapsed_time > timeout:
                    raise sr.WaitTimeoutError("listening timed out while waiting for phrase to start")

                buffer = source.stream.read(source.CHUNK)
                if len(buffer) == 0: break  # reached end of the stream
                frames.append(buffer)
                if len(
                        frames) > non_speaking_buffer_count:  # ensure we only keep the needed amount of non-speaking buffers
                    frames.popleft()

                # detect whether speaking has started on audio input
                energy = audioop.rms(buffer, source.SAMPLE_WIDTH)  # energy of the audio signal
                if energy > self.energy_threshold: break

                # dynamically adjust the energy threshold using asymmetric weighted average
                if self.dynamic_energy_threshold:
                    damping = self.dynamic_energy_adjustment_damping ** seconds_per_buffer  # account for different chunk sizes and rates
                    target_energy = energy * self.dynamic_energy_ratio
                    self.energy_threshold = self.energy_threshold * damping + target_energy * (1 - damping)

            # read audio input until the phrase ends
            pause_count, phrase_count = 0, 0
            phrase_start_time = elapsed_time
            while True:
                # handle phrase being too long by cutting off the audio
                elapsed_time += seconds_per_buffer
                if phrase_time_limit and elapsed_time - phrase_start_time > phrase_time_limit:
                    break

                buffer = source.stream.read(source.CHUNK)
                if len(buffer) == 0: break  # reached end of the stream
                frames.append(buffer)
                phrase_count += 1

                # check if speaking has stopped for longer than the pause threshold on the audio input
                energy = audioop.rms(buffer, source.SAMPLE_WIDTH)  # unit energy of the audio signal within the buffer
                if energy > self.energy_threshold:
                    pause_count = 0
                else:
                    pause_count += 1
                if pause_count > pause_buffer_count:  # end of the phrase
                    break

            # check how long the detected phrase is, and retry listening if the phrase is too short
            phrase_count -= pause_count  # exclude the buffers for the pause before the phrase
            if phrase_count >= phrase_buffer_count or len(
                buffer) == 0: break  # phrase is long enough or we've reached the end of the stream, so stop listening

        # obtain frame data
        for i in range(
                    pause_count - non_speaking_buffer_count): frames.pop()  # remove extra non-speaking frames at the end
        frame_data = b"".join(list(frames))

        return CustomAudioData(frame_data, source.SAMPLE_RATE, source.SAMPLE_WIDTH)

sr.AudioData = CustomAudioData
sr.Recognizer = CustomRecognizer
sr.Microphone = CustomSource

r = sr.Recognizer()
with sr.Microphone(chunk_size=512) as source:
    audio = r.listen(source)
print("Google Speech Recognition thinks you said " + r.recognize_google(audio))

However, I have some trouble with the AudioData object. I tried to make a custom one (CustomAudioData), but I get an error when initializing the class, since I do not know how to get the required arguments 'frame_data', 'sample_rate', and 'sample_width'.
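
The error when initializing CustomAudioData most likely comes from calling the parent initializer without arguments: the three values arrive as parameters of the subclass __init__ and only need to be passed on. A minimal sketch, with a hypothetical stand-in for sr.AudioData whose initializer is assumed to take the same three parameters:

```python
class AudioData(object):
    """Hypothetical stand-in for sr.AudioData."""

    def __init__(self, frame_data, sample_rate, sample_width):
        self.frame_data = frame_data
        self.sample_rate = sample_rate
        self.sample_width = sample_width


class CustomAudioData(AudioData):
    def __init__(self, frame_data, sample_rate, sample_width):
        # Forward the values received by the subclass to the parent initializer.
        super(CustomAudioData, self).__init__(frame_data, sample_rate, sample_width)


audio = CustomAudioData(b"\x00\x01" * 4, 16000, 2)
```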

jhoelzl commented 7 years ago

Now it seems to work. It is not necessary to define a custom AudioData object; I just use sr.AudioData:

import speech_recognition as sr

import math
import audioop
import collections

class CustomSource(sr.Microphone):
    def __init__(self, device_index = None, sample_rate = 16000, chunk_size = 1024):
        print("about to initialize microphone")
        result = super(CustomSource, self).__init__(device_index, sample_rate, chunk_size)  # name the class explicitly and forward the arguments
        print("done initializing microphone")
        return result

class CustomRecognizer(sr.Recognizer):
    # Special custom changes in this method
    def listen(self, source, timeout=None, phrase_time_limit=None):
        """
        Records a single phrase from ``source`` (an ``AudioSource`` instance) into an ``AudioData`` instance, which it returns.
        This is done by waiting until the audio has an energy above ``recognizer_instance.energy_threshold`` (the user has started speaking), and then recording until it encounters ``recognizer_instance.pause_threshold`` seconds of non-speaking or there is no more audio input. The ending silence is not included.
        The ``timeout`` parameter is the maximum number of seconds that this will wait for a phrase to start before giving up and throwing an ``speech_recognition.WaitTimeoutError`` exception. If ``timeout`` is ``None``, there will be no wait timeout.
        The ``phrase_time_limit`` parameter is the maximum number of seconds that this will allow a phrase to continue before stopping and returning the part of the phrase processed before the time limit was reached. The resulting audio will be the phrase cut off at the time limit. If ``phrase_time_limit`` is ``None``, there will be no phrase time limit.
        This operation will always complete within ``timeout + phrase_time_limit`` seconds if both are numbers, either by returning the audio data, or by raising an exception.
        """
        assert isinstance(source, CustomSource), "Source must be an audio source"
        assert source.stream is not None, "Audio source must be entered before listening, see documentation for ``AudioSource``; are you using ``source`` outside of a ``with`` statement?"
        assert self.pause_threshold >= self.non_speaking_duration >= 0

        seconds_per_buffer = (source.CHUNK + 0.0) / source.SAMPLE_RATE
        pause_buffer_count = int(math.ceil(
            self.pause_threshold / seconds_per_buffer))  # number of buffers of non-speaking audio during a phrase, before the phrase should be considered complete
        phrase_buffer_count = int(math.ceil(
            self.phrase_threshold / seconds_per_buffer))  # minimum number of buffers of speaking audio before we consider the speaking audio a phrase
        non_speaking_buffer_count = int(math.ceil(
            self.non_speaking_duration / seconds_per_buffer))  # maximum number of buffers of non-speaking audio to retain before and after a phrase

        # read audio input for phrases until there is a phrase that is long enough
        elapsed_time = 0  # number of seconds of audio read
        buffer = b""  # an empty buffer means that the stream has ended and there is no data left to read
        while True:
            frames = collections.deque()

            # store audio input until the phrase starts
            while True:
                # handle waiting too long for phrase by raising an exception
                elapsed_time += seconds_per_buffer
                if timeout and elapsed_time > timeout:
                    raise sr.WaitTimeoutError("listening timed out while waiting for phrase to start")

                buffer = source.stream.read(source.CHUNK)
                if len(buffer) == 0: break  # reached end of the stream
                frames.append(buffer)
                if len(
                        frames) > non_speaking_buffer_count:  # ensure we only keep the needed amount of non-speaking buffers
                    frames.popleft()

                # detect whether speaking has started on audio input
                energy = audioop.rms(buffer, source.SAMPLE_WIDTH)  # energy of the audio signal
                if energy > self.energy_threshold: break

                # dynamically adjust the energy threshold using asymmetric weighted average
                if self.dynamic_energy_threshold:
                    damping = self.dynamic_energy_adjustment_damping ** seconds_per_buffer  # account for different chunk sizes and rates
                    target_energy = energy * self.dynamic_energy_ratio
                    self.energy_threshold = self.energy_threshold * damping + target_energy * (1 - damping)

            # read audio input until the phrase ends
            pause_count, phrase_count = 0, 0
            phrase_start_time = elapsed_time
            while True:
                # handle phrase being too long by cutting off the audio
                elapsed_time += seconds_per_buffer
                if phrase_time_limit and elapsed_time - phrase_start_time > phrase_time_limit:
                    break

                buffer = source.stream.read(source.CHUNK)
                if len(buffer) == 0: break  # reached end of the stream
                frames.append(buffer)
                phrase_count += 1

                # check if speaking has stopped for longer than the pause threshold on the audio input
                energy = audioop.rms(buffer, source.SAMPLE_WIDTH)  # unit energy of the audio signal within the buffer
                if energy > self.energy_threshold:
                    pause_count = 0
                else:
                    pause_count += 1
                if pause_count > pause_buffer_count:  # end of the phrase
                    break

            # check how long the detected phrase is, and retry listening if the phrase is too short
            phrase_count -= pause_count  # exclude the buffers for the pause before the phrase
            if phrase_count >= phrase_buffer_count or len(
                buffer) == 0: break  # phrase is long enough or we've reached the end of the stream, so stop listening

        # obtain frame data
        for i in range(
                    pause_count - non_speaking_buffer_count): frames.pop()  # remove extra non-speaking frames at the end
        frame_data = b"".join(list(frames))

        return sr.AudioData(frame_data, source.SAMPLE_RATE, source.SAMPLE_WIDTH)

sr.Recognizer = CustomRecognizer
sr.Microphone = CustomSource

r = sr.Recognizer()
with sr.Microphone(chunk_size=512) as source:
    audio = r.listen(source)
print("Google Speech Recognition thinks you said " + r.recognize_google(audio))
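
Since the goal is to tune the voice activity decision, the energy logic from the listen method above can be exercised on its own. The sketch below is a simplified, pure-Python version: rms_16bit is a hypothetical replacement for audioop.rms that handles only 16-bit little-endian samples and returns a float (the audioop module was removed from the standard library in Python 3.13), and the default damping and ratio values are assumptions, not values confirmed in this thread.

```python
import math
import struct


def rms_16bit(buffer):
    """Root-mean-square energy of a buffer of 16-bit little-endian samples."""
    count = len(buffer) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, buffer[:count * 2])
    return math.sqrt(sum(s * s for s in samples) / float(count))


def update_threshold(threshold, energy, seconds_per_buffer, damping=0.15, ratio=1.5):
    """One step of the weighted-average threshold update used in listen()."""
    step_damping = damping ** seconds_per_buffer  # account for chunk size and rate
    target = energy * ratio
    return threshold * step_damping + target * (1 - step_damping)


seconds_per_buffer = 512 / 16000.0
quiet_buffer = struct.pack("<4h", 100, -100, 100, -100)
energy = rms_16bit(quiet_buffer)

# Feed the same background energy repeatedly and watch the threshold settle.
threshold = 300.0
for _ in range(500):
    threshold = update_threshold(threshold, energy, seconds_per_buffer)
```

With a constant background energy, the threshold settles near energy * ratio, so speech has to exceed the ambient level by that factor to start a phrase.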

Aakashdeveloper commented 7 years ago

Adding this piece of code:

r = sr.Recognizer()
device_index = 1
with sr.Microphone(device_index, 16000, 2048) as source:
    r.adjust_for_ambient_noise(source)
    logging.info("checked minimum energy threshold to {}".format(r.energy_threshold))
    time.sleep(1.0)

I am able to do STT.

Uberi / speech_recognition

How to monkey-patch the Recognizer.listen method? #162