Picovoice / porcupine

On-device wake word detection powered by deep learning
https://picovoice.ai/
Apache License 2.0

Using Porcupine to trigger Google Speech API #88

Closed. osmadja closed this issue 6 years ago

osmadja commented 6 years ago

I'm trying to use Porcupine wake word detection as a trigger for the Google Speech API.

This is not exactly an issue; I am looking for guidance on the best way to use Porcupine to trigger the Google Speech API from Python.

The problem seems to stem from the fact that both Porcupine and the Google Speech calls use PyAudio.

I used the Python Porcupine demo code as a basis and then call the Google Speech API after each detection.

It is not clear whether I need to create a new PyAudio instance to call the Google Speech API or whether I could "share" the same PyAudio instance.

When I terminate the Porcupine PyAudio instance (pa.terminate()) and create a new PyAudio instance for the Google Speech API, it seems to work. I then need to recreate a PyAudio instance for Porcupine. This works a few times and then hangs in the Porcupine demo's 'while True' loop.

I also tried sharing the Porcupine PyAudio instance (after changing the PyAudio parameters to the ones recommended by Google).

It works once: I can call the Google Speech API and it returns correctly, but it then raises the following error when the PyAudio instance is reused by Porcupine:

Traceback (most recent call last):
  File "porcupine_human.py", line 264, in <module>
    input_device_index=args.input_audio_device_index).run()
  File "porcupine_human.py", line 135, in run
    pcm = audio_stream.read(porcupine.frame_length)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyaudio.py", line 608, in read
    return pa.read_stream(self._stream, num_frames, exception_on_overflow)
OSError: [Errno -9981] Input overflowed

Expected behavior

After detecting a wake word, it should call the Google Speech API.

Actual behavior

It stops after a few API calls.

Steps to reproduce the behavior

Configure the Google credentials:

export GOOGLE_APPLICATION_CREDENTIALS="/Users/xxx/xxx/google-api/xxxx-xxx.json"

Replace amigo_mac.ppn with your own wake word file.

Execute:

$ python3.6 porcupine_demo_gvoice.py --keyword_file_paths ../../amigo_mac.ppn 

Say the wake word, wait to see "* detected keyword" on the console, and then say any other text (which should be recognized by the Google API).

Repeat this 5 times to see the problem happening.

porcupine_demo_gvoice.py

#
# Copyright 2018 Picovoice Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import argparse
import os
import platform
import struct
import sys
from datetime import datetime
from threading import Thread

import numpy as np
import pyaudio
import soundfile

sys.path.append(os.path.join(os.path.dirname(__file__), '../../binding/python'))

from porcupine import Porcupine

from vr import use_once

class PorcupineDemo(Thread):
    """
    Demo class for wake word detection (aka Porcupine) library. It creates an input audio stream from a microphone,
    monitors it, and upon detecting the specified wake word(s) prints the detection time and index of wake word on
    console. It optionally saves the recorded audio into a file for further review.
    """

    def __init__(
            self,
            library_path,
            model_file_path,
            keyword_file_paths,
            sensitivities,
            input_device_index=None,
            output_path=None):

        """
        Constructor.

        :param library_path: Absolute path to Porcupine's dynamic library.
        :param model_file_path: Absolute path to the model parameter file.
        :param keyword_file_paths: List of absolute paths to keyword files.
        :param sensitivities: Sensitivity parameter for each wake word. For more information refer to
        'include/pv_porcupine.h'. It uses the
        same sensitivity value for all keywords.
        :param input_device_index: Optional argument. If provided, audio is recorded from this input device. Otherwise,
        the default audio input device is used.
        :param output_path: If provided recorded audio will be stored in this location at the end of the run.
        """

        super(PorcupineDemo, self).__init__()

        self._library_path = library_path
        self._model_file_path = model_file_path
        self._keyword_file_paths = keyword_file_paths
        self._sensitivities = sensitivities
        self._input_device_index = input_device_index

        self._output_path = output_path
        if self._output_path is not None:
            self._recorded_frames = []

    def run(self):
        """
         Creates an input audio stream, initializes wake word detection (Porcupine) object, and monitors the audio
         stream for occurrences of the wake word(s). It prints the time of detection for each occurrence and index of
         wake word.
         """

        num_keywords = len(self._keyword_file_paths)

        keyword_names =\
            [os.path.basename(x).replace('.ppn', '').replace('_tiny', '').split('_')[0] for x in self._keyword_file_paths]

        print('listening for:')
        for keyword_name, sensitivity in zip(keyword_names, self._sensitivities):
            print('- %s (sensitivity: %f)' % (keyword_name, sensitivity))

        porcupine = None
        pa = None
        audio_stream = None
        try:
            porcupine = Porcupine(
                library_path=self._library_path,
                model_file_path=self._model_file_path,
                keyword_file_paths=self._keyword_file_paths,
                sensitivities=self._sensitivities)

            print('Frames per buffer=%d' % porcupine.frame_length)
            print('Rate=%d' % porcupine.sample_rate)

            pa = pyaudio.PyAudio()
            audio_stream = pa.open(
                rate=porcupine.sample_rate,
                channels=1,
                format=pyaudio.paInt16,
                input=True,
                # frames_per_buffer=porcupine.frame_length,
                # 100 ms buffer, matching the chunk size recommended for the Google streaming API
                frames_per_buffer=int(porcupine.sample_rate / 10),
                input_device_index=self._input_device_index)
            py_audio_has_been_reset = False

            i = 0

            while True:
                i = i + 1

                if py_audio_has_been_reset:
                    pa = pyaudio.PyAudio()
                    audio_stream = pa.open(
                        rate=porcupine.sample_rate,
                        channels=1,
                        format=pyaudio.paInt16,
                        input=True,
                        frames_per_buffer=porcupine.frame_length,
                        input_device_index=self._input_device_index)

                pcm = audio_stream.read(porcupine.frame_length)
                pcm = struct.unpack_from("h" * porcupine.frame_length, pcm)

                if self._output_path is not None:
                    self._recorded_frames.append(pcm)

                result = porcupine.process(pcm)
                if num_keywords == 1 and result:
                    print('[%s] * detected keyword' % str(datetime.now()))
                    self.trigger_voice_recognition(pa)
                    py_audio_has_been_reset = False
                elif num_keywords > 1 and result >= 0:
                    print('[%s] * detected %s' % (str(datetime.now()), keyword_names[result]))
                    self.trigger_voice_recognition(pa)
                    py_audio_has_been_reset = False

        except KeyboardInterrupt:
            print('stopping ...')
        finally:
            if porcupine is not None:
                porcupine.delete()

            if audio_stream is not None:
                audio_stream.close()

            if pa is not None:
                pa.terminate()

            if self._output_path is not None and len(self._recorded_frames) > 0:
                recorded_audio = np.concatenate(self._recorded_frames, axis=0).astype(np.int16)
                soundfile.write(self._output_path, recorded_audio, samplerate=porcupine.sample_rate, subtype='PCM_16')

    _AUDIO_DEVICE_INFO_KEYS = ['index', 'name', 'defaultSampleRate', 'maxInputChannels']

    @classmethod
    def show_audio_devices_info(cls):
        """ Provides information regarding different audio devices available. """

        pa = pyaudio.PyAudio()

        for i in range(pa.get_device_count()):
            info = pa.get_device_info_by_index(i)
            print(', '.join("'%s': '%s'" % (k, str(info[k])) for k in cls._AUDIO_DEVICE_INFO_KEYS))

        pa.terminate()

    def trigger_voice_recognition(self, my_py_audio):
        print("Triggering VoiceRecognition")
        language_code = 'pt-BR'  # a BCP-47 language tag

        # closing pyaudio of porcupine
        # my_py_audio.terminate()

        # create a pyaudio instance for google api
        # pa2 = pyaudio.PyAudio()
        use_once(language_code, my_py_audio)
        print("End VoiceRecognition")
        # kills pyaudio of google
        # pa2.terminate()

def _default_library_path():
    system = platform.system()
    machine = platform.machine()

    if system == 'Darwin':
        return os.path.join(os.path.dirname(__file__), '../../lib/mac/%s/libpv_porcupine.dylib' % machine)
    elif system == 'Linux':
        if machine == 'x86_64' or machine == 'i386':
            return os.path.join(os.path.dirname(__file__), '../../lib/linux/%s/libpv_porcupine.so' % machine)
        else:
            raise Exception('cannot autodetect the binary type. Please enter the path to the shared object using --library_path command line argument.')
    elif system == 'Windows':
        if platform.architecture()[0] == '32bit':
            return os.path.join(os.path.dirname(__file__), '..\\..\\lib\\windows\\i686\\libpv_porcupine.dll')
        else:
            return os.path.join(os.path.dirname(__file__), '..\\..\\lib\\windows\\amd64\\libpv_porcupine.dll')
    raise NotImplementedError('Porcupine is not supported on %s/%s yet!' % (system, machine))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    parser.add_argument('--keyword_file_paths', help='comma-separated absolute paths to keyword files', type=str)

    parser.add_argument(
        '--library_path',
        help="absolute path to Porcupine's dynamic library",
        type=str)

    parser.add_argument(
        '--model_file_path',
        help='absolute path to model parameter file',
        type=str,
        default=os.path.join(os.path.dirname(__file__), '../../lib/common/porcupine_params.pv'))

    parser.add_argument('--sensitivities', help='detection sensitivity [0, 1]', default=0.5)
    parser.add_argument('--input_audio_device_index', help='index of input audio device', type=int, default=None)

    parser.add_argument(
        '--output_path',
        help='absolute path to where recorded audio will be stored. If not set, it will be bypassed.',
        type=str,
        default=None)

    parser.add_argument('--show_audio_devices_info', action='store_true')

    args = parser.parse_args()

    if args.show_audio_devices_info:
        PorcupineDemo.show_audio_devices_info()
    else:
        if not args.keyword_file_paths:
            raise ValueError('keyword file paths are missing')

        keyword_file_paths = [x.strip() for x in args.keyword_file_paths.split(',')]

        if isinstance(args.sensitivities, float):
            sensitivities = [args.sensitivities] * len(keyword_file_paths)
        else:
            sensitivities = [float(x) for x in args.sensitivities.split(',')]

        PorcupineDemo(
            library_path=args.library_path if args.library_path is not None else _default_library_path(),
            model_file_path=args.model_file_path,
            keyword_file_paths=keyword_file_paths,
            sensitivities=sensitivities,
            output_path=args.output_path,
            input_device_index=args.input_audio_device_index).run()

vr.py: Google Speech API calls

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright 2017 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Google Cloud Speech API sample application using the streaming API.

use google api to process voice recognition and send it to HumanRobotics core system

asr.ini file is used to configure the module
"""

# [START import_libraries]
from __future__ import division

import re
import sys
import traceback

from google.cloud import speech
from google.cloud.speech import enums
from google.cloud.speech import types
import pyaudio
from six.moves import queue

import configparser
import time
import os
import logging
from threading import Thread

from tools import MyLogger
from tools import MySocket
from tools import Status
from tools import Tools

# [END import_libraries]

# Audio recording parameters
RATE = 16000
CHUNK = int(RATE / 10)  # 100ms

class MicrophoneStream(object):

    """Opens a recording stream as a generator yielding the audio chunks."""
    def __init__(self, rate, chunk):
        self._rate = rate
        self._chunk = chunk

        self.re_init()

        self.close_py_audio_streams = False

        # config
        self.config_mic_index = 0
        self.config_mic_name = "?"
        self.config_time_to_wait_to_consider_temp_result_in_ms = 1500
        self.config_wake_word_1 = "amigo"
        self.config_wake_word_2 = "amiga"
        self.config_robot_ip = "localhost"
        self.config_wake_word_period_validity_in_s = 60
        self.init_config()

        # audio stream (created by pyaudio)
        self._audio_interface = None
        self._audio_stream = None
        self._buff = queue.Queue()
        self.closed = False

        # to avoid sending twice the same status
        self.last_sent_status = -1

        # to keep track of when the last wake word has been received
        self.last_wake_word_received_timestamp = 0

        self.last_manage_response_loop_call = -1
        # to keep track of when the object started
        self.created = time.time()

    def re_init(self):
        # Create a thread-safe buffer of audio data
        self._buff = queue.Queue()
        self.closed = True

    def get_microphone_index(self):
        p = pyaudio.PyAudio()
        info = p.get_host_api_info_by_index(0)
        numdevices = info.get('deviceCount')

        index = -1

        print("Found {} devices".format(numdevices))

        for i in range(0, numdevices):
            if (p.get_device_info_by_host_api_device_index(0, i).get('maxInputChannels')) > 0:
                name = p.get_device_info_by_host_api_device_index(0, i).get('name')

                if self.config_mic_name in name:
                    print("> Input Device id {} - {} **".format(i, name))
                    index = i
                else:
                    print("> Input Device id {} - {}".format(i, name))
        return index

    def init_config(self):
        config = configparser.ConfigParser()
        config.read("asr.ini")
        self.config_mic_name = config['ASR']['mic_name']
        self.config_mic_index = self.get_microphone_index()

        print("Config : Microphone array {} : index {}".format(self.config_mic_name, self.config_mic_index))

    def __enter__(self):
        # init_audio_stream() requires a PyAudio instance, so create one here
        return self.init_audio_stream(pyaudio.PyAudio())

    def __exit__(self, type, value, traceback):
        self.close_audio_stream()

    def init_audio_stream(self, my_py_audio):

        self._audio_interface = my_py_audio
        self._audio_stream = self._audio_interface.open(
            format=pyaudio.paInt16,
            # The API currently only supports 1-channel (mono) audio
            # https://goo.gl/z757pE
            channels=1, rate=self._rate,
            input=True, frames_per_buffer=self._chunk,
            # Run the audio stream asynchronously to fill the buffer object.
            # This is necessary so that the input device's buffer doesn't
            # overflow while the calling thread makes network requests, etc.
            stream_callback=self._fill_buffer,
            input_device_index=self.config_mic_index
        )

        self.closed = False

        return self

    def close_audio_stream(self):

        if self.close_py_audio_streams:
            self._audio_stream.stop_stream()
            self._audio_stream.close()
            self._audio_interface.terminate()
        self.closed = True
        # Signal the generator to terminate so that the client's
        # streaming_recognize method will not block the process termination.
        self._buff.put(None)

    def _fill_buffer(self, in_data, frame_count, time_info, status_flags):
        """Continuously collect data from the audio stream, into the buffer."""
        self._buff.put(in_data)
        return None, pyaudio.paContinue

    def generator(self):
        while not self.closed:
            # Use a blocking get() to ensure there's at least one chunk of
            # data, and stop iteration if the chunk is None, indicating the
            # end of the audio stream.
            chunk = self._buff.get()
            if chunk is None:
                return
            data = [chunk]

            # Now consume whatever other data's still buffered.
            while True:
                try:
                    chunk = self._buff.get(block=False)
                    if chunk is None:
                        return
                    data.append(chunk)
                except queue.Empty:
                    break

            yield b''.join(data)

    def manage_responses(self, responses):
        """Iterates through server responses and prints them.

        The responses passed is a generator that will block until a response
        is provided by the server.

        Each response may contain multiple results, and each result may contain
        multiple alternatives; for details, see https://goo.gl/tjCPAU.  Here we
        print only the transcription for the top alternative of the top result.

        In this case, responses are provided for interim results as well. If the
        response is an interim one, print a line feed at the end of it, to allow
        the next result to overwrite it, until the response is a final one. For the
        final one, print a newline to preserve the finalized transcription.
        """

        already_sent_processing = False

        self.last_manage_response_loop_call = -1

        num_chars_printed = 0
        last_response = time.time()
        for response in responses:

            self.last_manage_response_loop_call = time.time()

            if not already_sent_processing:
                already_sent_processing = True

            # myLog.info("in for responses ")
            if not response.results:

                # if (almost) no chars have been printed yet, there was no
                # interim result before this empty response => treat it as stale
                if num_chars_printed < 3:
                    break
                else:
                    continue
            else:
                last_response = time.time() - last_response

            # myLog.info("responses results {} ".format(len(response.results)))
            # myLog.info(len(response.results))

            # The `results` list is consecutive. For streaming, we only care about
            # the first result being considered, since once it's `is_final`, it
            # moves on to considering the next utterance.
            result = response.results[0]
            if not result.alternatives:
                continue

            # Display the transcription of the top alternative.
            transcript = result.alternatives[0].transcript

            # Display interim results, but with a carriage return at the end of the
            # line, so subsequent lines will overwrite them.
            #
            # If the previous result was longer than this one, we need to print
            # some extra spaces to overwrite the previous result
            overwrite_chars = ' ' * (num_chars_printed - len(transcript))

            if not result.is_final:

                #print(transcript + overwrite_chars + '\r')
                #sys.stdout.write(transcript + overwrite_chars + '\r')
                #sys.stdout.flush()

                num_chars_printed = len(transcript)

                # myLog.info("temp result:{} - when :{} - stability:{}".format(transcript, 0, result.stability))

            else:
                print("### {}".format(transcript))

                # NOTE: we break after the first final result, so the lines below
                # (left over from the original Google sample) are unreachable.
                break
                # print("Final : "+ transcript + overwrite_chars )

                # Exit recognition if any of the transcribed phrases could be
                # one of our keywords.
                if re.search(r'\b(exit|quit)\b', transcript, re.I):
                    print('Exiting..')
                    break

                num_chars_printed = 0

    # [END audio_stream]

def use_once(language_code, my_py_audio):
    client = speech.SpeechClient()
    config = types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=RATE,
        language_code=language_code)
    streaming_config = types.StreamingRecognitionConfig(
        config=config,
        interim_results=True,
        single_utterance=True)

    stream = MicrophoneStream(RATE, CHUNK)

    stream.init_audio_stream(my_py_audio)
    audio_generator = stream.generator()
    requests = (types.StreamingRecognizeRequest(audio_content=content)
                for content in audio_generator)

    responses = client.streaming_recognize(streaming_config, requests)

    # Now, put the transcription responses to use.
    stream.manage_responses(responses)
    stream.re_init()
    stream.close_audio_stream()

def main():
    # See http://g.co/cloud/speech/docs/languages
    # for a list of supported languages.
    # language_code = 'en-US'  # a BCP-47 language tag
    # language_code = 'pt-BR'  # a BCP-47 language tag
    language_code = 'en-US'  # a BCP-47 language tag

    my_py_audio = pyaudio.PyAudio()
    use_once(language_code, my_py_audio)

if __name__ == '__main__':
    main()

File asr.ini: configuration file that defines which microphone to use (put the USB microphone's name in it). Place this file in the same directory as the .py files.

[ASR]
mic_name=ReSpeaker
kenarsa commented 6 years ago

Please don't copy-paste large amounts of code into the issue; I certainly won't be able to make time to read through all of this.

You probably want to have one audio stream instance and feed the audio yourself to Porcupine and then to the Google API. That should work.
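
A minimal sketch of that single-stream idea (untested; the file paths, the language code, and the fixed 5-second listening window are placeholder assumptions, and the Porcupine and Google calls mirror the snippets earlier in this thread):

#
# Sketch only: one PyAudio stream stays open for the whole run; frames are read
# in user code and routed either to Porcupine (wake word detection) or to the
# Google streaming API (after a detection).
#
import struct

import pyaudio
from google.cloud import speech
from google.cloud.speech import enums, types

from porcupine import Porcupine

porcupine = Porcupine(
    library_path='libpv_porcupine.dylib',       # placeholder; adjust for your platform
    model_file_path='porcupine_params.pv',
    keyword_file_paths=['amigo_mac.ppn'],
    sensitivities=[0.5])

pa = pyaudio.PyAudio()
audio_stream = pa.open(
    rate=porcupine.sample_rate,
    channels=1,
    format=pyaudio.paInt16,
    input=True,
    frames_per_buffer=porcupine.frame_length)

client = speech.SpeechClient()
streaming_config = types.StreamingRecognitionConfig(
    config=types.RecognitionConfig(
        encoding=enums.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=porcupine.sample_rate,
        language_code='en-US'),                 # placeholder language
    interim_results=False,
    single_utterance=True)


def transcribe_window(seconds):
    """Stream the next `seconds` of audio from the same open stream to Google."""
    num_frames = int(seconds * porcupine.sample_rate / porcupine.frame_length)
    requests = (
        types.StreamingRecognizeRequest(
            audio_content=audio_stream.read(porcupine.frame_length,
                                            exception_on_overflow=False))
        for _ in range(num_frames))
    for response in client.streaming_recognize(streaming_config, requests):
        for result in response.results:
            if result.is_final:
                print('>> %s' % result.alternatives[0].transcript)
                return


try:
    while True:
        pcm = audio_stream.read(porcupine.frame_length, exception_on_overflow=False)
        pcm = struct.unpack_from('h' * porcupine.frame_length, pcm)
        if porcupine.process(pcm):
            print('* detected keyword')
            transcribe_window(5)   # hand the microphone over to Google, then resume
finally:
    audio_stream.close()
    pa.terminate()
    porcupine.delete()

The key difference from the script above is that nothing ever closes or reopens the stream: Porcupine and the Google request generator both read from the one stream, so there is no second PyAudio instance competing for the device.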

bbrendon commented 3 years ago

@osmadja did you ever complete your script? I'm trying to do the same thing.

alsacian commented 3 years ago

This dirty code combines Porcupine and Google Cloud Speech; it is open to improvement.

porcupine_google.docx

bbrendon commented 3 years ago

This dirty code combines Porcupine and Google Cloud Speech; it is open to improvement.

Thanks. I'll probably try it at some point.