garbit / whisper-voice-assistant

OpenAI Whisper automatic speech recognition powered voice assistant using Porcupine wake-word detection and Cobra voice activity detection
MIT License

Can this be used without using picovoice cloud services? #1

Open douglasg14b opened 18 hours ago

douglasg14b commented 18 hours ago

Or without using their API or console?

garbit commented 18 hours ago

Currently it cannot, unfortunately. However, if you can find your own wake word detector, you can use that instead of Picovoice. Underneath it all is Whisper for the actual speech-to-text. The Porcupine module detects the wake word that begins listening, and Picovoice Cobra classifies the incoming audio as either containing spoken word or not, to work out when to stop recording audio.

If you do find an offline module for wake word detection, add it to this thread, as this repo was created a little while back.

What are you trying to create?

garbit commented 18 hours ago

After a bit of Googling there are a few offline wake word detectors:

- OpenWakeWord
- Howl
- MicroWakeWord

These are all offline and suited to on-device implementations. You just need to replace the Picovoice part of this with a different VAD and wake word detector and you're all set.
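For example, a rough sketch of what the wake word half of that swap might look like with openWakeWord; the model name, frame size, and threshold here are illustrative and untested against this repo:

import numpy as np
import openwakeword
from openwakeword.model import Model

openwakeword.utils.download_models() # one-time fetch of the pre-trained models

oww = Model(wakeword_models=["hey_jarvis"]) # "hey_jarvis" is one of openWakeWord's pre-trained models; a path to your own model also works

def is_wake_word(frame: np.ndarray, threshold: float = 0.5) -> bool:
    # frame: 16-bit PCM audio at 16 kHz (openWakeWord works on ~80 ms chunks, i.e. 1280 samples)
    scores = oww.predict(frame) # {model_name: confidence between 0.0 and 1.0}
    return any(score >= threshold for score in scores.values())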

douglasg14b commented 16 hours ago

Also found this https://github.com/secretsauceai/precise-wakeword-model-maker

Seems abandoned though.

Is Picovoice just used for training the wake word? They advertise their project as on-device, which implies it should work locally without calling home?


I'm trying to break into this space by making a simple voice assistant, with the goal of making an assistant more catered to ADHD idiosyncrasies.

My skills lie in app/backend/enterprise development. Building complex systems to handle commands, integrations (official or reverse engineered), & translations is easy for me. However, getting a handle on the building blocks in the data-science & ML space to make a "natural language" interface is proving to be the difficult part.

This project, for example, is deceptively simple once you open main.py, but there's zero chance I would have been able to figure it out on my own in a reasonable amount of time.

garbit commented 7 hours ago

Although Picovoice is on-device, it does a key activation check, and I've found that in an entirely offline situation, if you don't first validate the key, it'll hang, as my colleague found: https://github.com/Picovoice/porcupine/issues/579

This was a while ago though so it might have been changed.

Picovoice provides a few parts of main.py:

  1. Audio input: Picovoice has an audio processing library (PvRecorder) that allows you to easily access the mic without having to go super low-level.
  2. Wake Word detection ("Hey Alexa" / "Hey Google" / "Hey Bixby", etc.): recorder.read() reads in audio samples, and then you call porcupine.process(data) to pass the samples into the Porcupine wake word detector (these lines).
  3. Voice Activity Detection: Cobra classifies each audio sample as containing a voice actually speaking or not, with a confidence from 0.0 to 1.0 (this bit).
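
For reference, the setup those three parts need looks roughly like this. This is a sketch, not a verbatim copy of main.py; the keyword and access key are placeholders:

import pvporcupine
import pvcobra
from pvrecorder import PvRecorder

ACCESS_KEY = "YOUR_PICOVOICE_ACCESS_KEY" # placeholder; issued via the Picovoice console

porcupine = pvporcupine.create(access_key=ACCESS_KEY, keywords=["picovoice"]) # wake word detector, using a built-in keyword
cobra = pvcobra.create(access_key=ACCESS_KEY) # voice activity detector
recorder = PvRecorder(device_index=-1, frame_length=porcupine.frame_length) # mic capture; frame_length must match Porcupine's (512)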

The process is as follows:

  1. Initialise the recorder object:
recorder.start() # start the mic capture
max_window_in_secs = 3 # how long the audio window should be, in seconds
window_size = sample_rate * max_window_in_secs # `sample_rate` is the number of audio samples captured from the microphone per second; multiplying by the window length in seconds gives the total number of samples in your audio window over n seconds
samples = deque(maxlen=(window_size * 6)) # a bounded queue of audio samples (window_size * 6 samples, i.e. 18 seconds of audio at 16 kHz)
vad_samples = deque(maxlen=25) # a bounded queue of recent Voice Activity Detection results
is_recording = False # starts as False when we first start the script

This is where the 16,000 and the frame size of 512 come from:

recorder = PvRecorder(device_index=-1, frame_length=512) # -1 selects the default input device; each frame holds 512 audio samples, captured at a 16,000 Hz sample rate
# frame length = 512 samples per frame
# sample rate = 16,000 samples per second
# frames per second = 16,000 / 512 ≈ 31.25
  2. Read the audio in, get the probability that there's voice in our sample window, and see if our wake word detector is confident that it's heard the wake word:
while True:
        data = recorder.read() # read the audio buffer from the mic
        vad_prob = cobra.process(data) # ask Cobra for the probability that there's spoken voice in this frame
        vad_samples.append(vad_prob) # keep the recent VAD scores so we can later average them and use a threshold to turn the mic off

        if porcupine.process(data) >= 0: # process() returns the index of the detected keyword, or -1 if no wake word was heard in this frame
            print("Detected wakeword")
            is_recording = True # start recording flag
            samples.clear() # clear our old queue
  3. While we're recording, work out whether to keep listening by taking the mean of the voice activity detector scores (literally sum the 0.0 to 1.0 scores from the model and divide by how many are in the array). If that mean is greater than the threshold set above, keep listening on the mic, because we think there's still spoken word in the audio; otherwise, use the Whisper model to turn the speech into text (a tiny standalone example of the averaging follows the snippet below).
        if is_recording:
            if (
                len(samples) < window_size # if the number of samples we have recorded is less than the number we expect in the window (`sample_rate * max_window_in_secs`, so 16,000 Hz * 3 seconds = 48,000), keep adding to our samples queue until the window is full
                or np.mean(vad_samples) >= vad_mean_probability_sensitivity # or average the voice activity detector results (0.0 = no spoken voice, 1.0 = definitely spoken voice); keep listening while the model is confident it's still hearing spoken word
            ):
                samples.extend(data) # keep adding the audio data to our samples queue
                print(f"listening - samples: {len(samples)}")
            else:
                print("is_recording: False")
                print(transcriber.transcribe(samples)) # transcriber is just a class that runs Whisper on the data we've taken from the microphone and prints the text (see below for more details)
                is_recording = False # stop recording because the voice activity detector is no longer detecting spoken voice
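
To make the averaging concrete, here's a tiny standalone illustration; the 0.4 threshold is an example value, not necessarily what main.py uses:

import numpy as np
from collections import deque

vad_samples = deque([0.9, 0.8, 0.1], maxlen=25) # three recent Cobra scores
print(np.mean(vad_samples)) # (0.9 + 0.8 + 0.1) / 3 = 0.6

vad_mean_probability_sensitivity = 0.4 # example threshold
print(np.mean(vad_samples) >= vad_mean_probability_sensitivity) # True, so keep listening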

Transcriber: this class uses Whisper to take an array of audio samples and transcribe the spoken words in the sample.

import os
import time

import numpy as np
import whisper

class Transcriber:
    def __init__(self, model) -> None:
        self.model = whisper.load_model(model) # load the Whisper model
        print("loading model finished")
        self.prompts = os.environ.get("WHISPER_INITIAL_PROMPT", "") # optional prompt to bias the transcription
        print(f"Using prompts: {self.prompts}")

    def transcribe(self, frames):
        transcribe_start = time.time() # take the start time

        """
        Take the frames from the audio (often there are two channels, Left and Right, so flatten these)
        and convert the int16 values into the float32 values the Whisper model expects.
        Then normalise the array: the int16 values (ranging from -32768 to 32767) are divided by 32768.0
        to instead get a range from -1 to 1, which is what Whisper expects.
        """
        samples = np.array(frames, np.int16).flatten().astype(np.float32) / 32768.0

        result = self.model.transcribe(
            audio=samples,
            language="en",
            fp16=False,
            initial_prompt=self.prompts,
        ) # run the Whisper model to turn the audio into text

        transcribe_end = time.time() # take the end time
        print(f"transcription took {transcribe_end - transcribe_start:.2f}s")
        return result.get("text", "speech not detected") # return the transcribed text
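
And a hypothetical usage of the class above (the model name and the silent test buffer are just examples):

import numpy as np

transcriber = Transcriber("base") # load Whisper's "base" checkpoint

frames = np.zeros(16000, dtype=np.int16) # e.g. one second of silence at 16 kHz, shaped like the int16 frames PvRecorder produces
print(transcriber.transcribe(frames))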

Hopefully that makes things clearer for you; let me know if you need more explanation on any parts. I should have left comments in the codebase!