Currently it cannot, unfortunately. However, if you can find your own wake word detector then you can use that instead of Picovoice. Underneath it all, Whisper does the actual speech-to-text. The Porcupine module detects the wake word to begin listening, and Picovoice Cobra classifies the incoming audio as containing spoken word or not, to work out when to stop recording audio.
If you do find an offline module for wake word detection, add it to this thread, as this repo was created a little while back.
What are you trying to create?
After a bit of Googling, there are a few offline wake word detectors:

- OpenWakeWord
- Howl
- MicroWakeWord

These are all offline and suited to on-device implementations. You just need to replace the Picovoice part of this with a different VAD and wake word detector and you're all set.
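For what it's worth, here is a rough sketch of what the openWakeWord swap might look like where `porcupine.process(data)` currently sits. Treat the exact class and parameter names as my reading of its README rather than gospel, and note that openWakeWord prefers longer (~80 ms) frames than Porcupine's 512-sample ones. Cobra's end-of-speech role would need a similar swap (Silero VAD or webrtcvad are common offline choices):

```python
import numpy as np
from openwakeword.model import Model  # pip install openwakeword

# one of openWakeWord's bundled pretrained models; swap in your own model path if you train one
oww = Model(wakeword_models=["hey_jarvis"])

def wakeword_detected(data, threshold=0.5):
    """Stand-in for the `porcupine.process(data) >= 0` check.
    `data` is a frame of int16 audio at 16 kHz, as PvRecorder produces
    (you may need to buffer a few frames to reach openWakeWord's preferred size)."""
    scores = oww.predict(np.asarray(data, dtype=np.int16))
    return any(score >= threshold for score in scores.values())
```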
Also found this: https://github.com/secretsauceai/precise-wakeword-model-maker. Seems abandoned though.
Is Picovoice just used for training the wake word? They advertise their project as on-device, which implies it should work locally without calling home?
I'm trying to break into this space by making a simple voice assistant, with the goal of making an assistant more catered to ADHD idiosyncrasies.
My skills lie in app/backend/enterprise development. Building complex systems to handle commands, integrations (official or reverse engineered), & translations is easy for me. However, getting a handle on the building blocks in the data-science & ML space to make a "natural language" interface is proving to be the difficult part.
This project, for example, is deceptively simple once you open main.py, but there's zero chance I would have been able to figure it out on my own in a reasonable amount of time.
Although Picovoice is on-device, it does a key activation check, and I've found that in an entirely offline situation, if you don't first validate the key, it will hang, as my colleague found: https://github.com/Picovoice/porcupine/issues/579. This was a while ago though, so it might have changed since.
Picovoice provides a few of the parts in `main.py`. `recorder.read()` reads in audio samples, and then you call `porcupine.process(data)` to pass those samples into the Porcupine wake word detector (these lines). The process is as follows:
First the `recoder` object is started and the sampling buffers are set up:

```python
recoder.start()  # start the mic capture

max_window_in_secs = 3  # how long the audio window should be, in seconds
window_size = sample_rate * max_window_in_secs  # `sample_rate` is the number of samples captured from the microphone per second, so this is the total number of samples in one window of `max_window_in_secs` seconds
samples = deque(maxlen=(window_size * 6))  # a bounded queue of raw audio samples, holding at most `window_size * 6` of them
vad_samples = deque(maxlen=25)  # a bounded queue of the last 25 Voice Activity Detection results
is_recording = False  # starts as False when we first start the script
```
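Those two `deque(maxlen=...)` buffers are what give the script its sliding-window behaviour: once a deque is full, appending a new item silently drops the oldest one. A tiny standalone sketch (the numbers are made up purely for illustration):

```python
from collections import deque

# a hypothetical window of just 5 VAD probabilities, to show the rolling behaviour
vad_samples = deque(maxlen=5)

for prob in [0.9, 0.8, 0.85, 0.1, 0.05, 0.02, 0.01]:
    vad_samples.append(prob)
    print(list(vad_samples), "mean:", round(sum(vad_samples) / len(vad_samples), 2))

# once the deque holds 5 items, each new append pushes the oldest one out, so the
# mean tracks only the most recent probabilities -- which is how the real script
# decides whether speech is still happening
```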
This is where the 16,000 and the frame size of 512 come from:

```python
recoder = PvRecorder(device_index=-1, frame_length=512)  # -1 selects the default input device; each frame is 512 samples of audio captured at a 16,000 Hz sample rate
# frame length      = 512 samples per frame
# sample rate       = 16,000 samples per second
# frames per second = 16,000 / 512 ≈ 31.25
```
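To make those numbers concrete, here is the arithmetic spelled out (nothing Picovoice-specific, just the relationship between sample rate, frame length, and window size):

```python
sample_rate = 16_000        # samples per second delivered by PvRecorder
frame_length = 512          # samples returned by each recoder.read() call
max_window_in_secs = 3

frames_per_second = sample_rate / frame_length         # ~31.25 frames per second
frame_duration_ms = 1000 * frame_length / sample_rate  # 32 ms of audio per frame
window_size = sample_rate * max_window_in_secs         # 48,000 samples in a 3-second window
frames_per_window = window_size / frame_length         # ~93.75 frames to fill that window

print(frames_per_second, frame_duration_ms, window_size, frames_per_window)
```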
The main loop then reads frames from the mic and decides when to start and stop recording:

```python
while True:
    data = recoder.read()  # read one frame of audio from the mic buffer
    vad_prob = cobra.process(data)  # ask Cobra for the probability that this frame contains spoken voice
    vad_samples.append(vad_prob)  # keep the recent probabilities so we can average them against a threshold later to decide when to stop listening

    if porcupine.process(data) >= 0:  # Porcupine returns the index of the detected keyword, or -1 if no wake word was heard in this frame
        print(f"Detected wakeword")
        is_recording = True  # set the recording flag
        samples.clear()  # clear out the old queue of audio

    if is_recording:
        if (
            len(samples) < window_size  # the number of samples recorded so far is fewer than the window expects (`sample_rate * max_window_in_secs`, so 16,000 Hz * 3 seconds = 48,000), so keep filling the window
            or np.mean(vad_samples) >= vad_mean_probability_sensitivity  # or the average of the recent voice activity results (0.0 = no spoken voice, 1.0 = definitely spoken voice) says the model is still confident it's hearing speech, so keep listening
        ):
            samples.extend(data)  # keep adding the audio data to our samples queue
            print(f"listening - samples: {len(samples)}")
        else:
            print("is_recording: False")
            print(transcriber.transcribe(samples))  # transcriber is a class that runs Whisper on the audio we've captured from the microphone and prints the text (see below for more details)
            is_recording = False  # stop recording: the window is full and the voice activity detector no longer hears spoken voice
```
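The keep-listening test in that loop boils down to a single boolean expression. Here it is restated as a tiny function so the behaviour is easy to poke at; the names mirror the script, and the 0.4 threshold is just an assumed value for `vad_mean_probability_sensitivity`:

```python
import numpy as np

def should_keep_listening(samples_collected, window_size, recent_vad_probs, sensitivity=0.4):
    """Keep recording while the window isn't full yet, or while the average of
    the recent VAD probabilities still suggests someone is speaking."""
    return samples_collected < window_size or np.mean(recent_vad_probs) >= sensitivity

window_size = 16_000 * 3  # 48,000 samples for a 3-second window

print(should_keep_listening(10_000, window_size, [0.1, 0.05, 0.02]))   # True: window not full yet
print(should_keep_listening(48_000, window_size, [0.9, 0.85, 0.80]))   # True: window full, but speech continues
print(should_keep_listening(48_000, window_size, [0.05, 0.02, 0.01]))  # False: window full and silence -> transcribe
```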
Transcriber
This class uses Whisper to take an array of audio samples and transcribe the spoken words in them.
```python
import os
import time

import numpy as np
import whisper


class Transcriber:
    def __init__(self, model) -> None:
        self.model = whisper.load_model(model)  # load the Whisper model
        print("loading model finished")
        self.prompts = os.environ.get("WHISPER_INITIAL_PROMPT", "")
        print(f"Using prompts: {self.prompts}")

    def transcribe(self, frames):
        transcribe_start = time.time()  # note the start time
        """
        Take the frames from the audio (often there are two channels, left and right, so flatten
        them) and convert the int16 values into the float32 values the Whisper model expects.
        Then normalise the array: the int16 values range from -32768 to 32767, so dividing by
        32768 gives a range from -1 to 1, which is what Whisper expects.
        """
        samples = np.array(frames, np.int16).flatten().astype(np.float32) / 32768.0
        result = self.model.transcribe(
            audio=samples,
            language="en",
            fp16=False,
            initial_prompt=self.prompts,
        )  # run the Whisper model to detect the text in the audio
        transcribe_end = time.time()  # note the end time
        return result.get("text", "speech not detected")  # return the transcribed text
```
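As a quick sanity check of the normalisation step and of how the class gets used, here is a small sketch. It assumes the `Transcriber` class above is in scope and that the `openai-whisper` package and its "tiny" model are available; the silent dummy audio is just a placeholder for the real samples queue:

```python
import numpy as np

# the int16 -> float32 normalisation from transcribe(), in isolation:
frames = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)
normalised = frames.flatten().astype(np.float32) / 32768.0
print(normalised)  # [-1.0, -0.5, 0.0, 0.5, ~0.99997] -- the -1 to 1 range Whisper expects

# and the class itself ("tiny" is just the smallest Whisper model, chosen for a quick test):
transcriber = Transcriber("tiny")
one_second_of_silence = np.zeros(16_000, dtype=np.int16)  # dummy 16 kHz audio in place of the real mic samples
print(transcriber.transcribe(one_second_of_silence))
```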
Hopefully that makes things clearer for you. Let me know if you need more explanation on any parts. I should have left comments in the codebase!
Or without using their API or console?