dscripka / openWakeWord

An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity.
Apache License 2.0

Compute graph architecture #113

Open twitchyliquid64 opened 7 months ago

twitchyliquid64 commented 7 months ago

Heya! Can you point me at where the compute graph is for openWakeWord? I would love to see if I can recreate it with Rust / nalgebra and do away with the Python dependencies for inference entirely.

Love the project!

dscripka commented 7 months ago

Thanks, I'm glad you like the project!

Re-implementing everything in Rust sounds like it would be fun (and challenging)! To view the details and full graphs of the underlying models (e.g., the tflite or onnx versions), you should be able to use tools like Netron.

However, the models themselves don't have Python dependencies; it's just the audio state management and overall orchestration of openWakeWord that is written in Python. For example, here is a C++ port of openWakeWord that just needs the C++ onnxruntime.

twitchyliquid64 commented 7 months ago

Thanks! Having a quick scroll through the three models, it does appear that tract supports all those tensor ops! And they don't error when loading them!

I'm going to spend the afternoon seeing if I can get it working! :D

twitchyliquid64 commented 7 months ago

I'm making some good progress! But I'm having some trouble understanding the shape of the melspectrogram and feature buffers you have internally, which seem to store some state between model invocations.

I have a [1, 1280] array of samples coming into the melspectrogram model just fine, which is outputting a tensor of shape [1, 1, 5, 32] that I am squeezing to [5, 32].

Where I'm getting confused is what happens next. What is the actual input to the embedding model? There's a loop accumulating a buffer, then stepping over it in increments of 8 for some reason, and I get lost.

dscripka commented 7 months ago

It is a little confusing, mostly because I try to make things efficient by computing the melspectrograms chunk by chunk. Some more information that might help:

1) The melspectrogram is calculated for each chunk, which is then added to previous calculations to build up the melspectrogram over time: https://github.com/dscripka/openWakeWord/blob/fe57debecc64d891084a8831f981ffb7110a3b58/openwakeword/utils.py#L387

2) The input to the embedding model needs a melspectrogram with 76 time slices (see https://github.com/dscripka/openWakeWord/blob/fe57debecc64d891084a8831f981ffb7110a3b58/openwakeword/utils.py#L409). The increments of 8 are just so that the embedding model receives melspectrogram inputs a bit behind real-time, to avoid the edge effects of the melspectrogram calculation. The sketch below illustrates the buffering.
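Roughly, the buffering works like this (a Rust sketch, to match the port being discussed; the `MelBuffer` type and the constants are illustrative assumptions, not openWakeWord's actual API):

```rust
// Illustrative sketch of the streaming buffering described above; assumes a
// 16 kHz stream chunked into 1280-sample (80 ms) blocks, each producing 5 new
// 32-bin mel frames, with the embedding model consuming 76-frame windows that
// advance 8 frames (80 ms) at a time. None of these names come from openWakeWord.
const MEL_BINS: usize = 32;
const EMBEDDING_WINDOW: usize = 76; // mel frames per embedding-model input
const STEP: usize = 8; // frames between consecutive embedding windows

type MelFrame = [f32; MEL_BINS];

struct MelBuffer {
    frames: Vec<MelFrame>,    // grows as chunks arrive; trim it in real code
    next_window_start: usize, // index of the next 76-frame window to emit
}

impl MelBuffer {
    fn new() -> Self {
        Self { frames: Vec::new(), next_window_start: 0 }
    }

    /// Append the mel frames computed from one 1280-sample chunk.
    fn push_chunk(&mut self, new_frames: &[MelFrame]) {
        self.frames.extend_from_slice(new_frames);
    }

    /// Yield every complete 76-frame window, stepping by 8 frames, mirroring
    /// the "increments of 8" loop in openwakeword/utils.py.
    fn drain_windows(&mut self) -> Vec<Vec<MelFrame>> {
        let mut windows = Vec::new();
        while self.next_window_start + EMBEDDING_WINDOW <= self.frames.len() {
            let start = self.next_window_start;
            windows.push(self.frames[start..start + EMBEDDING_WINDOW].to_vec());
            self.next_window_start += STEP;
        }
        windows
    }
}
```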

twitchyliquid64 commented 7 months ago

Is what you call a 'chunk' an 80 ms block of 1280 samples? And is each of those 76 time slices one [5, 32] output from a melspectrogram run?

twitchyliquid64 commented 7 months ago

OMG, I think I got it working! I see the number printed in the terminal go to nearly 1 when I say 'hey rhasspy'!! :D

Da code

```rust
/// arecord -r 16000 -f S16_LE | cargo run
use circular_buffer::CircularBuffer;
use tract_onnx::prelude::*;

fn samples_from_stdin(
    stdin: &mut std::io::StdinLock<'_>,
) -> tract_ndarray::ArrayBase<tract_ndarray::OwnedRepr<f32>, tract_ndarray::Dim<[usize; 2]>> {
    tract_ndarray::Array2::from_shape_fn((1, 1280), |(_, _c)| {
        use std::io::Read;
        let mut buffer = [0u8; std::mem::size_of::<i16>()];
        stdin.read_exact(&mut buffer).unwrap();
        let sample = i16::from_le_bytes(buffer);
        sample as f32
    })
}

/// Number of spectograms we track, and the minimum input to the embedding model
const NUM_SPECTOGRAMS: usize = 76;
/// Number of embeddings we track, and the minimum input to the wakeword model
const NUM_EMBEDDINGS: usize = 16;

#[derive(Default, Clone, Debug)]
struct Melspectogram([f32; 32]);

impl Melspectogram {
    pub fn iter_mut(&mut self) -> core::slice::IterMut<'_, f32> {
        self.0.iter_mut()
    }
    pub fn iter(&self) -> core::slice::Iter<'_, f32> {
        self.0.iter()
    }
}

#[derive(Clone, Debug)]
struct Embedding([f32; 96]);

// derive(Default) doesnt work on arrays > 32, grrrr
impl Default for Embedding {
    fn default() -> Self {
        Self([0f32; 96])
    }
}

impl Embedding {
    pub fn iter(&self) -> core::slice::Iter<'_, f32> {
        self.0.iter()
    }
}

/// arecord -r 16000 -f S16_LE | cargo run
fn main() -> TractResult<()> {
    let spec_model = tract_onnx::onnx()
        // load the model
        .model_for_path("melspectrogram.onnx")?
        .into_optimized()?
        .into_runnable()?;
    let embedding_model = tract_onnx::onnx()
        // load the model
        .model_for_path("embedding_model.onnx")?
        .with_input_fact(0, f32::fact([1, 76, 32, 1]).into())
        .unwrap()
        .into_optimized()?
        .into_runnable()?;
    let final_model = tract_onnx::onnx()
        // load the model
        .model_for_path("hey_rhasspy_v0.1.onnx")?
        .into_optimized()?
        .into_runnable()?;

    let mut spectograms = CircularBuffer::<NUM_SPECTOGRAMS, Melspectogram>::new();
    let mut embeddings = CircularBuffer::<NUM_EMBEDDINGS, Embedding>::new();
    let mut stdin = std::io::stdin().lock();

    for _ in 0..(4 * 16000 / 1280) {
        let samples: Tensor = samples_from_stdin(&mut stdin).into();

        // run the spectogram on the input
        let out = spec_model.run(tvec!(samples.into()))?.remove(0);

        // so the spectogram output is [1, 1, 5, 32] but we only care about each 32-float
        // sequence, each of which represents a spectogram. Lets iterate in those chunks
        // and add it to our buffer.
        for chunk in out.as_slice::<f32>().unwrap().chunks(32) {
            let mut out = Melspectogram::default();
            chunk.into_iter().zip(out.iter_mut()).for_each(|(input, output)| {
                // Don't h8 this is what openWakeWords does! https://github.com/dscripka/openWakeWord/blob/main/openwakeword/utils.py#L180
                // ¯\_(ツ)_/¯ ¯\_(ツ)_/¯ ¯\_(ツ)_/¯ ¯\_(ツ)_/¯
                *output = *input / 10.0 + 2.0;
            });
            spectograms.push_back(out);
        }

        // Don't compute the embeddings unless we have a full set of input (76 spectograms)
        // for the model
        if !spectograms.is_full() {
            continue;
        }

        // Build a tensor that will be the input to the embedding model, which is [?, 76, 32, 1].
        // I presume that means [batch_size=1, num_melspectograms=76, num_spect_bins=32, ?].
        let embedding_input: Tensor =
            tract_ndarray::Array::<f32, tract_ndarray::Dim<[usize; 1]>>::from_iter(
                spectograms.iter().map(|spect| spect.iter()).flatten().copied(),
            )
            .into_shape((1, 76, 32, 1))?
            .into();
        // println!("model: {:?}", embedding_model.model());

        // Compute the embedding for this chunk of spectograms.
        let out = embedding_model.run(tvec!(embedding_input.into()))?.remove(0);

        // so the embedding output is [1, 1, 1, 96], lets collect that into an Embedding
        // struct and push it into our embedding buffer.
        let mut embedding = Embedding::default();
        embedding.0.clone_from_slice(out.as_slice::<f32>().unwrap());
        embeddings.push_back(embedding);

        // Don't compute the features unless we have a full set of input (16 embeddings)
        if !embeddings.is_full() {
            continue;
        }

        // Build a tensor that will be the input to the feature model, which is [1, 16, 96].
        let feature_input: Tensor =
            tract_ndarray::Array::<f32, tract_ndarray::Dim<[usize; 1]>>::from_iter(
                embeddings.iter().map(|spect| spect.iter()).flatten().copied(),
            )
            .into_shape((1, 16, 96))?
            .into();
        let out = final_model.run(tvec!(feature_input.into()))?.remove(0);
        println!("{:?}", out);
    }
    Ok(())
}
```

I haven't done the increments-of-8 thing; is that essential?

I only start running the models once the buffers are full, and I redo inference every time there's a new chunk of data. Even doing this, rather than only every 4 or 8 new embedding chunks, I'm not able to saturate a CPU core, even with a debug binary.

Any thoughts / suggestions? or should I tidy this up and put it somewhere?

twitchyliquid64 commented 7 months ago

Gonna assume the 8-step isn't essential and give packaging this into a Rust binary with threading etc. a go this afternoon!

twitchyliquid64 commented 7 months ago

https://github.com/twitchyliquid64/oww-rust-core :)

dscripka commented 7 months ago

That looks like a really great start, very nice!

Looking through the code, I think there are two improvements that should make the results identical to openWakeWord.

1) Implementing the streaming melspectrogram approach (as shown here) with an audio buffer. This will avoid some of the edge effects during the melspectrogram calculation.

2) Keep the step size of 8 on the spectrogram. I forgot some important details in my previous explanation: the step size of 8 is because the embedding model is intended to receive inputs every 80 ms, while the melspectrogram operates on a step size of 10 ms. So you need to step over the spectrogram in increments of 8 to match the expected 80 ms interval (see the sketch below).
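Spelled out as arithmetic (assuming the standard 10 ms hop, i.e. 160 samples at 16 kHz; these constants are illustrative, not copied from the openWakeWord source):

```rust
// The timing relationship behind the step size of 8, assuming a 16 kHz sample
// rate and a 10 ms melspectrogram hop (160 samples).
const SAMPLE_RATE: usize = 16_000;
const CHUNK_SAMPLES: usize = 1280; // one audio chunk
const MEL_HOP_SAMPLES: usize = 160; // one mel frame every 10 ms

fn main() {
    let chunk_ms = CHUNK_SAMPLES * 1000 / SAMPLE_RATE; // 80 ms
    let hop_ms = MEL_HOP_SAMPLES * 1000 / SAMPLE_RATE; // 10 ms
    let step = chunk_ms / hop_ms; // 8 mel frames per 80 ms chunk
    println!("{chunk_ms} ms chunk / {hop_ms} ms hop = step of {step}");
}
```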

Even without these changes the model likely still works (as you are seeing), but skipping them may degrade performance compared to the openWakeWord implementation.

twitchyliquid64 commented 7 months ago

On point 1: I run buffers of 1280 samples through the spectrogram at a time, just without the 480-sample delay. The edge effects you mention sound like spectral leakage, so if that's the problem, perhaps we can apply a windowing function to the raw samples instead? I'm happy to experiment.
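For concreteness, a plain Hann window over one chunk of raw samples would look something like this (just a sketch, not something the port currently does):

```rust
// Sketch of windowing raw samples before the spectrogram model, as floated
// above; not part of the current port. (As noted in the reply below, the
// melspectrogram model already applies a Hann window internally.)
fn hann_window(samples: &mut [f32]) {
    let n = samples.len(); // e.g. 1280 for one chunk
    for (i, s) in samples.iter_mut().enumerate() {
        let w = 0.5
            * (1.0 - (2.0 * std::f32::consts::PI * i as f32 / (n as f32 - 1.0)).cos());
        *s *= w;
    }
}
```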

On point 2: Makes sense, will do that next!

I've had surprisingly good results despite this!

twitchyliquid64 commented 7 months ago

On point 2: I think I'm actually getting better results using a step size of 4-6? I appreciate that this overlaps the input spectrograms from one embedding to the next, but the activation seems to be stronger, more sustained, and more distinctive.

dscripka commented 7 months ago

Point 1: The spectrogram calculation already includes a Hann window by default, but it's possible that other windows may work better. My original motivation for adding the buffers on both sides of the 1280 sample chunk was to ensure that the results of the streaming melspectrogram were as close as possible to the non-streaming results. But in practice, this may not be as important as I originally thought.

Point 2: Yes, reducing the step size for the embeddings lower than 8 could improve performance in some cases, as you are increasing the time resolution of the features. Note that this does slightly decrease efficiency, and also reduces the size of the input window in time, as each step is effectively shorter. Similar to point 1, this may have a neutral, positive, or negative impact on performance of a given model. Perhaps this could be a configurable parameter?
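To put rough numbers on the window-size tradeoff (my arithmetic, using the shapes from this thread; a sketch, not code from either implementation):

```rust
// 16 embeddings feed the wakeword model, each summarizing a 76-frame (760 ms)
// melspectrogram window, with consecutive embeddings offset by the step size
// (measured in 10 ms mel frames). Total audio covered by one model input:
fn window_span_ms(step_frames: usize) -> usize {
    const EMBEDDINGS: usize = 16;
    const EMBEDDING_WINDOW_MS: usize = 760; // 76 frames x 10 ms
    EMBEDDING_WINDOW_MS + (EMBEDDINGS - 1) * step_frames * 10
}

fn main() {
    println!("step 8 -> {} ms of audio", window_span_ms(8)); // 1960 ms
    println!("step 4 -> {} ms of audio", window_span_ms(4)); // 1360 ms
}
```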

twitchyliquid64 commented 7 months ago

Interesting that the spectrogram model does a Hann window; I would have thought that would be done a layer up so it could be overlapped. I experimented with a Hamming window with 50% overlap, but it didn't improve results (and didn't seem to reduce them either). I'll rip that out and move back to your approach.

Point 2: I'll make it configurable! I've come to enjoy a step size of 4.