dscripka / openWakeWord

An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity.
Apache License 2.0

"Computer" as wake word #71

Closed kthhrv closed 10 months ago

kthhrv commented 1 year ago

Star Trek, enough said :-)

StuartIanNaylor commented 1 year ago

Computer (kuhm·pyoo·tuh) is a good KW as it's got at least 3 syllables, similar to Alexa. They're good strong phones as well. The only downside is that nowadays "computer" is a common word, so you will likely get that 'She who shall not be named' syndrome Alexa users get. If you use a common word, the 2nd stage (user-specific models) should be a great way to stop that happening for many households: https://github.com/dscripka/openWakeWord#user-specific-models

at the cost of making the openWakeWord system less likely to respond to new voices

It's not a cost at all; in fact it's more secure, and voices can be added at any time.
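
(For readers following along: the "user-specific models" feature linked above trains a small second-stage verifier on a handful of a user's own recordings and runs it after the main wake word model. The sketch below shows roughly how that could look; the function and argument names (train_custom_verifier, custom_verifier_models, custom_verifier_threshold) follow the project's documentation as best as recalled and may differ between openWakeWord versions, so treat it as an outline rather than a verified recipe.)

```python
# Rough sketch of the user-specific (custom verifier) workflow described in the
# openWakeWord README. Names and argument spellings may differ by version.
import openwakeword
from openwakeword.model import Model

# 1) Train a small verifier from a few recordings of the target speaker(s)
#    saying the wake word, plus some negative clips of other speech.
openwakeword.train_custom_verifier(
    positive_reference_clips=["alice_computer_1.wav", "alice_computer_2.wav"],  # placeholder files
    negative_reference_clips=["alice_other_speech_1.wav", "alice_other_speech_2.wav"],
    output_path="computer_verifier.pkl",
    model_name="computer.onnx",  # the base wake word model the verifier sits behind
)

# 2) Load the base model with the verifier attached; activations below the
#    verifier threshold are rejected, which suppresses TV/radio "computer" hits.
model = Model(
    wakeword_models=["computer.onnx"],
    custom_verifier_models={"computer": "computer_verifier.pkl"},
    custom_verifier_threshold=0.3,
)
```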

feuler commented 1 year ago

I've tried twice to train a model for target_word "Computer" with the "automatic_model_training_simple.ipynb" (https://colab.research.google.com/drive/1q1oe2zOyZp7UsB3jJiQ1IFn8z5YfjwEb?usp=sharing)

But the resulting models don't recognize it when I say "computer" during tests with "detect_from_microphone.py". The first try was with default settings, the second with "number_of_examples: 40000; number_of_training_steps: 25000".

Also... when I try to load the onnx file with JavaScript onnx.min.js I get the error: Error: input tensor[0] check failed: expected shape '[1,16,96]' but got [3,1,40,61]
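
(That shape error is consistent with feeding the wake word .onnx raw audio or melspectrogram frames: the exported file is only the final classifier head, and it expects the [1, 16, 96] speech-embedding features that openWakeWord's shared preprocessing models normally produce. Below is a quick way to check what a given model wants, sketched in Python with onnxruntime; the model path is a placeholder.)

```python
# Minimal sketch: inspect the classifier's expected input and run it on dummy
# features. The wake word .onnx is only the final head; it consumes [1, 16, 96]
# speech-embedding features rather than raw PCM, which is why other shapes fail.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("computer_model.onnx")  # placeholder path
inp = session.get_inputs()[0]
print(inp.name, inp.shape)  # expect something like [1, 16, 96]

features = np.zeros((1, 16, 96), dtype=np.float32)  # dummy embedding features
score = session.run(None, {inp.name: features})
print(score)
```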

StuartIanNaylor commented 1 year ago

Did you try upping the number of examples and training steps? I think the defaults might be a bit optimistic.

"number_of_examples controls how many examples of your wakeword are generated. The default (1,000) usually produces a good model, but between 30,000 and 50,000 is often the best."

That is a considerable difference; maybe the number of training steps should be scaled by a similar ratio to get the 'best'?

feuler commented 1 year ago

@StuartIanNaylor Yes. As mentioned in my last post, the second time I increased the numbers to "number_of_examples: 40000; number_of_training_steps: 25000", which didn't improve wake word detection.

StuartIanNaylor commented 1 year ago

Yeah, that makes two of us. With 'Computer' I had the same experience, whilst the premade models seem to work great. I hacked the Jupyter notebook out and ran it locally, not on Colab, so I was just wondering if there was any difference.

feuler commented 1 year ago

Found what I needed... https://github.com/fwartner/home-assistant-wakewords-collection/tree/main/computer

Used training parameters:

number_of_examples = 25000
number_of_training_steps = 500000
false_activation_penalty = 5000
target_words = ["khom-pioodr", "khom-pioodr!", "khom-pioodr?", "khom-pioota", "khom-pioota!", "khom-pyoota?", "hey computer", "hey computer?", "hey computer!"]

I converted computer_v2.tflite from the repo to onnx. It then worked with openWakeWord and has near perfect detection.

StuartIanNaylor commented 1 year ago

It seems like they added real data to the training to get it accurate. Home Assistant has given it such sales-speak that what they show doesn't seem to have much reality, and I'm not sure why, as users will experience it. I like what dscripka has done, it's great, but wow.

1437 files added to positive_train:

411*3 files (computer) from https://github.com/Picovoice/wake-word-benchmark

58*3 files (computer/en) from https://github.com/MycroftAI/Precise-Community-Data

10*3 files (heycomputer/en) from https://github.com/MycroftAI/Precise-Community-Data

I am pretty sure MLCommons has a load of 'computer' single-word samples (https://mlcommons.org/en/multilingual-spoken-words/), and they also have a load of 'hey' if you ever just want to concatenate the two. That is actually a bonus, as the upstream ASR is often idle, so you could likely do on-device training of the collated data and ship it OTA to devices if they are satellites. It's a shame they didn't do this for dscripka, as users could be opting in and collating a goldmine of data for him.
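
(If anyone wants to try that concatenation idea, a rough sketch using pydub is below; the file names are placeholders, and MSWC clips would need converting to 16 kHz mono WAV first.)

```python
# Rough sketch: concatenate a "hey" clip and a "computer" clip into a single
# "hey computer" training example. File names are placeholders; real MSWC clips
# should be converted to 16 kHz mono to match openWakeWord's expected input.
from pydub import AudioSegment

hey = AudioSegment.from_wav("hey_0001.wav").set_frame_rate(16000).set_channels(1)
computer = AudioSegment.from_wav("computer_0001.wav").set_frame_rate(16000).set_channels(1)

# A short pause between the words sounds more natural than butting them together.
pause = AudioSegment.silent(duration=100, frame_rate=16000)  # 100 ms gap
combined = hey + pause + computer
combined.export("hey_computer_0001.wav", format="wav")
```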

dscripka commented 1 year ago

I'm glad you found a "computer" model from the Home Assistant community that seems to be working!

Adding real clips of the wakeword absolutely does increase performance, so if such data is available for a given wakeword it should be included in the training clips as a best practice. But the synthetic-data-only training process should still produce a reasonable baseline model.

I'm not sure why the Colab Notebook (automatic_model_training_simple.ipynb) couldn't produce one. I haven't had issues with this in the past, but I will attempt to reproduce the failure and determine if there is something wrong when using this particular word.

StuartIanNaylor commented 1 year ago

Do you have any plans to start capturing KW audio from the rolling window of the KWS? I have done this before, and with a decent margin it is pretty easy to do. You could automate it and allow users to retrain with their own captured data.

dscripka commented 1 year ago

After doing some testing, I think the issue is related to how the Piper TTS model is pronouncing certain single words, similar to another open issue. When I was testing in the notebook I noticed that the hard "k" was not consistently being produced by the TTS model, which means that the trained model might not respond well to the correct pronunciation. There is an open issue in the Piper repo to address this.

In the meantime, a work-around is to try other phonetic spellings of the target word (e.g., khom-puter as was done in the linked model above).
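
(Purely as an illustration of that work-around: the community model linked earlier effectively trained on several spellings at once, along the lines of the list below. The variable name is hypothetical; the simple notebook exposes a single target word field, while other training configs may call it target_phrase, so check the notebook you are using.)

```python
# Illustrative only: alternate spellings that nudge Piper TTS towards the hard "k"
# and both common endings of "computer". The config key is an assumption; the
# spellings are taken from the Home Assistant community model linked above.
target_phrases = [
    "computer",
    "khom-pioodr",
    "khom-pioota",
    "hey computer",
]
```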

dscripka commented 1 year ago

Do you have any plans to start capturing KW audio from the rolling window of the KWS? I have done this before, and with a decent margin it is pretty easy to do. You could automate it and allow users to retrain with their own captured data.

I have done some initial experimentation with this type of data capture and online learning when using the custom verifier models, yes. The balance appears to be difficult to optimize, as too many captures produce a dataset that is too noisy, while too few labels do not increase overall model performance substantially. I hope to return to this topic soon and continue to make progress, as it's likely one of the most promising ways to increase practical deployment performance.
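
(To make that idea concrete, here is a rough sketch, not the project's implementation, of capturing from the rolling window: keep a few seconds of trailing audio in a ring buffer and save it whenever a model fires with a comfortable margin, so the clips can be reviewed later and fed into retraining or a custom verifier. The model path, threshold, buffer length, and cooldown are all assumptions, and the Model constructor argument name may differ by openWakeWord version.)

```python
# Rough sketch (not openWakeWord's implementation): save the last few seconds of
# audio whenever a wake word model fires above a high-confidence margin, building
# a personal dataset for later retraining or verifier training.
# Assumes a 16 kHz, 16-bit mono microphone and an existing model file.
import collections
import time
import wave

import numpy as np
import pyaudio
from openwakeword.model import Model

CHUNK = 1280                 # 80 ms at 16 kHz, the frame size openWakeWord consumes
BUFFER_SECONDS = 3           # trailing audio to keep
CAPTURE_THRESHOLD = 0.7      # only keep clearly confident activations (assumption)
COOLDOWN_S = 2               # avoid saving many overlapping clips per activation

model = Model(wakeword_models=["computer_converted.onnx"])  # placeholder model path
ring = collections.deque(maxlen=int(16000 * BUFFER_SECONDS / CHUNK))
last_capture = 0.0

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=16000,
                 input=True, frames_per_buffer=CHUNK)

while True:
    frame = np.frombuffer(stream.read(CHUNK), dtype=np.int16)
    ring.append(frame)
    scores = model.predict(frame)  # dict of {model_name: score}
    if max(scores.values()) >= CAPTURE_THRESHOLD and time.time() - last_capture > COOLDOWN_S:
        last_capture = time.time()
        clip = np.concatenate(list(ring))
        with wave.open(f"capture_{int(last_capture)}.wav", "wb") as f:
            f.setnchannels(1)
            f.setsampwidth(2)
            f.setframerate(16000)
            f.writeframes(clip.tobytes())
```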

StuartIanNaylor commented 1 year ago

Also, I got the wrong end of the stick about training language models, as I presume training a Piper model is relatively pain-free.

"More details will be available from our paper: James Lin, Kevin Kilgour, Dominik Roblek, Matthew Sharifi, 'Training Keyword Spotters with Limited and Synthesized Speech Data.' The embedding model converts a stream of audio into a stream of 96-dimensional feature vectors, one every 80 ms. To ensure that the resulting embedding is useful for arbitrary sets of keywords, we took 5000 keywords (most of them are actually 2-3 words long) and split them into random groups of 40 keywords. The resulting 125 groups of keywords are used to train 125 keyword spotting models with shared weights for the embedding model part (see Figure 1). We used roughly 200 million 2-second audio clips from YouTube for training, of which 100 million contained the target keywords and the other 100 million were used as non-target examples. The embedding model is trained using TensorFlow on 20 GPUs for 2 days and is available on TensorFlow Hub (https://tfhub.dev/google/speech_embedding/1) for reuse."

I am presuming that, because of the nature of the datasets they had and because they specifically picked which KWs to include and exclude, the embedding model is multilingual but heavily biased towards US English. Each language has a sonority hierarchy, and language families could likely be treated as one, so the current embedding model could be split into language-family embedding models; that allows collating smaller datasets of those specific languages while also including some languages that may only have sparse datasets. Or you do per-language embedding models, which will likely be hard due to the sparse datasets of some languages. Because of the way the embedding model was trained, it is also biased towards certain KWs (as well as language) due to the way the data was picked. The dataset chosen likely has holes in it for certain language/accent/word combinations; I'm not sure how much accuracy would be gained, but it likely would help, and, as they did, targeting specific KWs and keeping to those KWs could also improve accuracy.

joshuaboniface commented 12 months ago

@feuler

I converted computer_v2.tflite from the repo to onnx. It then worked with openWakeWord and has near perfect detection.

How did you do this? tflite2onnx seems to just throw an AssertionError when trying to convert these, and they don't seem to work at all as-is.

feuler commented 12 months ago

@joshuaboniface

Used: https://github.com/onnx/tensorflow-onnx

command: python -m tf2onnx.convert --opset 7 --tflite /path/to/computer_v2.tflite --output /path/computer_converted.onnx
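
(For anyone following along, a quick way to sanity-check the converted file is to load it with openWakeWord and score a known recording frame by frame; a minimal sketch is below. Paths are placeholders and the Model constructor argument name may differ between openWakeWord versions; detect_from_microphone.py in the repo does the same thing against a live microphone.)

```python
# Minimal sketch: load the converted ONNX model and score a short 16 kHz, 16-bit
# mono clip in 80 ms frames. Paths are placeholders.
import numpy as np
from scipy.io import wavfile
from openwakeword.model import Model

model = Model(wakeword_models=["computer_converted.onnx"])

rate, audio = wavfile.read("test_computer.wav")   # expects 16 kHz, 16-bit mono
for i in range(0, len(audio) - 1280, 1280):       # 1280 samples = 80 ms at 16 kHz
    scores = model.predict(audio[i:i + 1280])     # dict of {model_name: score}
    if max(scores.values()) > 0.5:
        print(f"Activation at {i / rate:.2f}s: {scores}")
```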

sujitvasanth commented 10 months ago

@kthhrv https://www.youtube.com/watch?v=LkqiDu1BQXY

ITHealer commented 10 months ago

Do you have a notebook file that contains the entire process of creating a wake word model? I want to customize it for my own wake word. Thanks!

ITHealer commented 10 months ago

After running the train file for the word "marvin", when streaming and speaking the words in the word list, it still detects.

dscripka commented 10 months ago

@ITHealer, if you are having an issue with the performance of a model after training, can you create a separate issue and give more context about the problem?