dscripka / openWakeWord

An open-source audio wake word (or phrase) detection framework with a focus on performance and simplicity.
Apache License 2.0

generate negative dataset for alexa #130

Open sanjuktasr opened 4 months ago

sanjuktasr commented 4 months ago

The model was trained on approximately ~31,000 hours of negative data, with the approximate composition shown below:

~10,000 hours of noise, music, and speech from the [ACAV100M dataset](https://acav100m.github.io/)
~10,000 hours from the [Common Voice 11 dataset](https://commonvoice.mozilla.org/en/datasets), representing multiple languages
~10,000 hours of podcasts downloaded from the [Podcastindex database](https://podcastindex.org/)
~1,000 hours of music from the [Free Music Archive dataset](https://github.com/mdeff/fma)

In addition to the above, the total negative dataset also includes reverberated versions of the ACAV100M dataset (using the simulated room impulse responses from the BIRD Impulse Response Dataset). Currently, adversarial STT generations were not added to the training data for this model.

How can I reproduce this negative dataset? Also, roughly how much disk space (in GB) will it require?

dscripka commented 4 months ago

Reproducing this dataset would require manually downloading the data from each of those sources, requiring terabytes of disk space and reasonably powerful hardware to process and prepare this data for training openWakeWord models.
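As a rough back-of-envelope check (my own estimate, not a figure from the maintainers), the raw size of uncompressed 16 kHz mono 16-bit PCM audio, which is the input format openWakeWord works with, can be computed directly:

```python
BYTES_PER_SAMPLE = 2      # 16-bit PCM
SAMPLE_RATE = 16000       # 16 kHz, openWakeWord's expected rate
SECONDS_PER_HOUR = 3600

def raw_audio_gb(hours):
    """Uncompressed size in GB of `hours` of 16 kHz mono 16-bit audio."""
    return hours * SECONDS_PER_HOUR * SAMPLE_RATE * BYTES_PER_SAMPLE / 1e9

print(f"{raw_audio_gb(31_000):.0f} GB")  # full ~31,000 h negative set -> ~3571 GB
print(f"{raw_audio_gb(2_000):.0f} GB")   # a ~2,000 h subset -> ~230 GB
```

The actual on-disk footprint will differ: compressed source formats (e.g. MP3 podcasts) are smaller, while reverberated copies and pre-computed features add overhead on top of this.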

I wouldn't recommend this unless you are interested in conducting your own extensive experiments. Continued testing since the release of openWakeWord has led me to believe that this volume of data may not be necessary for training well-performing models.

sanjuktasr commented 4 months ago

Can you give me an idea of the datasets (positive and negative) needed for a decently performing model, and how to design the experiment?

dscripka commented 4 months ago

The automatic model training notebook works reasonably well with ~10,000 positive samples and ~2,000 hours of negative data from the ACAV100M dataset. Of course, certain wake word phrases may require more or less data.

I would recommend starting there and getting familiar with train.py for any experiments you wish to perform.
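For testing a trained model, openWakeWord scores streaming audio in 1280-sample chunks (80 ms at 16 kHz), per the project README. A minimal framing sketch (the `frame_audio` helper and `my_model.onnx` path are my own illustrative names, not part of the library):

```python
FRAME_SAMPLES = 1280  # 80 ms at 16 kHz, the chunk size openWakeWord expects per prediction

def frame_audio(samples, frame_len=FRAME_SAMPLES):
    """Split a sequence of 16-bit samples into fixed-length frames, dropping any short tail."""
    return [samples[i:i + frame_len] for i in range(0, len(samples) - frame_len + 1, frame_len)]

# One second of 16 kHz audio yields 12 full frames (16000 // 1280 = 12).
frames = frame_audio([0] * 16000)
print(len(frames))  # 12

# Each frame would then be scored by the trained model, e.g.:
#   from openwakeword.model import Model
#   model = Model(wakeword_models=["my_model.onnx"])  # hypothetical model path
#   scores = [model.predict(f) for f in frames]
```

If you feed chunks of a different length or sample rate, the reported probabilities will be meaningless, which is one common cause of uniformly low scores.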

sanjuktasr commented 4 months ago

While running the test code on audio files, I always get probabilities in the 10^-2 to 10^-3 range for the audio chunks; no input produces a high probability. I used a 32-layer DNN. What could the error be? Is it because I gave too much negative data and too little positive data? I am using validation_set_features.npy.

eugene-orlov-sm commented 3 months ago

I had such small probabilities when the wav data for testing was encoded/decoded incorrectly, i.e. a serious bug. I would suggest you start with https://github.com/dscripka/openWakeWord/blob/main/notebooks/automatic_model_training.ipynb and/or https://github.com/dscripka/openWakeWord/blob/dc5a23421821dd3ebf49f17e949fe95e6e5320cf/notebooks/training_models.ipynb to be sure that the entire process of training and testing works fine.
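To rule out encoding problems, a minimal format check with the standard-library `wave` module can help, assuming the usual openWakeWord input format of 16 kHz, 16-bit, mono PCM (the `validate_wav` function and `sanity_check.wav` filename are my own illustrative names):

```python
import wave

EXPECTED_RATE = 16000  # 16 kHz
EXPECTED_WIDTH = 2     # 16-bit PCM
EXPECTED_CHANNELS = 1  # mono

def validate_wav(path):
    """Read a wav header and flag whether it matches the expected format."""
    with wave.open(path, "rb") as wf:
        info = {
            "rate": wf.getframerate(),
            "width": wf.getsampwidth(),
            "channels": wf.getnchannels(),
            "frames": wf.getnframes(),
        }
    info["ok"] = (
        info["rate"] == EXPECTED_RATE
        and info["width"] == EXPECTED_WIDTH
        and info["channels"] == EXPECTED_CHANNELS
    )
    return info

# Sanity check: write one second of silence in the expected format, then validate it.
with wave.open("sanity_check.wav", "wb") as wf:
    wf.setnchannels(EXPECTED_CHANNELS)
    wf.setsampwidth(EXPECTED_WIDTH)
    wf.setframerate(EXPECTED_RATE)
    wf.writeframes(b"\x00\x00" * EXPECTED_RATE)

print(validate_wav("sanity_check.wav"))
```

Running this check on your actual test files before feature extraction quickly catches resampled, stereo, or float-encoded audio that would otherwise silently produce near-zero scores.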