sanjuktasr opened 4 months ago
Reproducing this dataset would require manually downloading the data from each of those sources, requiring terabytes of disk space and reasonably powerful hardware to process and prepare this data for training openWakeWord models.
I wouldn't recommend this unless you are interested in conducting your own extensive experiments. Continued testing since the release of openWakeWord has led me to believe that this volume of data may not be necessary for training well-performing models.
Can you give me an idea of the datasets (positive and negative) needed for a decent-performing model, and how to design the experiment?
The automatic model training notebook works reasonably well with ~10,000 positive samples and ~2,000 hours of negative data from the ACAV100M dataset. Of course, certain wake word phrases may require more or less data.
I would recommend starting there and getting familiar with train.py for any experiments you wish to perform.
While running test code on audio, I am always getting probabilities in the 10^-2 to 10^-3 range for the audio chunks; no data gives a high probability. I have used a 32-layer DNN. What could be the possible error? Is it because I have given too much negative data and too little positive? I am using validation_set_features.npy.
I had such small probabilities when the wav data used for testing was encoded/decoded incorrectly, i.e. a serious bug. I would suggest starting with https://github.com/dscripka/openWakeWord/blob/main/notebooks/automatic_model_training.ipynb and/or https://github.com/dscripka/openWakeWord/blob/dc5a23421821dd3ebf49f17e949fe95e6e5320cf/notebooks/training_models.ipynb to be sure that the entire process of training and testing works correctly.
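One quick way to rule out the encoding/decoding problem described above is to verify that your test wav files are in the format openWakeWord expects (16 kHz, 16-bit PCM, mono) and are not near-silent. The helper below is my own illustration, not part of the openWakeWord API; the silence threshold is an arbitrary assumption.

```python
# Sanity-check that a test wav file matches the audio format openWakeWord
# expects: 16 kHz sample rate, 16-bit PCM samples, single channel.
import wave

import numpy as np


def check_wav(path):
    """Return the audio as int16 samples, raising if the format is wrong."""
    with wave.open(path, "rb") as f:
        if f.getframerate() != 16000:
            raise ValueError(f"expected 16 kHz, got {f.getframerate()} Hz")
        if f.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        if f.getnchannels() != 1:
            raise ValueError("expected mono audio")
        audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)
    # Near-silent audio (e.g. from a broken decode step) is another common
    # cause of uniformly tiny probabilities; the 500 threshold is arbitrary.
    if np.abs(audio).max() < 500:
        raise ValueError("audio is nearly silent; check the decoding step")
    return audio
```

If this check passes on your test clips and the probabilities are still uniformly tiny, the problem is more likely in the training data balance or the features themselves.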
The model was trained on approximately 31,000 hours of negative data, with the approximate composition shown below:
In addition to the above, the total negative dataset also includes reverberated versions of the ACAV100M data (again using the simulated room impulse responses from the BIRD Impulse Response Dataset). Adversarial STT generations were not added to the training data for this model.

How can the negative dataset be reproduced? Also, roughly how much disk space (in GB) will it require?
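The reverberation step mentioned above can be sketched as convolving each negative clip with a simulated room impulse response (RIR) and rescaling to avoid clipping. The exact augmentation pipeline used for openWakeWord may differ; the function below only illustrates the idea and is not taken from the repository.

```python
# Minimal sketch: produce a reverberated version of a clip by convolving
# it with a room impulse response, e.g. one drawn from the BIRD dataset.
import numpy as np


def reverberate(audio, rir):
    """Convolve int16 audio with an impulse response; return int16 audio."""
    x = audio.astype(np.float32)
    h = rir.astype(np.float32)
    h /= np.abs(h).max()  # normalize so the strongest reflection has unit gain
    wet = np.convolve(x, h)[: len(x)]  # keep the original clip length
    # Rescale so the reverberated clip peaks at the same level as the input,
    # preventing int16 overflow after loud reflections are summed.
    wet *= np.abs(x).max() / max(np.abs(wet).max(), 1e-9)
    return wet.astype(np.int16)
```

Applied across the ACAV100M clips, this roughly doubles the negative data volume, which is consistent with the question about disk space: expect the reverberated copies to take about as much space as the originals.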