Alexa dataset overlap with Kaggle, plus training question.

Picovoice / wake-word-benchmark

wake word engine benchmark framework

https://picovoice.ai/

Apache License 2.0

Alexa dataset overlap with Kaggle, plus training question. #11

Closed (ilyakava closed 2 years ago)

ilyakava commented 3 years ago

Thanks for this awesome benchmarker! I'd like to integrate an additional wakeword detector in this benchmark (gh link), starting with the alexa dataset.

Listening to the flac files in the audio/alexa folder, it sounds like these recordings are the same as in the Kaggle dataset. I do notice, however, that you have 329 recordings while Kaggle provides 369 wavs for download. Is this because you trained Porcupine, and maybe the other 2 engines, on the other 40? Or did you exclude these for noise/outlier reasons?

Is it correct to assume that Porcupine was not trained on the audio in the audio/alexa folder?

kenarsa commented 3 years ago

You are welcome. This is great. We definitely welcome that. Keep in mind that we need to be able to add the new engines to the figures at the end. So we need (1) ROC for all phrases and (2) runtime metric on Raspberry Pi 3.

The Kaggle dataset was uploaded by us as well. I'm not sure why there is a discrepancy of 40, as it was a long time ago. They were either bad recordings, invalid ones, or possibly duplicates.

The data in the audio folders is all for testing and has not been used for training. Hope it helps.
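To make the ROC requirement above concrete, here is a minimal sketch of how the curve points could be computed for one phrase. It assumes a hypothetical setup in which the engine yields one confidence score per positive utterance and per negative segment; the function name `roc_points` and the score-based interface are illustrative assumptions, not this repository's API.

```python
def roc_points(positive_scores, negative_scores, negative_hours, thresholds):
    """Return one (threshold, miss_rate, false_alarms_per_hour) tuple per threshold.

    positive_scores: hypothetical engine scores for utterances containing the wake word
    negative_scores: hypothetical engine scores for segments without the wake word
    negative_hours:  total duration of the negative audio, in hours
    """
    points = []
    for t in thresholds:
        # A positive utterance scoring below the threshold is a miss.
        misses = sum(1 for s in positive_scores if s < t)
        # A negative segment scoring at or above the threshold is a false alarm.
        false_alarms = sum(1 for s in negative_scores if s >= t)
        points.append((t, misses / len(positive_scores), false_alarms / negative_hours))
    return points


if __name__ == "__main__":
    pos = [0.9, 0.8, 0.4, 0.95]
    neg = [0.1, 0.6, 0.3, 0.2, 0.7]
    for point in roc_points(pos, neg, negative_hours=2.0, thresholds=[0.5, 0.75]):
        print(point)
```

Sweeping the threshold (or an engine's own sensitivity setting, where it has one) traces out the trade-off between missed detections and false alarms.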

ilyakava commented 3 years ago

Thanks for the quick response! I have another 3 questions:

1. Why so much silence/negative speech, rather than repeating the positive wakewords mixed with different noise?
2. How do you feel about using data augmentation like pitch/time shifting/dilation or reverb for test data? (I could PR this since I use it during training.)
3. Can you shed some light on the training data for the model or the *.ppn files? How many instances of the wakeword?

I ask question 3 because the detector I would like to integrate needs wakeword training data. For Alexa, I was using this exact Kaggle dataset with a speaker split for train/test. I guess to be totally compatible with this repo I would need to use some other training data, perhaps self-sourced?
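For readers unfamiliar with the term, a speaker split keeps every recording from a given speaker on one side of the train/test boundary. A minimal sketch follows, assuming each recording comes with a speaker ID; the function name, the `(speaker_id, path)` tuple layout, and the file names are hypothetical and not taken from the Kaggle dataset.

```python
import random
from collections import defaultdict


def speaker_split(recordings, test_fraction=0.2, seed=0):
    """Split (speaker_id, path) pairs so that no speaker appears in both sets.

    Returns (train_paths, test_paths).
    """
    by_speaker = defaultdict(list)
    for speaker, path in recordings:
        by_speaker[speaker].append(path)

    # Shuffle speakers, not recordings, so the split is speaker-disjoint.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = speakers[:n_test]
    train_speakers = speakers[n_test:]

    train = [p for s in train_speakers for p in by_speaker[s]]
    test = [p for s in test_speakers for p in by_speaker[s]]
    return train, test


if __name__ == "__main__":
    # Hypothetical file names: 5 speakers with 3 recordings each.
    data = [(f"spk{i}", f"spk{i}_{j}.wav") for i in range(5) for j in range(3)]
    train, test = speaker_split(data, test_fraction=0.2)
    print(len(train), len(test))
```

Splitting by speaker rather than by file avoids the same voice showing up in both training and evaluation, which would otherwise inflate accuracy.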

kenarsa commented 3 years ago

1- How else can you measure the false alarm rate?
2- Test data is recorded on mobile and there is some ambient noise in the reference data. The augmentations you mentioned will produce unrealistic effects, I think. You can use them as much as you like for training, though.
3- To set expectations: Picovoice is a for-profit startup, so I won't be able to reveal how we train things. I hope you understand. But I can tell you that we don't gather data for each new model. You'll get a better understanding of it once you play around with Picovoice Console a bit.

Yeah, you can't use the Kaggle dataset because it's basically the test set, but for the sake of this benchmark, you can use any available resources.
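On the false alarm rate point in answer 1, a small worked example may help show why a large amount of negative audio is needed. This is only a sketch of the arithmetic, with hypothetical function names; it is not how the benchmark itself is implemented.

```python
def false_alarms_per_hour(false_alarm_count, negative_seconds):
    """False alarm rate, normalized by the duration of wake-word-free audio."""
    if negative_seconds <= 0:
        raise ValueError("negative_seconds must be positive")
    return false_alarm_count / (negative_seconds / 3600.0)


def smallest_measurable_rate(negative_seconds):
    """Smallest nonzero rate the test set can resolve: one false alarm over all of it."""
    return false_alarms_per_hour(1, negative_seconds)


if __name__ == "__main__":
    # 3 false alarms in 2 hours of negative audio.
    print(false_alarms_per_hour(3, 2 * 3600))
    # Resolution available with 10 hours of negative audio.
    print(smallest_measurable_rate(10 * 3600))
```

With only a few minutes of negative audio, a single false alarm already corresponds to a very high hourly rate, so low false alarm rates can only be measured with many hours of speech and silence that do not contain the wake word.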