As the title says: added the Google Speech Commands dataset, with some design choices for preprocessing as described in the next section.
Proposed Changes
Added speech commands dataset
Dataset partitioning into train / test uses testing_list.txt from the dataset download
validation_list.txt is still provided so users can use a sampler in the DataLoader if a train/valid/test split is desired
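The train/test partitioning described above could be sketched as follows. This is a minimal illustration, not the actual implementation; the helper name `partition` and the toy paths are hypothetical.

```python
# Hypothetical sketch: split the full list of audio paths into train/test
# using the file list shipped as testing_list.txt in the dataset download.
def partition(all_paths, testing_list):
    test_set = set(testing_list)
    train = [p for p in all_paths if p not in test_set]
    test = [p for p in all_paths if p in test_set]
    return train, test

# Toy example (not real dataset files):
paths = ["yes/a.wav", "no/b.wav", "yes/c.wav"]
train, test = partition(paths, ["yes/c.wav"])
```

The same mechanism would apply to validation_list.txt if a three-way split is wanted.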
Using the include keyword to define a custom subset of the dataset. All other words in the dataset are marked as unknown (not sure if this adheres to the original split, but I couldn't find more information about this)
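The include behavior amounts to a catch-all label mapping; a minimal sketch (the helper name `label_for` is hypothetical, not part of the actual code):

```python
# Hypothetical sketch of the `include` behavior: any word not in the
# requested subset is collapsed into the catch-all "unknown" label.
def label_for(word, include):
    return word if word in include else "unknown"

include = ["yes", "no"]
```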
Using the silence keyword to include samples from the _background_noise_ folder
Discussion
silence: currently the Dataset simply points to samples in the _background_noise_ folder. These samples are 1 min long, whereas the speech commands are 1 sec long. My current workaround is to use RandomCrop with 16,000 samples, which handles this nicely. I don't think it would be efficient to chop up and store 1-second clips of the silence files.
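The RandomCrop workaround could look roughly like this: take a random 16,000-sample window (1 second at 16 kHz) from the much longer background-noise clip. A sketch with plain Python lists; the function name `random_crop` is illustrative, not the actual transform.

```python
import random

# Sketch of the RandomCrop workaround: pick a random 16,000-sample
# (1 second at 16 kHz) window from a longer background-noise waveform.
def random_crop(waveform, crop_len=16000):
    if len(waveform) <= crop_len:
        return waveform  # clip is already short enough
    start = random.randrange(len(waveform) - crop_len + 1)
    return waveform[start:start + crop_len]

# A 1-minute clip at 16 kHz has 960,000 samples.
minute_clip = [0.0] * 960000
one_second = random_crop(minute_clip)
```

This avoids materializing many 1-second files on disk: each epoch simply sees a different random slice of each noise clip.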