castorini / howl

Wake word detection modeling toolkit for Firefox Voice, supporting open datasets like Speech Commands and Common Voice.
Mozilla Public License 2.0

Dataset Generation Question #111

Open bdytx5 opened 2 years ago

bdytx5 commented 2 years ago

First off, thanks for this awesome repo! It's helping me a lot with my project!

Anyway, I'm a bit confused about how the program generates the samples it does. For example, I chose a single wake word and generated a dataset from the Speech Commands dataset. For the positive set, I get

Generate training datasets: 100%|██████████| 509/509 [01:03<00:00]
"Number of speakers in corpus: 1, average number of utterances per speaker: 518.0."

However, when I follow the rest of the generation steps, I end up with a dataset of 10K examples. I'm just a bit confused as to where these extra samples came from. Are they duplicates, or some sort of augmented versions of the originals? In the paper you mention: "For improved robustness and better quality, we implement a set of popular augmentation routines: time stretching, time shifting, synthetic noise addition, recorded noise mixing, SpecAugment (no time warping; Park et al., 2019), and vocal tract length perturbation (Jaitly and Hinton, 2013). These are readily extensible, so practitioners may easily add new augmentation modules."

I am mainly using this repo for dataset generation, so I wasn't sure if this was just referring to your model preprocessing, or if you also implemented it in your dataset generation code. I would dig through the code a bit more, but I figured this would be a pretty quick/straightforward question for you guys, and the answer might be useful for someone else down the line.

Thanks, Brett

ljj7975 commented 2 years ago

I haven't generated a dataset from Google Speech Commands in a long time.

What were the commands you used, and what was the target word?

The augmentations are applied in memory at training time. The dataset we generate contains the original audio; no augmentations have been applied to it yet.
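Roughly, the training-time pipeline looks something like this sketch (illustrative names only, not howl's actual API): each augmentation module is applied on the fly, in memory, so the dataset on disk never changes.

import random

import numpy as np

# Illustrative sketch only -- not howl's actual API.
def time_shift(audio: np.ndarray, max_shift: int = 1600) -> np.ndarray:
    """Shift the waveform by a random number of samples (wrapping, for brevity)."""
    return np.roll(audio, random.randint(-max_shift, max_shift))

def add_synthetic_noise(audio: np.ndarray, scale: float = 0.005) -> np.ndarray:
    """Mix in low-amplitude white noise."""
    return audio + scale * np.random.randn(len(audio))

AUGMENTATIONS = [time_shift, add_synthetic_noise]

def augment(audio: np.ndarray) -> np.ndarray:
    """Apply each module independently with 50% probability, in memory,
    at batch-construction time; nothing new is written to disk."""
    for module in AUGMENTATIONS:
        if random.random() < 0.5:
            audio = module(audio)
    return audio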

bdytx5 commented 2 years ago

Ahh sorry, I meant the Common Voice dataset, not Speech Commands...

I was generating a set for the word "pass"; here are the commands I used. These may not work for you, as I am using a previous version of the repo:

VOCAB='["pass"]' INFERENCE_SEQUENCE=[0] python -m training.run.generate_raw_audio_dataset -i /home/brett/datasets/common_voice/en --positive-pct 100 --negative-pct 0

mkdir -p datasets/pass/positive/alignment
pushd montreal-forced-aligner
./bin/mfa_align --num_jobs 12 ../datasets/pass/positive/audio librispeech-lexicon.txt pretrained_models/english.zip ../datasets/pass/positive/alignment
popd

(SKIP: I am going to reuse negative samples from other datasets) DATASET_PATH=datasets/pass/negative python -m training.run.attach_alignment --align-type stub

VOCAB='["pass"]' DATASET_PATH=datasets/pass/positive python -m training.run.attach_alignment --align-type mfa -i /home/brett/Desktop/howl/datasets/pass/positive/alignment

VOCAB='["pass"]' INFERENCE_SEQUENCE=[0] python -m training.run.stitch_vocab_samples --aligned-dataset "datasets/pass/positive" --stitched-dataset "data/pass-stitched"

ljj7975 commented 2 years ago

Would this script solve your issue?

https://github.com/castorini/howl/blob/master/generate_dataset.sh

Details can be found here: https://github.com/castorini/howl/tree/master/howl/dataset

bdytx5 commented 2 years ago

I don't really have an issue; it just seemed strange that I had 20x more samples after stitching than before. Do you see what I am saying?

boxabirds commented 2 years ago

Data augmentation generally results in a combinatorial increase in dataset size. You quoted the doc listing 6 augmentation methods, so the combinatorial expansion of that is 1 + 2 + 3 + 4 + 5 + 6 = 21. I can easily see from the docs how you'd get 21 additional variations.

I might be completely off piste here, but it makes sense to me, at least intuitively. The docs also say the augmentations are configurable, so you could check which ones are enabled and which you might want to exclude for your situation, though it looks like a pretty comprehensive set of manipulations (SpecAugment is particularly neat, as it's a relatively new technique).

ljj7975 commented 2 years ago

Sorry for the delay. I have other stuff going on, so I haven't been able to spend much time on this project.

I think you just caught a bug. @bdytx5 is right that it's coming from stitching.

It's supposed to mix the audio for each word (in the vocab) across samples in the dataset to create a new set of audio samples. If the wake word is "hello world" and we have 2 positive audio samples (sample 1 and sample 2), it's supposed to generate two more samples: "hello" from sample 1 with "world" from sample 2, and "hello" from sample 2 with "world" from sample 1. The new dataset is stored under the stitched folder.

The bug comes simply from not handling the base case where there is only one word in the vocab. If you listen to the generated audio files in the stitched folder, you will notice that they just contain the word you specified ("pass", based on your command).
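Roughly, the intended stitching behaves like this sketch (illustrative only, not the actual WordStitcher implementation); note how a one-word vocab can only ever reproduce the original segments:

from itertools import product

# Illustrative sketch, not the actual WordStitcher implementation.
# Each sample maps vocab word -> its aligned audio segment.
def stitch(vocab, samples):
    """Build new utterances by drawing each word's segment from any sample."""
    return [
        [samples[i][word] for word, i in zip(vocab, choice)]
        for choice in product(range(len(samples)), repeat=len(vocab))
    ]

two_word = stitch(["hello", "world"],
                  [{"hello": "h1", "world": "w1"},
                   {"hello": "h2", "world": "w2"}])
print(two_word)  # 4 utterances: the 2 originals plus the 2 cross-sample mixes

one_word = stitch(["pass"], [{"pass": "p1"}, {"pass": "p2"}])
print(one_word)  # [['p1'], ['p2']] -- the base case yields only the originals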

The detailed implementation of the stitching logic can be found in the WordStitcher class. You will probably find the stitch_vocab_samples.py script and the test case for WordStitcher useful as well.

I will fix the issue up sometime next week.