speeding up create_raw_dataset.py

castorini / howl

Wake word detection modeling toolkit for Firefox Voice, supporting open datasets like Speech Commands and Common Voice.

Mozilla Public License 2.0

199 stars, 30 forks

Open ljj7975 opened 3 years ago

ljj7975 commented 3 years ago

create_raw_dataset.py takes quite a long time to generate datasets.

I think multi-threading the AudioDatasetMetadataWriter write will do the job.

Also, this process terminates with a segfault.
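
As a rough illustration of the multi-threading idea above, the sketch below fans per-clip writes out over a thread pool. The `write_clip` and `write_all` helpers are hypothetical stand-ins and not howl's actual API. It assumes each clip can be written independently of the others; writes to a single shared metadata file would still need to happen in one thread or behind a lock.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def write_clip(out_dir: Path, name: str, payload: bytes) -> Path:
    """Hypothetical per-clip write; stands in for the real writer call."""
    path = out_dir / f"{name}.bin"
    path.write_bytes(payload)
    return path


def write_all(out_dir: Path, clips: dict, max_workers: int = 8) -> list:
    """Write every clip concurrently and return the written paths."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(write_clip, out_dir, name, payload)
            for name, payload in clips.items()
        ]
        # result() re-raises any exception raised inside a worker thread.
        return [future.result() for future in futures]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        clips = {f"clip{i}": bytes([i]) * 4 for i in range(16)}
        written = write_all(Path(tmp), clips)
        print(f"wrote {len(written)} files")
```

Threads mainly help when the per-clip work is I/O bound or releases the GIL. If the conversion is CPU-bound pure Python, swapping in `ProcessPoolExecutor` is likely the better fit.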

ljj7975 commented 3 years ago

When writing a dataset, the process function should also take in the sample (AudioClipExample) and use sample.audio_data when metadata.path does not exist (see https://github.com/castorini/howl/blob/master/howl/data/dataset/serialize.py#L67-L72).
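
The fallback described above could look roughly like the following. `ClipMetadata` and `ClipExample` are hypothetical stand-ins for howl's metadata and AudioClipExample types, and `audio_data` is assumed here to be raw mono 16-bit PCM bytes, which may not match the real representation.

```python
import shutil
import wave
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ClipMetadata:  # hypothetical stand-in for howl's metadata type
    path: Optional[Path]


@dataclass
class ClipExample:  # hypothetical stand-in for AudioClipExample
    metadata: ClipMetadata
    audio_data: bytes  # assumption: mono 16-bit PCM samples
    sample_rate: int = 16000  # assumption: 16 kHz audio


def process(sample: ClipExample, out_path: Path) -> str:
    """Copy the source file if it exists, else write the in-memory audio."""
    src = sample.metadata.path
    if src is not None and src.exists():
        shutil.copyfile(src, out_path)
        return "copied"
    # Source file is missing: serialize sample.audio_data as a WAV file.
    with wave.open(str(out_path), "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample.sample_rate)
        wav.writeframes(sample.audio_data)
    return "from_audio_data"
```

Returning which branch was taken is only for illustration; it makes it easy to log how many clips had no file on disk.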

ColonelThirtyTwo commented 3 years ago

A simple way to speed this up is to call out to ffmpeg in AudioDatasetWriter rather than doing the conversions in Python (which is slow).
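
A minimal sketch of shelling out to ffmpeg, assuming the target format is 16 kHz mono WAV (the thread does not state the actual target) and that `ffmpeg` is on the PATH. `build_ffmpeg_cmd` and `convert` are hypothetical helper names, not part of AudioDatasetWriter.

```python
import shutil
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: Path, dst: Path, sample_rate: int = 16000) -> list:
    """Build an ffmpeg command that resamples src to a mono file at dst."""
    return [
        "ffmpeg",
        "-y",                     # overwrite dst without prompting
        "-loglevel", "error",     # print errors only
        "-i", str(src),           # input file
        "-ar", str(sample_rate),  # output sample rate in Hz
        "-ac", "1",               # downmix to one channel
        str(dst),
    ]


def convert(src: Path, dst: Path, sample_rate: int = 16000) -> None:
    """Run ffmpeg; raises CalledProcessError if the conversion fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg executable not found on PATH")
    subprocess.run(build_ffmpeg_cmd(src, dst, sample_rate), check=True)
```

Because each conversion runs in its own process, calls to `convert` can also be issued from a thread pool without being held back by the GIL.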