castorini / howl

Wake word detection modeling toolkit for Firefox Voice, supporting open datasets like Speech Commands and Common Voice.
Mozilla Public License 2.0

speeding up create_raw_dataset.py #56

Open ljj7975 opened 3 years ago

ljj7975 commented 3 years ago

create_raw_dataset.py takes quite a long time to generate datasets.

I think multi-threading the AudioDatasetMetadataWriter write will do the job.

Also, this process terminates with a segfault.
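As a rough illustration of the multi-threading idea, here is a minimal sketch that fans file writes out over a thread pool. `write_clip` and `write_all` are hypothetical names, not howl's API, and the payload is plain bytes rather than real audio. Threads mainly help when the work is I/O-bound; CPU-bound conversion in pure Python would still be limited by the GIL.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Iterable, List, Tuple


def write_clip(item: Tuple[Path, bytes]) -> Path:
    """Write one clip to disk (hypothetical stand-in for a single writer call)."""
    path, payload = item
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return path


def write_all(items: Iterable[Tuple[Path, bytes]], max_workers: int = 8) -> List[Path]:
    """Write all clips concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(write_clip, items))
```

`pool.map` preserves input order, so metadata entries can still be matched to the written files by position.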

ljj7975 commented 3 years ago

The segfault was caused by numba: https://github.com/numba/numba/issues/4323

ljj7975 commented 3 years ago

I spent some time applying one of the multiprocessing packages, but the results weren't that good. Please refer to https://github.com/castorini/howl/tree/multi_processing_test

ljj7975 commented 3 years ago

When writing a dataset, the process function should also take in the sample (AudioClipExample) and use sample.audio_data when metadata.path does not exist (https://github.com/castorini/howl/blob/master/howl/data/dataset/serialize.py#L67-L72).
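A minimal sketch of that fallback, assuming simplified stand-in classes: `Metadata`, `Sample`, and `resolve_audio` are hypothetical and only mirror the `metadata.path` and `audio_data` attributes mentioned above, not the real howl types.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable


@dataclass
class Metadata:
    """Hypothetical stand-in for the clip metadata; only `path` is modeled."""
    path: Path


@dataclass
class Sample:
    """Hypothetical stand-in for AudioClipExample."""
    metadata: Metadata
    audio_data: Any


def resolve_audio(sample: Sample, read_file: Callable[[Path], Any]) -> Any:
    """Read audio from disk when the file exists, else use the in-memory data."""
    if sample.metadata.path.exists():
        return read_file(sample.metadata.path)
    return sample.audio_data
```

Passing the reader in as `read_file` keeps the sketch independent of any particular audio library.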

ColonelThirtyTwo commented 3 years ago

A simple way to speed this up is to call out to ffmpeg in AudioDatasetWriter rather than doing the conversions in Python (which is slow).
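For illustration, calling out to ffmpeg from Python could look roughly like the sketch below. The helper names are hypothetical, and the 16 kHz mono target is an assumption rather than something stated in this thread; it also assumes an `ffmpeg` binary is on the PATH.

```python
import subprocess
from typing import List


def build_ffmpeg_cmd(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that resamples `src` to mono at `sample_rate`."""
    return [
        "ffmpeg",
        "-y",                      # overwrite the output file if it exists
        "-loglevel", "error",      # only print errors
        "-i", src,                 # input file
        "-ar", str(sample_rate),   # output sample rate in Hz (assumed 16 kHz)
        "-ac", "1",                # one audio channel (mono)
        dst,
    ]


def convert(src: str, dst: str, sample_rate: int = 16000) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg exits non-zero."""
    subprocess.run(build_ffmpeg_cmd(src, dst, sample_rate), check=True)
```

Because each conversion is a separate process, several `convert` calls can also be run in parallel from a thread pool without being held back by the GIL.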