Open ljj7975 opened 3 years ago
The segfault was caused by numba: https://github.com/numba/numba/issues/4323
I spent some time trying one of the multiprocessing packages, but the results weren't that good. Please refer to https://github.com/castorini/howl/tree/multi_processing_test
When writing a dataset, the process function should also take in the sample (AudioClipExample) and use sample.audio_data when metadata.path does not exist (https://github.com/castorini/howl/blob/master/howl/data/dataset/serialize.py#L67-L72)
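A rough sketch of that fallback, to make the idea concrete. The `Metadata` and `Sample` classes below are hypothetical stand-ins, not howl's actual `AudioClipExample`, and `audio_data` is assumed here to be a plain sequence of mono floats in [-1, 1], which may not match the real type. Only the standard library is used.

```python
import shutil
import struct
import wave
from dataclasses import dataclass
from pathlib import Path
from typing import Sequence


@dataclass
class Metadata:
    """Hypothetical stand-in for the clip metadata."""
    path: Path


@dataclass
class Sample:
    """Hypothetical stand-in for AudioClipExample."""
    metadata: Metadata
    audio_data: Sequence[float]  # assumed: mono samples in [-1, 1]
    sample_rate: int = 16000


def process(sample: Sample, out_path: Path) -> str:
    """Write one clip, preferring the file on disk when it exists."""
    if sample.metadata.path.exists():
        # Source file is available: copy it as-is.
        shutil.copyfile(sample.metadata.path, out_path)
        return "copied"

    # No file on disk: fall back to the in-memory samples and
    # encode them as 16-bit PCM WAV.
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, x)) * 32767))
        for x in sample.audio_data
    )
    with wave.open(str(out_path), "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(sample.sample_rate)
        wav_file.writeframes(frames)
    return "written"
```

The return value only reports which branch was taken. The real writer would presumably need to handle tensors and the target format it already uses.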
A simple way to speed this up is to call out to ffmpeg in AudioDatasetWriter
rather than doing the conversions in Python (which is slow).
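A minimal sketch of what calling out to ffmpeg could look like, assuming ffmpeg is installed and on the PATH. The helper names (`build_ffmpeg_command`, `convert_with_ffmpeg`) and the mono 16 kHz target are assumptions for illustration, not part of howl. How this would hook into AudioDatasetWriter is left open.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_ffmpeg_command(src: Path, dst: Path, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that converts src to mono audio at sample_rate.

    The output format is inferred by ffmpeg from the dst file extension.
    """
    return [
        "ffmpeg",
        "-y",                    # overwrite dst if it already exists
        "-loglevel", "error",    # only print errors
        "-i", str(src),          # input file
        "-ac", "1",              # one audio channel (mono)
        "-ar", str(sample_rate), # output sample rate in Hz
        str(dst),
    ]


def convert_with_ffmpeg(src: Path, dst: Path, sample_rate: int = 16000) -> None:
    """Run the conversion in a subprocess; raises if ffmpeg is missing or fails."""
    if shutil.which("ffmpeg") is None:
        raise RuntimeError("ffmpeg was not found on PATH")
    subprocess.run(build_ffmpeg_command(src, dst, sample_rate), check=True)
```

For example, `convert_with_ffmpeg(Path("clip.mp3"), Path("clip.wav"))` would write a mono 16 kHz WAV next to the source. Whether this is faster than the current Python conversion would need to be measured.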
create_raw_dataset.py takes quite a long time to generate datasets.
I think multi-threading the AudioDatasetMetadataWriter write will do the job.
Also, this process terminates with a segfault.
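The multi-threading idea could be sketched roughly as below, using the standard library's `ThreadPoolExecutor`. `write_all` and the `write_one` callback are hypothetical names. The real write method of AudioDatasetMetadataWriter may not be thread-safe, so this only illustrates the pattern. Threads help mainly when the per-item work is I/O-bound or releases the GIL.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def write_all(
    items: Iterable[T],
    write_one: Callable[[T], R],
    max_workers: int = 4,
) -> List[R]:
    """Apply write_one to every item using a pool of worker threads.

    Results come back in the same order as the input items, and any
    exception raised by write_one is re-raised here.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(write_one, items))
```

If `write_one` appends to a shared metadata file, that part would need a lock or would have to stay on the main thread.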