Adds speed perturbation.

This introduces non-trivial changes to the pre-processing pipeline, so it's worth giving a bit of background on why I've taken this approach. Some of this is a repetition of an earlier slack message.

TL;DR: it is necessary to use `torchaudio.sox_effects_chain`, which adds a large amount of complexity.

## 'Simpler' alternatives to using `sox`

I've tried two other ways of performing speed perturbation:

1) Using another library, `librosa`. NVIDIA use this (https://github.com/ryanleary/mlperf-rnnt-ref/blob/fe0cc4145c240d4f8a8fe1814f397df63095e220/parts/perturb.py#L42)
2) Using `torchaudio.transforms.Resample` directly on the input tensor.

Both of these are very slow: 1) converts to the frequency domain and back, while 2) is even slower when upsampling the signal (i.e. when slowing the audio down). For reference, on copernicus the two methods are respectively x20 and x300 slower (!) than the `sox` implementation, and the dataloaders become the limiting factor during training. For comparison, the `sox` version does add some overhead, but this is acceptable (+25% time per epoch - and this includes the fact that some sequences are 15% longer).

A third potential method (which NVIDIA also use: https://github.com/ryanleary/mlperf-rnnt-ref/blob/fe0cc4145c240d4f8a8fe1814f397df63095e220/utils/preprocessing_utils.py#L52) is performing the perturbation offline. This seems like a poor choice to me since:

a) Each training sample has a fixed speed change, reducing augmentation effectiveness.
b) It isn't scalable with training set size (to 60k/100k hrs), as the multiple dataset copies won't fit on the disk of a single machine.
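For concreteness, the online sox approach boils down to sampling a speed factor per example and building a sox effect list. The helper below is a hypothetical sketch (not code from this PR), and the commented `apply_effects_file` call is one of the names torchaudio has used for its sox entry point - the exact API varies by release:

```python
import random

# Hypothetical helper (not from this PR): build a sox effect list for a
# randomly sampled speed factor. The `speed` effect alone also changes the
# sample rate sox reports, so it is conventionally followed by a `rate`
# effect that resamples back to the original rate.
def sample_speed_effects(rng, factors=(0.9, 1.0, 1.1), sample_rate=16000):
    factor = rng.choice(factors)
    return [["speed", str(factor)], ["rate", str(sample_rate)]]

# Applying the effects to a file would look roughly like (torchaudio's sox
# API has changed between releases; name shown here is the newer one):
#   waveform, sr = torchaudio.sox_effects.apply_effects_file(path, effects)

effects = sample_speed_effects(random.Random(0))
```

Because the factor is drawn fresh on every `__getitem__`, each epoch sees a different speed for the same utterance - the property the offline variant loses.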
## Necessary changes

The complexity added by `sox_effects_chain` is that it must be applied to a filepath rather than a tensor. To deal with this I've split the audio transforms into two types:

1) `pre_load_transforms` - speed perturbation is of this type
2) `post_load_transforms` - all previous transforms are of this type

FYI, the high-level API treats `speed_perturbation` in exactly the same way as the other steps, but I've found it necessary for the `builders` + `dataset` to have knowledge of the two transform types.

It is also necessary to add a `worker_init_fn` to avoid segfaults when `sox` is being used :scream: - I think the lack of this fn led samG to think that `sox` wasn't thread-safe.
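The two-type split can be sketched roughly as below. Class and argument names are illustrative assumptions, not the PR's actual code; the key point is that a pre-load transform consumes a *filepath* and returns the decoded samples (sox perturbs and decodes in one step), while post-load transforms map samples to samples:

```python
# Illustrative sketch of the pre/post-load transform split (names are
# assumptions, not this PR's actual classes).
class AudioDataset:
    def __init__(self, paths, loader, pre_load_transform=None,
                 post_load_transforms=()):
        self.paths = paths
        self.loader = loader                              # path -> samples
        self.pre_load_transform = pre_load_transform      # path -> samples
        self.post_load_transforms = post_load_transforms  # samples -> samples

    def __getitem__(self, idx):
        path = self.paths[idx]
        if self.pre_load_transform is not None:
            # e.g. sox speed perturbation: decodes the file itself
            samples = self.pre_load_transform(path)
        else:
            samples = self.loader(path)
        for transform in self.post_load_transforms:
            samples = transform(samples)
        return samples

# Tiny stand-in pipeline: "load" a fake waveform, then scale it post-load.
ds = AudioDataset(
    paths=["a.wav"],
    loader=lambda p: [1.0, 2.0],
    post_load_transforms=[lambda s: [x * 2 for x in s]],
)
item = ds[0]  # -> [2.0, 4.0]
```

This is why the builders + dataset need to know about the two types: only the dataset knows the filepath, so path-based transforms cannot be appended to the ordinary tensor-transform pipeline.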
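The `worker_init_fn` mechanism itself can be sketched as follows. `init_backend` is a stand-in callable, not a real torchaudio name - in practice it would be whatever per-process sox initialisation the library exposes. The point is that sox holds per-process global state, so each dataloader worker must run the initialisation itself rather than inherit it from the parent process; skipping this is what produced the segfaults:

```python
# Sketch of the worker_init_fn fix. `init_backend` is a stand-in for the
# sox initialisation call (NOT a real torchaudio name).
def make_worker_init_fn(init_backend):
    def worker_init_fn(worker_id):
        # called once in each freshly spawned/forked worker process
        init_backend()
    return worker_init_fn

# Stand-in backend so the sketch is runnable without torchaudio:
calls = []
init_fn = make_worker_init_fn(lambda: calls.append("sox initialised"))
for worker_id in range(2):  # the DataLoader would invoke this per worker
    init_fn(worker_id)
# calls == ["sox initialised", "sox initialised"]
```

The resulting function is passed as `torch.utils.data.DataLoader(..., worker_init_fn=init_fn)`, which explains the earlier confusion: without it, workers crash in a way that looks like a thread-safety bug rather than missing per-process setup.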