MyrtleSoftware / myrtlespeech


Speed Perturbation #28

Open julianmack opened 4 years ago

julianmack commented 4 years ago

Adds speed perturbation.

This introduces non-trivial changes to the pre-processing pipeline, so it's worth giving a bit of background on why I've taken this approach. Some of this repeats an earlier Slack message.

TL;DR: it is necessary to use torchaudio.sox_effects.SoxEffectsChain, which adds a large amount of complexity.

'Simpler' alternatives to using sox

I've tried two other ways of performing speed perturbation:

1. Using another library, librosa. NVIDIA use this (https://github.com/ryanleary/mlperf-rnnt-ref/blob/fe0cc4145c240d4f8a8fe1814f397df63095e220/parts/perturb.py#L42).
2. Using torchaudio.transforms.Resample directly on the input tensor (a sketch of this approach follows the timings below).

Both of these are very slow: approach 1 converts to the frequency domain and back, while approach 2 is even slower when upsampling the signal (i.e. slowing it down). For reference, on copernicus these methods are respectively ~20x and ~300x slower (!) than the sox implementation, and the dataloaders become the limiting factor during training. For comparison, the sox version does add some overhead, but this is acceptable (+25% time per epoch, and that includes the fact that some perturbed sequences are 15% longer).
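For concreteness, here is a minimal sketch of approach 2; the function name and speed range are illustrative, not code from this PR:

```python
import torch
import torchaudio

def speed_perturb_resample(
    waveform: torch.Tensor, sample_rate: int, factor: float
) -> torch.Tensor:
    """Change the speed of ``waveform`` by ``factor`` via resampling."""
    resample = torchaudio.transforms.Resample(
        orig_freq=sample_rate, new_freq=int(sample_rate / factor)
    )
    # The output has ~len(waveform)/factor samples; reinterpreted at the
    # original sample_rate it plays back factor-times faster (pitch shifts
    # as well, like sox's `speed` effect). factor < 1 upsamples, which is
    # the especially slow case noted above.
    return resample(waveform)
```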

A third potential method (which NVIDIA also use: https://github.com/ryanleary/mlperf-rnnt-ref/blob/fe0cc4145c240d4f8a8fe1814f397df63095e220/utils/preprocessing_utils.py#L52) is performing the perturbation offline. This seems like a poor choice to me since:

a) each training sample has a fixed speed change, reducing augmentation effectiveness;
b) it doesn't scale with training-set size (to 60k/100k hrs), as the multiple dataset copies won't fit on the disk of a single machine.
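To put rough numbers on b) (assuming 16 kHz, 16-bit mono audio, which isn't stated above): 100k hrs is 100,000 x 3600 s x 16,000 samples/s x 2 B ≈ 11.5 TB per copy, so the typical two extra speed-perturbed copies (e.g. 0.9x and 1.1x) would add roughly another 23 TB on top of the original data.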

Necessary changes

The complexity added by SoxEffectsChain is that it must be applied to a filepath rather than a tensor. To deal with this I've split the audio transforms into two types:

1. pre_load_transforms - speed_perturbation is of this type (sketched below)
2. post_load_transforms - all previous transforms are of this type
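As a rough sketch of what a pre_load_transform looks like - the class name, speed range, and old-style torchaudio.sox_effects.SoxEffectsChain usage here are my assumptions about the shape of the code, not a copy of it:

```python
import random

import torch
import torchaudio

class SpeedPerturbation:
    """Illustrative pre_load_transform: takes a filepath, returns a tensor."""

    def __init__(self, min_speed: float = 0.85, max_speed: float = 1.15,
                 sample_rate: int = 16000):
        self.min_speed = min_speed
        self.max_speed = max_speed
        self.sample_rate = sample_rate

    def __call__(self, filepath: str) -> torch.Tensor:
        # Assumes torchaudio.initialize_sox() has already been called in
        # this process (see the worker_init_fn note below).
        speed = random.uniform(self.min_speed, self.max_speed)
        chain = torchaudio.sox_effects.SoxEffectsChain()
        chain.set_input_file(filepath)
        # sox's `speed` effect changes tempo and pitch together; `rate`
        # resamples the output back to a fixed sample rate.
        chain.append_effect_to_chain("speed", [speed])
        chain.append_effect_to_chain("rate", [self.sample_rate])
        waveform, _ = chain.sox_build_flow_effects()
        return waveform
```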

FYI, the high-level API treats speed_perturbation in exactly the same way as the other steps, but I've found it necessary for the builders and dataset to have knowledge of the two transform types.
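At the dataset level the two types compose something like this (again a sketch, not the PR's actual class):

```python
import torch

class SpeechDataset(torch.utils.data.Dataset):
    """Illustrative dataset showing how the two transform types compose."""

    def __init__(self, filepaths, pre_load_transform, post_load_transform):
        self.filepaths = filepaths
        self.pre_load_transform = pre_load_transform    # filepath -> tensor
        self.post_load_transform = post_load_transform  # tensor -> features

    def __len__(self):
        return len(self.filepaths)

    def __getitem__(self, idx):
        # Pre-load transforms consume the path (sox reads the file itself)...
        waveform = self.pre_load_transform(self.filepaths[idx])
        # ...post-load transforms then operate on the loaded tensor, as all
        # transforms did previously.
        return self.post_load_transform(waveform)
```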

It is also necessary to add a worker_init_fn to avoid segfaults when sox is being used :scream: - I think the lack of this fn is what led samG to think that sox wasn't thread-safe.
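For reference, this is the kind of worker_init_fn that avoids the segfaults, assuming the root cause is that each DataLoader worker process needs its own sox initialization rather than inheriting the parent's state (my reading, not confirmed above):

```python
import torchaudio
from torch.utils.data import DataLoader

def sox_worker_init_fn(worker_id: int) -> None:
    # Each forked DataLoader worker initializes sox for itself; sharing the
    # parent process's sox state across workers is what appears to segfault.
    torchaudio.initialize_sox()

# dataset: e.g. the SpeechDataset sketched above
loader = DataLoader(
    dataset, batch_size=8, num_workers=4, worker_init_fn=sox_worker_init_fn
)
```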