Refactor/Improve SpecAugment

Note: this issue is for the MLH fellowship

SpecAugment is a form of structured dropout applied to a MelSpectrogram. It masks some contiguous time frames of the audio, as well as some contiguous ranges of frequencies.

It's often seen as an "augmentation" technique, but I think it could be implemented like dropout, as an nn.Module, and we could put one by default in the MelSpectrogram layer.
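To make the "like dropout" idea concrete, here is a minimal sketch of such a layer. The class name `SpecAugmentDropout`, the helper `_mask_span`, the `freq_mask_p` parameter, and the `(..., n_mels, time)` input layout are assumptions made for illustration; this is not fairseq's implementation. It masks a single span per axis and, like nn.Dropout, is the identity in eval mode.

```python
import torch
from torch import nn


class SpecAugmentDropout(nn.Module):
    """Hypothetical dropout-style layer: zeroes one time span and one frequency span."""

    def __init__(self, time_mask_p: float = 0.1, freq_mask_p: float = 0.1):
        super().__init__()
        self.time_mask_p = time_mask_p
        self.freq_mask_p = freq_mask_p

    @staticmethod
    def _mask_span(x: torch.Tensor, dim: int, p: float) -> torch.Tensor:
        dim = dim % x.dim()  # normalise negative dims
        size = x.shape[dim]
        width = int(p * size)  # span width as a fraction of the axis
        if width == 0:
            return x
        start = int(torch.randint(0, size - width + 1, (1,)))
        out = x.clone()  # leave the caller's tensor untouched
        out.narrow(dim, start, width).zero_()
        return out

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # Assumed layout: (..., n_mels, time).
        if not self.training:
            return spec  # no masking at inference time, as with nn.Dropout
        spec = self._mask_span(spec, dim=-1, p=self.time_mask_p)
        return self._mask_span(spec, dim=-2, p=self.freq_mask_p)
```

With this shape, switching the model to `eval()` disables the masking without touching the data pipeline.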

Fairseq has an implementation, but it's a bit naive: https://github.com/facebookresearch/fairseq/blob/1164a7fc432a188d401895018eaa85175fb06f9d/fairseq/data/audio/feature_transforms/specaugment.py#L13

I'd like to see a nicer version:

extract a function doing it on one dim: structured_dropout(x, dim, p, num_mask)
make the parameters more meaningful. It should be easier to compare a "time_mask_p" in SpecAugment with dropout "p".
try to make it faster by calling randint only once
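The three points above could be sketched roughly as follows. This is one possible reading, not a settled design: it assumes every span has the same fixed width, that `p` is an upper bound on the masked fraction along `dim` (overlapping spans mask less), and that the same mask is shared across all other dimensions, including the batch.

```python
import torch


def structured_dropout(x: torch.Tensor, dim: int, p: float, num_mask: int) -> torch.Tensor:
    """Zero `num_mask` contiguous spans along `dim`, covering at most a fraction `p`."""
    if not 0.0 <= p < 1.0:
        raise ValueError(f"p must be in [0, 1), got {p}")
    size = x.shape[dim]
    # Assumption: equal-width spans, so the total masked width is <= p * size.
    width = int(p * size / num_mask) if num_mask > 0 else 0
    if width == 0:
        return x
    # Single randint call: one start offset per span.
    starts = torch.randint(0, size - width + 1, (num_mask, 1), device=x.device)
    positions = torch.arange(size, device=x.device).unsqueeze(0)  # (1, size)
    # (num_mask, size) -> (size,): True where any span covers the position.
    mask = ((positions >= starts) & (positions < starts + width)).any(dim=0)
    # Reshape so the mask broadcasts along `dim` only.
    shape = [1] * x.dim()
    shape[dim] = size
    return x.masked_fill(mask.view(shape), 0.0)
```

Because `p` bounds the masked fraction directly, a value such as `p=0.1` on the time axis reads the same way as `p=0.1` in regular dropout.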

Possible follow-up: implement a very fast "SpecAugment-like" transform that masks the input with a regular pattern, using just reshape and slice assignment. Compare its speed with the previous implementation.
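A rough sketch of what the regular-pattern variant might look like, with no random numbers at all. The function name `regular_mask` and its `period`/`width` parameters are hypothetical; the masked fraction is about `width / period`, and any trailing remainder shorter than one period is left unmasked. No speed claim is made here; that would need benchmarking.

```python
import torch


def regular_mask(x: torch.Tensor, dim: int, period: int, width: int) -> torch.Tensor:
    """Zero the first `width` positions of every block of `period` positions along `dim`."""
    if period < 1 or not 0 <= width <= period:
        raise ValueError("need period >= 1 and 0 <= width <= period")
    # Work on a contiguous copy with `dim` moved last, so the view below is valid
    # and the input tensor is not modified.
    out = x.movedim(dim, -1).clone(memory_format=torch.contiguous_format)
    size = out.shape[-1]
    usable = size - size % period  # trailing remainder stays unmasked
    # View the last axis as (num_blocks, period); this shares storage with `out`.
    blocks = out[..., :usable].view(*out.shape[:-1], usable // period, period)
    blocks[..., :width] = 0.0  # slice assignment writes through to `out`
    return out.movedim(-1, dim)
```

For example, `regular_mask(spec, dim=-1, period=10, width=1)` would zero one time frame out of every ten.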

Refactor/Improve SpecAugment #4970