audeering / audtorch

Utils and data sets for audio and PyTorch
https://audeering.github.io/audtorch/

Time / Frequency Masking #38

Closed phtephanx closed 5 years ago

phtephanx commented 5 years ago

Feature

Time and frequency masking of SpecAugment (https://arxiv.org/pdf/1904.08779.pdf)

Algorithm

Given: maximum width of a block

Steps:

  1. sample width ~ uniform[0, maximum width)
  2. sample block's start index in signal: start ~ uniform[0, signal size - width)

So far, so good.
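A minimal sketch of these two sampling steps in PyTorch (function and variable names are only illustrative, not from the library):

```python
import torch

def sample_block(signal_size, max_width):
    """Sample one random block (start, width) to mask, as described above."""
    # width ~ uniform[0, max_width)
    width = int(torch.randint(0, max_width, (1,)))
    # start ~ uniform[0, signal_size - width)
    start = int(torch.randint(0, signal_size - width, (1,)))
    return start, width

# For example, one block within a spectrogram of 400 time frames:
start, width = sample_block(signal_size=400, max_width=30)
```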

Implementation

Masking only a single (short) block per signal might not be effective. Instead, one rather wants to mask several short blocks per signal.

One could go for:

  1. number of blocks per signal, e.g. 5 (like https://github.com/zcaceres/spec_augment/blob/master/SpecAugment.ipynb [0])
  2. fraction of signal masked

I find 2. more reasonable since the impact of 1. depends on the length of the signal: 5 blocks mask a larger fraction of a short signal than of a long one. Furthermore, in the implementation of [0], it is not guaranteed that blocks do not overlap. If blocks overlap, an entire stretch of relevant information could be lost.

Proposed Interface

RandomMask(coverage, max_width, min_dist, axis)
"""
Args:
       coverage (float): fraction of signal to mask
       max_width (int): maximum block size
       min_dist (int): minimum distance between blocks
       axis (int): axis to mask signal along
"""

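A rough sketch of how such a transform could look (purely illustrative; it assumes torch tensors, zeroes out the blocks, ignores min_dist, and assumes max_width is smaller than the signal size):

```python
import torch

class RandomMask:
    """Sketch of the proposed interface, not the actual audtorch implementation."""

    def __init__(self, coverage, max_width, min_dist=0, axis=-1):
        self.coverage = coverage    # fraction of the signal to mask
        self.max_width = max_width  # maximum block width (samples or frames)
        self.min_dist = min_dist    # minimum distance between blocks (not enforced here)
        self.axis = axis            # axis to mask the signal along

    def __call__(self, signal):
        signal = signal.clone()
        size = signal.shape[self.axis]
        masked = 0
        # Add random blocks until the requested coverage is (roughly) reached;
        # overlapping blocks are simply counted twice.
        while masked < self.coverage * size:
            width = int(torch.randint(1, self.max_width + 1, (1,)))
            start = int(torch.randint(0, size - width, (1,)))
            index = [slice(None)] * signal.dim()
            index[self.axis] = slice(start, start + width)
            signal[tuple(index)] = 0
            masked += width
        return signal
```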
How important do you find "number of blocks per signal" vs "fraction of signal masked"? And are overlapping blocks undesirable in your opinion?

hagenw commented 5 years ago

Thanks for starting a discussion on this.

I would also prefer fraction of signal masked as the setting, as it will be more meaningful when we change the signal length. It might not be exactly correct if we allow for completely random placing of the single blocks, as they might overlap. But in that case I would argue for ignoring that and always calculating the fraction by comparing number_of_blocks * block_width with the signal width.
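For illustration, the number of blocks could then be derived from the requested coverage like this (numbers are made up):

```python
import math

coverage = 0.1      # fraction of the signal to mask
block_width = 10    # (maximum) block width in samples or frames
signal_size = 500   # signal length along the masked axis

# Coverage is approximated as number_of_blocks * block_width / signal_size,
# ignoring possible overlaps between blocks.
number_of_blocks = math.ceil(coverage * signal_size / block_width)  # -> 5
```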

To make it simple and random I would drop the option that ensures blocks don't overlap. Otherwise we would have to discuss for min_dist how to measure it (block center or edge), and if we go for edge, whether there should be a way of specifying a negative min_dist to allow for overlapping blocks, and so forth.

In addition, I would make the max_width and axis arguments optional, this would change your interface proposal to:

RandomMask(coverage, *, max_width=, axis=-1)
"""
Args:
       coverage (float): fraction of signal to mask
       max_width (int): maximum block size
       axis (int): axis to mask signal along
"""

I would also add the following points/questions to your proposal:

  1. How should we measure max_width? In samples, time in seconds or as a fraction of the actual signal length as the parameter p seems to be doing in the paper you cited.
  2. Should we start implementing it directly in torch so we don't have to change anything later?

phtephanx commented 5 years ago
To make it simple and random I would drop the option that ensures blocks don't overlap.
Otherwise we would have to discuss for min_dist how to measure it (block center or edge), and if we go for edge, whether there should be a way of specifying a negative min_dist to allow for overlapping blocks, and so forth.

I see. The case of overlapping won't occur frequently for (common) low coverage, anyway. I might do some experiments with min_dist apart from this implementation.

How should we measure max_width? In samples, time in seconds

It might make sense to use masking also in the time domain (and not only in the frequency domain like in the paper), since the impact on the signal can be different (depending on block_width, window_size, hop_size). If we want to enable masking also in the time domain, then we could (a) not specify the unit, or (b) have e.g. one class for TimeDomainMask and another for FrequencyDomainMask.

or as a fraction of the actual signal length as the parameter p seems to be doing in the paper you cited.

I'd propose to use an absolute value for the width (either samples in the time domain or frames in the frequency domain) instead of a proportion p of the signal length. In ASR one might e.g. mask phonemes, and the length of phonemes does not vary so much. However, when we use the width as a fraction of the signal length, the width is large for long signals and small for short signals.

Should we start implementing it directly in torch so we don't have to change anything later?

Is it the strategy to avoid numpy and use only torch? Then yes, no objections.

hagenw commented 5 years ago

How should we measure max_width? In samples, time in seconds

It might make sense to use masking also in the time domain (and not only in the frequency domain like in the paper), since the impact on the signal can be different (depending on block_width, window_size, hop_size). If we want to enable masking also in the time domain, then we could (a) not specify the unit, or (b) have e.g. one class for TimeDomainMask and another for FrequencyDomainMask.

I think even for the FrequencyDomainMask we have to ask if we should use a unit or not. If you specify your masks in samples and then change the sampling rate, fewer or more parts are masked. So far we used samples, e.g. in RandomCrop. For now I would propose to do the same here.
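For illustration, specifying the width in seconds would only shift the conversion into the transform (parameter names are hypothetical):

```python
sampling_rate = 16000     # Hz
max_width_seconds = 0.05  # 50 ms

# With a unit of seconds the transform would convert to samples internally,
# so the masked portion stays the same when the sampling rate changes:
max_width_samples = int(max_width_seconds * sampling_rate)  # -> 800

# With a unit of samples (as in RandomCrop), the user passes 800 directly
# and has to adjust it when switching to data with another sampling rate.
```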

or as a fraction of the actual signal length as the parameter p seems to be doing in the paper you cited.

I'd propose to use an absolute value for the width (either samples in the time domain or frames in the frequency domain) instead of a proportion p of the signal length. In ASR one might e.g. mask phonemes, and the length of phonemes does not vary so much. However, when we use the width as a fraction of the signal length, the width is large for long signals and small for short signals.

Good point, I would also downvote signal proportion then.

Should we start implementing it directly in torch so we don't have to change anything later?

Is it the strategy to avoid numpy and use only torch? Then yes, no objections.

Yes, and with torch you can decide yourself if it should run on CPU or GPU.

BTW, in keunwoochoi/torchaudio-contrib they had the idea of adding an option where you can decide if the same masking should be applied to all samples in a batch or not.

Regarding the naming of RandomMask, I guess we have to be a little bit more specific, as it can be applied in different domains and along different axes, so maybe something like this?

MaskSpectrogramTime
MaskSpectrogramFrequency
MaskSignal or MaskWaveform

phtephanx commented 5 years ago
I think even for the FrequencyDomainMask we have to ask if we should use a unit or not. If you specify your masks in samples and then change the sampling rate, fewer or more parts are masked.
So far we used samples, e.g. in RandomCrop. For now I would propose to do the same here.

I see. The sampling_rate should be consistent (i.e. constant) throughout the entire training data set, shouldn't it? We wouldn't do RandomResampling? If sr is constant, then the parameter max_width is also consistent for the entire training data set. It's dependent on the sampling_rate, yes. If we leave it up to the user to choose a reasonable max_width parameter for their sampling_rate, then we wouldn't need to resolve it in the function. Is that what you mean?

BTW, in keunwoochoi/torchaudio-contrib they had the idea of adding an option where you can decide if the same masking should be applied to all samples in a batch or not.

They are treating their transformations as layers, right? Then it's easier in their code to implement that. I found one corresponding issue and the PR in their repository, but no discussion of why it makes sense to apply the same masking to all samples in the batch. I don't see any reason to apply the same masking across the batch except a reduction of computational effort. It should be fast enough, anyway. So, I'd always create a new random masking for each data point. Can you think of any other advantage?
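For reference, the difference boils down to whether the random indices are drawn once per batch or once per sample; a rough sketch (not taken from either repository):

```python
import torch

batch = torch.randn(8, 128, 400)  # (batch, frequency, time)

# Shared mask: draw one block and apply it to every sample in the batch
start, width = 100, 20
batch_shared = batch.clone()
batch_shared[:, :, start:start + width] = 0

# Per-sample mask: draw a new block for each sample
batch_individual = batch.clone()
for i in range(batch_individual.size(0)):
    width = int(torch.randint(1, 30, (1,)))
    start = int(torch.randint(0, batch_individual.size(-1) - width, (1,)))
    batch_individual[i, :, start:start + width] = 0
```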

phtephanx commented 5 years ago
Regarding the naming of RandomMask, I guess we have to be a little bit more specific as it can be applied in different domains and along different axis, so maybe something like this?

I agree with the naming. Still, what will be different in those functions except the variable names? Also, axis could be an argument like in RandomCrop.

We could resolve the axis for the user, since the SpectrogramTime and SpectrogramFrequency axes are determined by librosa.magphase, right? Plus, it will be more readable to specify Compose([MaskSpectrogramTime(foo), MaskSpectrogramFrequency(bar)]) instead of Compose([RandomMask(axis=-1), RandomMask(axis=1)]).

hagenw commented 5 years ago

I think even for the FrequencyDomainMask we have to ask if we should use a unit or not. If you specify your masks in samples and then change the sampling rate, fewer or more parts are masked. So far we used samples, e.g. in RandomCrop. For now I would propose to do the same here.

I see. The sampling_rate should be consistent (i.e. constant) throughout the entire training data set, shouldn't it? We wouldn't do RandomResampling? If sr is constant, then the parameter max_width is also consistent for the entire training data set. It's dependent on the sampling_rate, yes. If we leave it up to the user to choose a reasonable max_width parameter for their sampling_rate, then we wouldn't need to resolve it in the function. Is that what you mean?

I had in mind training the same model but switching to another data set with a different sampling rate. But as seen in other places this has other problems as well, and for now I would propose to stick with samples.

BTW, in keunwoochoi/torchaudio-contrib they had the idea of adding an option where you can decide if the same masking should be applied to all samples in a batch or not.

They are treating their transformations as layers, right? Then it's easier in their code to implement that. I found one corresponding issue and the PR in their repository, but no discussion of why it makes sense to apply the same masking to all samples in the batch. I don't see any reason to apply the same masking across the batch except a reduction of computational effort. It should be fast enough, anyway. So, I'd always create a new random masking for each data point. Can you think of any other advantage?

No, I did not really think about it, I just wanted to point you to that example. If you don't see any advantage just leave it out.

Regarding the layer stuff, I'm wondering whether you could still use a transform in the normal way on a data set if we implement it as a layer. It would of course provide some flexibility if you could switch between making the transform part of your model or not. But as we don't have this option for the other transforms at the moment, I would propose to not start implementing it as a layer. Or what is your take on that?

hagenw commented 5 years ago

Regarding the naming of RandomMask, I guess we have to be a little bit more specific, as it can be applied in different domains and along different axes, so maybe something like this?

I agree with the naming. Still, what will be different in those functions except the variable names? Also, axis could be an argument like in RandomCrop.

We could resolve the axis for the user, since the SpectrogramTime and SpectrogramFrequency axes are determined by librosa.magphase, right? Plus, it will be more readable to specify Compose([MaskSpectrogramTime(foo), MaskSpectrogramFrequency(bar)]) instead of Compose([RandomMask(axis=-1), RandomMask(axis=1)]).

I agree with both of your comments. The underlying principle is always the same, and we should be able to have just one functional for all those transforms. For the transforms themselves I would create different ones, as the code reads better and the documentation is also easier to read if you can talk directly about time or frequency and not an abstract dimension.
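A sketch of that structure, one shared functional plus thin transforms on top of it (all names are hypothetical and follow the discussion above; spectrograms are assumed to have shape (..., frequency, time)):

```python
import torch

def mask_along_axis(signal, coverage, max_width, axis):
    """Zero out random blocks along `axis` until roughly `coverage` is masked."""
    signal = signal.clone()
    size = signal.shape[axis]
    masked = 0
    while masked < coverage * size:
        width = int(torch.randint(1, max_width + 1, (1,)))
        start = int(torch.randint(0, size - width, (1,)))
        index = [slice(None)] * signal.dim()
        index[axis] = slice(start, start + width)
        signal[tuple(index)] = 0
        masked += width
    return signal

class MaskSpectrogramTime:
    """Mask random blocks of time frames (axis -1)."""
    def __init__(self, coverage, max_width):
        self.coverage, self.max_width = coverage, max_width
    def __call__(self, spectrogram):
        return mask_along_axis(spectrogram, self.coverage, self.max_width, axis=-1)

class MaskSpectrogramFrequency:
    """Mask random blocks of frequency bins (axis -2)."""
    def __init__(self, coverage, max_width):
        self.coverage, self.max_width = coverage, max_width
    def __call__(self, spectrogram):
        return mask_along_axis(spectrogram, self.coverage, self.max_width, axis=-2)
```

A Compose([MaskSpectrogramTime(...), MaskSpectrogramFrequency(...)]) would then read as in the comment above.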

I have dropped Random from my proposals. Maybe we should add it again, as so far all our random transforms contained it; it would then be

RandomMaskSpectrogramTime
RandomMaskSpectrogramFrequency
RandomMaskSignal or RandomMaskWaveform

phtephanx commented 5 years ago
I had in mind training the same model but switching to another data set with a different sampling rate. But as seen in other places this has other problems as well, and for now I would propose to stick with samples.

Good.

Regarding the layer stuff, I'm wondering whether you could still use a transform in the normal way on a data set if we implement it as a layer. It would of course provide some flexibility if you could switch between making the transform part of your model or not. But as we don't have this option for the other transforms at the moment, I would propose to not start implementing it as a layer. Or what is your take on that?

I think, in a naive way, it's straightforward to make it work both as a layer and as part of our current dataloader preprocessing. In a for-loop one would slice along the batch dimension (thereby removing the batch dimension) and hand the single sample to our transformations. Then, however, we won't benefit from the GPU's parallelization of matrix ops. Maybe there won't be any speed-up.

It's tricky to make the transform agnostic to the batch dimension, i.e. make it work independently of whether the tensor has a batch dimension or not. For now I'd also propose to start with our current approach.
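A sketch of that naive loop over the batch dimension, assuming a per-sample transform like the ones discussed above (illustrative only):

```python
import torch

def apply_per_sample(transform, batch):
    """Apply a per-sample transform to every sample of a batched tensor.

    Looping over the batch dimension works both inside a model (layer-like)
    and in dataloader preprocessing, but does not benefit from parallelised
    batch operations on the GPU.
    """
    return torch.stack([transform(sample) for sample in batch])
```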

I agree with both of your comments. The underlying principle is always the same, and we should be able to have just one functional for all those transforms. For the transforms themselves I would create different ones, as the code reads better and the documentation is also easier to read if you can talk directly about time or frequency and not an abstract dimension.

Good. I'll propose a PR and we can continue discussing there.

RandomMaskSpectrogramTime
RandomMaskSpectrogramFrequency
RandomMaskSignal or RandomMaskWaveform

Hmm. Maybe we'd better drop the term Random. We probably won't implement a non-random masking.

hagenw commented 5 years ago
RandomMaskSpectrogramTime
RandomMaskSpectrogramFrequency
RandomMaskSignal or RandomMaskWaveform

Hmm. Maybe we'd better drop the term Random. We probably won't implement a non-random masking.

Yes, makes sense, go ahead and make a PR.