Thanks for starting a discussion on this.
I would also prefer fraction of signal masked as the setting, as it will be more meaningful when we change the signal length. It might not be exactly correct if we allow for completely random placement of the single blocks, since they might overlap. But in that case I would argue for ignoring that and always calculating the fraction by comparing `number_of_blocks * block_width` with the signal width.
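For example (the numbers here are mine, purely to illustrate the definition):

```python
number_of_blocks = 5
block_width = 200      # samples per block
signal_width = 16000   # samples in the signal

# Fraction of the signal that is masked, ignoring possible overlaps.
coverage = number_of_blocks * block_width / signal_width   # 0.0625
```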
To make it simple and random I would drop the option that ensures blocks don't overlap. Otherwise we would have to discuss how to measure `min_dist` (from the block center or edge), and, if we go for the edge, whether there should be a way of specifying a negative `min_dist` to allow for overlapping blocks, and so forth.
In addition, I would make the `max_width` and `axis` arguments optional; this would change your interface proposal to:
```
RandomMask(coverage, *, max_width=, axis=-1)
"""
Args:
    coverage (float): fraction of signal to mask
    max_width (int): maximum block size
    axis (int): axis to mask signal along
"""
```
I would also add the following points/questions to your proposal:

- How should we measure `max_width`? In samples, time in seconds, or as a fraction of the actual signal length as the parameter `p` seems to be doing in the paper you cited.
- Should we start implementing it directly in torch so we don't have to change anything later?

> To make it simple and random I would drop the option that ensures blocks don't overlap. Otherwise we would have to discuss how to measure `min_dist` (from the block center or edge), and, if we go for the edge, whether there should be a way of specifying a negative `min_dist` to allow for overlapping blocks, and so forth.
I see. The case of overlapping won't occur frequently for (common) low coverage, anyway. I might do some experiments with `min_dist` apart from this implementation.
> How should we measure `max_width`? In samples, time in seconds
It might make sense to use masking also in the time domain (and not only in the frequency domain like in the paper), since the impact on the signal can be different (depending on `block_width`, `window_size`, `hop_size`).
If we want to enable masking also in the time domain, then we could (a) not specify the unit, or (b) have e.g. one class for `TimeDomainMask` and another for `FrequencyDomainMask`.
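As a rough numeric illustration of why the two domains behave differently (numbers are mine, not from this discussion): a block of a given width hides a very different amount of signal depending on whether it is applied to waveform samples or to spectrogram frames.

```python
sampling_rate = 16000   # Hz (illustrative)
hop_size = 256          # samples between STFT frames (illustrative)
block_width = 10        # width of one masked block

# 10 masked spectrogram frames hide roughly block_width * hop_size samples ...
seconds_masked_in_spectrogram = block_width * hop_size / sampling_rate  # 0.16 s
# ... while 10 masked waveform samples hide almost nothing.
seconds_masked_in_waveform = block_width / sampling_rate                # ~0.0006 s
```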
> or as a fraction of the actual signal length as the parameter `p` seems to be doing in the paper you cited.
I'd propose to use an absolute value for the width (either samples in the time domain or frames in the frequency domain) instead of a proportion `p` of the signal length. In ASR one might e.g. mask phonemes, and the length of phonemes does not vary so much. However, when we use the width as a fraction of the signal length, the width is large for long signals and small for short signals.
> Should we start implementing it directly in torch so we don't have to change anything later?
Is it the strategy to avoid numpy and use only torch? Then yes, no objections.
> > How should we measure `max_width`? In samples, time in seconds
>
> It might make sense to use masking also in the time domain (and not only in the frequency domain like in the paper), since the impact on the signal can be different (depending on `block_width`, `window_size`, `hop_size`). If we want to enable masking also in the time domain, then we could (a) not specify the unit, or (b) have e.g. one class for `TimeDomainMask` and another for `FrequencyDomainMask`.
I think even for the `FrequencyDomainMask` we have to ask whether we should use a unit or not. If you specify your masks in samples and then change the sampling rate, more or less of the signal ends up being masked.
So far we have used samples, e.g. in `RandomCrop`. For now I would propose to do the same here.
> > or as a fraction of the actual signal length as the parameter `p` seems to be doing in the paper you cited.
>
> I'd propose to use an absolute value for the width (either samples in the time domain or frames in the frequency domain) instead of a proportion `p` of the signal length. In ASR one might e.g. mask phonemes, and the length of phonemes does not vary so much. However, when we use the width as a fraction of the signal length, the width is large for long signals and small for short signals.
Good point, I would also down vote signal proportion then.
> > Should we start implementing it directly in torch so we don't have to change anything later?
>
> Is it the strategy to avoid numpy and use only torch? Then yes, no objections.
Yes, and with torch you can decide yourself if it should run on CPU or GPU.
BTW, in keunwoochoi/torchaudio-contrib they had the idea of adding an option where you can decide if the same masking should be applied to all batches or not.
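Just to illustrate that point with a minimal sketch (not from this thread): with torch the same masking code runs on CPU or GPU depending only on where the tensor lives.

```python
import torch

signal = torch.randn(16000)        # lives on the CPU
if torch.cuda.is_available():
    signal = signal.to("cuda")     # same code, now on the GPU

# Zero out an arbitrary block; works identically on either device.
signal[2000:2400] = 0
```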
Regarding the naming of `RandomMask`, I guess we have to be a little bit more specific as it can be applied in different domains and along different axes, so maybe something like this?

- `MaskSpectrogramTime`
- `MaskSpectrogramFrequency`
- `MaskSignal` or `MaskWaveform`
> I think even for the `FrequencyDomainMask` we have to ask whether we should use a unit or not. If you specify your masks in samples and then change the sampling rate, more or less of the signal ends up being masked. So far we have used samples, e.g. in `RandomCrop`. For now I would propose to do the same here.
I see. The `sampling_rate` should be consistent (i.e. constant) throughout the entire training data set, shouldn't it? We wouldn't do `RandomResampling`? If `sr` is constant, then the parameter `max_width` is also consistent for the entire training data set. It's dependent on the `sampling_rate`, yes. If we leave it up to the user to choose a reasonable `max_width` parameter for his `sampling_rate`, then we wouldn't need to resolve it in the function. Is that what you mean?
> BTW, in keunwoochoi/torchaudio-contrib they had the idea of adding an option where you can decide if the same masking should be applied to all batches or not.
They are treating their transformations as layers, right? Then it's easier in their code to implement that. I found one corresponding issue and the PR in their repository but no discussion why it makes sense to apply the same masking to all samples in the batch. I don't see any reason to apply the same masking across the batch except reduction of computational effort. It should be fast enough, anyway. So, I'd always create a new random masking for each data point. Can you think of any other advantage?
> Regarding the naming of `RandomMask`, I guess we have to be a little bit more specific as it can be applied in different domains and along different axes, so maybe something like this?
I agree with the naming. Still, what will be different in those functions except the variable names? Also, `axis` could be an argument like in `RandomCrop`.
We could resolve the `axis` for the user, since the `SpectrogramTime` and `SpectrogramFrequency` axes depend on `librosa.magphase`, right? Plus, it will be more readable to specify `Compose([MaskSpectrogramTime(foo), MaskSpectrogramFrequency(bar)])` instead of `Compose([RandomMask(axis=-1), RandomMask(axis=1)])`.
> > I think even for the `FrequencyDomainMask` we have to ask whether we should use a unit or not. If you specify your masks in samples and then change the sampling rate, more or less of the signal ends up being masked. So far we have used samples, e.g. in `RandomCrop`. For now I would propose to do the same here.
>
> I see. The `sampling_rate` should be consistent (i.e. constant) throughout the entire training data set, shouldn't it? We wouldn't do `RandomResampling`? If `sr` is constant, then the parameter `max_width` is also consistent for the entire training data set. It's dependent on the `sampling_rate`, yes. If we leave it up to the user to choose a reasonable `max_width` parameter for his `sampling_rate`, then we wouldn't need to resolve it in the function. Is that what you mean?
I had in mind training the same model but switching to another data set with a different sampling rate. But as seen in other places this has other problems as well, and for now I would propose to stick with samples.
> > BTW, in keunwoochoi/torchaudio-contrib they had the idea of adding an option where you can decide if the same masking should be applied to all batches or not.
>
> They are treating their transformations as layers, right? Then it's easier in their code to implement that. I found one corresponding issue and the PR in their repository but no discussion why it makes sense to apply the same masking to all samples in the batch. I don't see any reason to apply the same masking across the batch except reduction of computational effort. It should be fast enough, anyway. So, I'd always create a new random masking for each data point. Can you think of any other advantage?
No, I did not really think about it, I just wanted to point you to that example. If you don't see any advantage just leave it out.
Regarding the layer stuff, I'm wondering if you could still use a transform in the normal way on a data set when we implement it as a layer or not. It would of course provide some flexibility if you could switch between making the transform part of your model or not. But as we don't have this option for the other transforms at the moment, I would propose to not start implementing it as a layer. Or what is your take on that?
> > Regarding the naming of `RandomMask`, I guess we have to be a little bit more specific as it can be applied in different domains and along different axes, so maybe something like this?
>
> I agree with the naming. Still, what will be different in those functions except the variable names? Also, `axis` could be an argument like in `RandomCrop`. We could resolve the `axis` for the user, since the `SpectrogramTime` and `SpectrogramFrequency` axes depend on `librosa.magphase`, right? Plus, it will be more readable to specify `Compose([MaskSpectrogramTime(foo), MaskSpectrogramFrequency(bar)])` instead of `Compose([RandomMask(axis=-1), RandomMask(axis=1)])`.
I agree with both of your comments. The underlying principle is always the same and we should be able to have just one functional for all those transforms. For the transforms themselves I would create different ones, as the code reads better and the documentation is also easier to read if you can talk directly about time or frequency and not about an abstract dimension.
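A rough sketch of what that split could look like, purely to illustrate the idea of one shared in-place functional plus thin, well-named transforms. Names, defaults, the zero masking value and the assumed (..., frequency, time) spectrogram layout are my assumptions, not the agreed interface:

```python
import torch


def random_mask_(tensor, coverage, max_width, axis=-1):
    """Zero out random blocks along `axis` (in place) until roughly
    `coverage` of that axis is covered; blocks may overlap."""
    size = tensor.shape[axis]
    max_width = min(max_width, size)
    # The expected block width is about max_width / 2, so this roughly
    # reaches the requested coverage (overlaps are simply ignored).
    num_blocks = max(1, round(coverage * size / (max_width / 2)))
    for _ in range(num_blocks):
        width = int(torch.randint(1, max_width + 1, (1,)))
        start = int(torch.randint(0, size - width + 1, (1,)))
        index = [slice(None)] * tensor.dim()
        index[axis] = slice(start, start + width)
        tensor[tuple(index)] = 0
    return tensor


class MaskSpectrogramTime:
    """Mask random blocks of time frames of a spectrogram."""

    def __init__(self, coverage, *, max_width=10):
        self.coverage = coverage
        self.max_width = max_width

    def __call__(self, spectrogram):
        # Assumes a (..., frequency, time) layout, so time is the last axis.
        return random_mask_(spectrogram, self.coverage, self.max_width, axis=-1)


class MaskSpectrogramFrequency:
    """Mask random blocks of frequency bins of a spectrogram."""

    def __init__(self, coverage, *, max_width=10):
        self.coverage = coverage
        self.max_width = max_width

    def __call__(self, spectrogram):
        return random_mask_(spectrogram, self.coverage, self.max_width, axis=-2)
```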
I have dropped `Random` from my proposals. Maybe we should add it again, as so far all our random transforms contained it; it would then be

- `RandomMaskSpectrogramTime`
- `RandomMaskSpectrogramFrequency`
- `RandomMaskSignal` or `RandomMaskWaveform`

Too long?

> I had in mind training the same model but switching to another data set with a different sampling rate. But as seen in other places this has other problems as well, and for now I would propose to stick with samples.
Good.
> Regarding the layer stuff, I'm wondering if you could still use a transform in the normal way on a data set when we implement it as a layer or not. It would of course provide some flexibility if you could switch between making the transform part of your model or not. But as we don't have this option for the other transforms at the moment, I would propose to not start implementing it as a layer. Or what is your take on that?
I think, in a naive way, it's straightforward to make it work both as a layer and as part of our current dataloader preprocessing. In a for-loop one would slice along the batch dimension (thereby removing the batch dimension) and hand the single sample to our transformations. Then, however, we won't benefit from the GPU's parallelization of matrix ops, so maybe there won't be any speed-up.
It's tricky to make the transform agnostic to the batch dimension, i.e. make it work independently of whether the tensor has a batch dimension or not. For now I'd also propose to start with our current approach.
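A naive sketch of that for-loop approach (the wrapper name is hypothetical; `transform` is any per-sample callable such as a masking transform):

```python
import torch


class ApplyPerSample(torch.nn.Module):
    """Wrap a per-sample transform so it can be used as a layer on
    batched input by looping over the batch dimension."""

    def __init__(self, transform):
        super().__init__()
        self.transform = transform

    def forward(self, batch):
        # Slice along dim 0, apply the transform per sample, restack.
        return torch.stack([self.transform(sample) for sample in batch])
```

Applied per sample like this there is of course no batched parallelism on the GPU, which is exactly the speed concern above.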
> I agree with both of your comments. The underlying principle is always the same and we should be able to have just one functional for all those transforms. For the transforms themselves I would create different ones, as the code reads better and the documentation is also easier to read if you can talk directly about time or frequency and not about an abstract dimension.
Good. I'll propose a PR and we can continue discussing there.
> - `RandomMaskSpectrogramTime`
> - `RandomMaskSpectrogramFrequency`
> - `RandomMaskSignal` or `RandomMaskWaveform`
Hmm. Maybe we'd better drop the term `Random`. We probably won't implement a non-random masking.
> > - `RandomMaskSpectrogramTime`
> > - `RandomMaskSpectrogramFrequency`
> > - `RandomMaskSignal` or `RandomMaskWaveform`
>
> Hmm. Maybe we'd better drop the term `Random`. We probably won't implement a non-random masking.
Yes, makes sense, go ahead and make a PR.
Feature

Time and frequency masking of SpecAugment (https://arxiv.org/pdf/1904.08779.pdf).

Algorithm
Given:

- maximum width of a block

Steps:

- width ~ uniform[0, maximum width)
- start ~ uniform[0, signal size - width)

So far, so good.
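In torch, those two steps could look roughly like this for a 1-d signal (the zero masking value and the numbers are placeholders of mine):

```python
import torch

signal = torch.randn(16000)   # example 1-d signal
max_width = 400               # maximum block width in samples

# width ~ uniform[0, maximum width), start ~ uniform[0, signal size - width)
width = int(torch.randint(0, max_width, (1,)))
start = int(torch.randint(0, len(signal) - width, (1,)))

# Zero out the selected block.
signal[start:start + width] = 0
```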
Implementation
Masking only a single (short) block per signal might not be effective. Instead, one rather wants to mask several short blocks per signal.
One could go for:

1. a fixed number of blocks per signal
2. a fixed fraction of the signal masked

I find 2. more reasonable since the impact of 1. depends on the length of the signal: 5 blocks mask a larger fraction in a short signal than in a long one. Furthermore, in the implementation of [0], it's not guaranteed that blocks do not overlap. If blocks overlap, the entire content of relevant information could be lost.
Proposed Interface
How important do you find "number of blocks per signal" vs "fraction of signal masked"? And are overlapping blocks undesired in your opinion?