iver56 closed this issue 3 years ago
How should we represent multichannel audio in spectrograms, though? I would suggest channels last, since that is very common in computer vision and is compatible with the way conv2d is implemented in PyTorch and TensorFlow.
For waveforms, I'd prefer (batch, channels, samples)
I would also suggest channels first for the multichannel spectrograms:
audio: (batch, channels, samples) -> spec-like: (batch, channels, freqs, time)
This would at least be consistent with Asteroid's filterbank API and the view of time-frequency (TF) transforms as 1D convolutions. CMIIW, but Conv2d is channels first by default in PyTorch.
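To make the proposed shape convention concrete, here is a minimal sketch in NumPy. It is not the Asteroid or torch-audiomentations implementation; `magnitude_spectrogram`, `n_fft` and `hop` are hypothetical names chosen for illustration, and the naive framed FFT only stands in for whatever time-frequency transform is actually used.

```python
import numpy as np


def magnitude_spectrogram(audio, n_fft=256, hop=128):
    """Map (batch, channels, samples) -> (batch, channels, freqs, time).

    Hypothetical helper: a naive Hann-windowed STFT magnitude.
    """
    batch, channels, samples = audio.shape
    num_frames = 1 + (samples - n_fft) // hop
    window = np.hanning(n_fft)
    # Index matrix of shape (time, n_fft) selecting overlapping frames.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(num_frames)[:, None]
    frames = audio[..., idx] * window  # (batch, channels, time, n_fft)
    spec = np.abs(np.fft.rfft(frames, axis=-1))  # (batch, channels, time, freqs)
    # Put frequency before time; batch and channels stay in front.
    return np.swapaxes(spec, -1, -2)


audio = np.random.default_rng(0).standard_normal((4, 2, 1024)).astype(np.float32)
spec = magnitude_spectrogram(audio)
print(spec.shape)  # (4, 2, 129, 7)
```

The batch and channel axes pass through untouched, so the waveform and the spectrogram agree on where the channels live.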
Thank you for your insights :) I haven't actually done much computer vision in pytorch, so I was probably wrong in my previous comment. Btw, Keras supports both channels_first and channels_last when running/training computer vision models. It's a config parameter one can set.
According to this docs page, this will also be the case for PyTorch. So it's a choice that we have to make.
One advantage of channels last is that it's easy to write a 2-channel spectrogram to PNG, since it's already in a compatible format 😄
Example:
If we don't want to make this decision, I guess an "easy" way out would be to support both (have it as a parameter), and permute axes when needed.
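A rough sketch of what "support both and permute when needed" could look like, assuming transforms work on channels-first data internally. The `layout` parameter and the helper names are hypothetical and not part of any existing API.

```python
import numpy as np


def to_channels_first(spec, layout):
    """Return a (batch, channels, freqs, time) view of `spec`."""
    if layout == "channels_first":
        return spec
    if layout == "channels_last":
        # (batch, freqs, time, channels) -> (batch, channels, freqs, time)
        return np.moveaxis(spec, -1, 1)
    raise ValueError(f"Unknown layout: {layout!r}")


def from_channels_first(spec, layout):
    """Inverse of to_channels_first."""
    if layout == "channels_first":
        return spec
    if layout == "channels_last":
        return np.moveaxis(spec, 1, -1)
    raise ValueError(f"Unknown layout: {layout!r}")


def apply_transform(spec, transform, layout="channels_first"):
    """Run a channels-first `transform` on data stored in either layout."""
    x = to_channels_first(spec, layout)
    y = transform(x)
    return from_channels_first(y, layout)


spec_channels_last = np.ones((2, 129, 7, 2), dtype=np.float32)
out = apply_transform(spec_channels_last, lambda x: x * 0.5, layout="channels_last")
print(out.shape)  # (2, 129, 7, 2)
```

The cost of this approach is an extra parameter on every transform plus a permutation on the way in and out for one of the two layouts.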
Channels first pros:
Channels last pros: easy to write a spectrogram to an image file (e.g. with Pillow)
Some sources regarding performance:
Is there anything else that is worth mentioning here?
Channels first pros: to apply an augmentation to each channel independently, you only need to reshape the channels into the batch dimension. With channels last, you have to permute first and then reshape.
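The difference can be sketched in NumPy as follows. The function names are hypothetical; the same idea applies to `reshape` and `permute` on PyTorch tensors.

```python
import numpy as np


def fold_channels_first(spec):
    """(batch, channels, freqs, time) -> (batch * channels, freqs, time)."""
    b, c, f, t = spec.shape
    # Batch and channels are adjacent, so a plain reshape is enough.
    return spec.reshape(b * c, f, t)


def fold_channels_last(spec):
    """(batch, freqs, time, channels) -> (batch * channels, freqs, time)."""
    b, f, t, c = spec.shape
    # Channels must first be moved next to the batch axis, then reshaped.
    return np.moveaxis(spec, -1, 1).reshape(b * c, f, t)


rng = np.random.default_rng(0)
spec_cf = rng.standard_normal((4, 2, 129, 7))
spec_cl = np.ascontiguousarray(np.moveaxis(spec_cf, 1, -1))

folded_cf = fold_channels_first(spec_cf)
folded_cl = fold_channels_last(spec_cl)
print(folded_cf.shape, folded_cl.shape)  # (8, 129, 7) (8, 129, 7)
```

Both routes produce the same per-channel items, but for contiguous channels-first data the reshape is just a view, whereas the channels-last route needs the extra permutation step.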
Another pro of channels first: It's consistent with waveforms having channels first
So I'm going to go ahead and call the shot (at least in the short term): Support channels first in spectrogram transforms.
Since multichannel audio is already a first-class citizen in torch-audiomentations, I'll close this issue now
There are several ways of representing multichannel audio:
Channels last
shape like [batch_size, num_samples, num_channels]
Samples last
shape like [batch_size, num_channels, num_samples]
I would choose samples last based on this information
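For reference, the two waveform layouts listed above can be compared with a small NumPy sketch. It illustrates one property that differs between them (memory contiguity of a single channel) and is not necessarily the information the choice above was based on; the variable names are for illustration only.

```python
import numpy as np

batch_size, num_channels, num_samples = 2, 2, 8

# Samples last: [batch_size, num_channels, num_samples]
samples_last = np.zeros((batch_size, num_channels, num_samples), dtype=np.float32)
# Channels last: [batch_size, num_samples, num_channels]
channels_last = np.zeros((batch_size, num_samples, num_channels), dtype=np.float32)

# Pick the first channel of the first batch item in each layout.
channel_a = samples_last[0, 0]      # consecutive float32 values in memory
channel_b = channels_last[0, :, 0]  # interleaved with the other channel

print(channel_a.strides, channel_a.flags["C_CONTIGUOUS"])  # (4,) True
print(channel_b.strides, channel_b.flags["C_CONTIGUOUS"])  # (8,) False

# Converting between the two layouts is a single axis swap.
converted = np.transpose(channels_last, (0, 2, 1))
print(converted.shape)  # (2, 2, 8)
```

With samples last, each channel is a contiguous 1D signal, which is the form that 1D operations along the time axis usually expect.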