asteroid-team / torch-audiomentations

Fast audio data augmentation in PyTorch. Inspired by audiomentations. Useful for deep learning.
MIT License

Support multichannel audio #2

Closed: iver56 closed this issue 3 years ago

iver56 commented 4 years ago

There are several ways of representing multichannel audio:

Channels last

shape like [batch_size, num_samples, num_channels]

Samples last

shape like [batch_size, num_channels, num_samples]


I would choose samples last based on this information
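To make the two layouts concrete, here is a minimal sketch of converting between them with `permute`. The sizes are arbitrary illustration values, not taken from the library.

```python
import torch

batch_size, num_channels, num_samples = 4, 2, 16000

# Samples last: [batch_size, num_channels, num_samples]
samples_last = torch.randn(batch_size, num_channels, num_samples)

# Channels last: [batch_size, num_samples, num_channels]
channels_last = samples_last.permute(0, 2, 1)

# permute only reorders the axes, so converting back recovers the same values
round_trip = channels_last.permute(0, 2, 1)
```

For reference, `torch.nn.Conv1d` takes input shaped (batch, channels, length), which matches the samples-last layout.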

iver56 commented 4 years ago

How should we represent multichannel audio in spectrograms, though? I would suggest channels last, since that is very common in computer vision and is compatible with the way conv2d is implemented in PyTorch and TensorFlow.

mpariente commented 4 years ago

For waveforms, I'd prefer (batch, channels, samples)

I would also suggest channels first for the multichannel spectrograms:

audio : (batch, channels, samples) -> spec-like : (batch, channels, freqs, time)

This would at least be consistent with Asteroid's filterbank API and with the view of time-frequency (TF) transforms as 1D convolutions. CMIIW, but Conv2d is channels first by default in PyTorch.
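A rough sketch of that mapping, using `torch.stft` as a stand-in for the time-frequency transform (Asteroid's filterbank API is not used here, and `n_fft` and `hop_length` are arbitrary illustration values):

```python
import torch

batch, channels, samples = 4, 2, 16000
n_fft, hop_length = 512, 128  # illustrative values, not from the thread

audio = torch.randn(batch, channels, samples)

# torch.stft accepts 1D or 2D input, so fold channels into the batch dimension
flat = audio.reshape(batch * channels, samples)
stft = torch.stft(
    flat,
    n_fft=n_fft,
    hop_length=hop_length,
    window=torch.hann_window(n_fft),
    return_complex=True,
)

# Magnitude spectrogram, unfolded back to (batch, channels, freqs, time)
spec = stft.abs().reshape(batch, channels, stft.shape[-2], stft.shape[-1])

# Conv2d consumes (batch, channels, height, width), i.e. channels first
conv = torch.nn.Conv2d(in_channels=channels, out_channels=8, kernel_size=3, padding=1)
out = conv(spec)
```

With this layout the spectrogram can be passed to `Conv2d` without any permute.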

iver56 commented 4 years ago

Thank you for your insights :) I haven't actually done much computer vision in PyTorch, so I was probably wrong in my previous comment. Btw, Keras supports both channels_first and channels_last when running/training computer vision models. It's a config parameter one can set.

mpariente commented 4 years ago

According to this docs page, this will also be the case for PyTorch. So it's a choice that we have to make.
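The linked page is not reproduced here. Assuming it refers to PyTorch's `channels_last` memory format, a small sketch of what that looks like: the logical shape stays (batch, channels, height, width) and only the strides change.

```python
import torch

# Logical shape is (batch, channels, height, width) in both cases
x = torch.randn(2, 3, 4, 5)
x_cl = x.to(memory_format=torch.channels_last)

print(x.shape, x.stride())        # default contiguous layout
print(x_cl.shape, x_cl.stride())  # same shape, channels vary fastest in memory
```

So under that assumption, indexing code does not change. The format only affects how the tensor is laid out in memory.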

iver56 commented 4 years ago

One advantage of channels last is that it's easy to write a 2-channel spectrogram to PNG, since it's already in a compatible format 😄

Example:

[Image: darthuizen_audio__wavs_2662890367_382eaf83bd_0]
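As a sketch of the extra step that channels first would need before handing a spectrogram to an image writer: the tensor below is a hypothetical 2-channel magnitude spectrogram, and the scaling to 8-bit is just one simple choice. No particular image library is assumed.

```python
import torch

# Hypothetical 2-channel magnitude spectrogram, channels first: (channels, freqs, time)
spec = torch.rand(2, 257, 126)

# Scale to 0..255 and move channels to the end: (freqs, time, channels)
scaled = spec / spec.max().clamp_min(1e-8)
image = (scaled * 255).round().to(torch.uint8).permute(1, 2, 0).contiguous()
```

Image libraries generally expect (height, width, channels), so a channels-last spectrogram would skip the `permute`.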

iver56 commented 4 years ago

If we don't want to make this decision, I guess an "easy" way out would be to support both (have it as a parameter), and permute axes when needed.
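A minimal sketch of what "support both" could look like. The helper `apply_channels_first` and its `channels_last` parameter are hypothetical names for illustration, not part of torch-audiomentations.

```python
import torch


def apply_channels_first(transform, x, channels_last=False):
    """Hypothetical helper: run a channels-first transform on either layout."""
    if channels_last:
        # (batch, freqs, time, channels) -> (batch, channels, freqs, time)
        x = x.permute(0, 3, 1, 2)
    y = transform(x)
    if channels_last:
        # Restore the caller's layout
        y = y.permute(0, 2, 3, 1)
    return y


def flip_time(spec):
    # Toy channels-first transform: reverse the time axis (last dimension)
    return spec.flip(-1)


spec_cf = torch.randn(4, 2, 257, 126)   # channels first
spec_cl = spec_cf.permute(0, 2, 3, 1)   # channels last

out_cf = apply_channels_first(flip_time, spec_cf)
out_cl = apply_channels_first(flip_time, spec_cl, channels_last=True)
```

The transforms themselves only ever see one layout, and the permutes happen at the boundary.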

iver56 commented 4 years ago

Channels first pros:

Channels last pros:

Some sources regarding performance:

Is there anything else that is worth mentioning here?

popcornell commented 4 years ago

Channels first pros: to apply an augmentation to each channel independently, you only need a reshape that folds the channels into the batch dimension. With channels last, you have to permute and then reshape.
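The difference can be sketched like this. `per_example_gain` is a toy augmentation made up for the example. It treats every item along dim 0 independently.

```python
import torch

batch, channels, freqs, time = 4, 2, 257, 126


def per_example_gain(x):
    # Toy augmentation: one random gain per item along dim 0
    gain = torch.rand(x.shape[0], 1, 1)
    return x * gain


# Channels first: a single reshape folds channels into the batch dimension
spec_cf = torch.randn(batch, channels, freqs, time)
folded_cf = spec_cf.reshape(batch * channels, freqs, time)
out_cf = per_example_gain(folded_cf).reshape(batch, channels, freqs, time)

# Channels last: permute to channels first, then reshape, then undo both
spec_cl = spec_cf.permute(0, 2, 3, 1).contiguous()  # (batch, freqs, time, channels)
folded_cl = spec_cl.permute(0, 3, 1, 2).reshape(batch * channels, freqs, time)
out_cl = (
    per_example_gain(folded_cl)
    .reshape(batch, channels, freqs, time)
    .permute(0, 2, 3, 1)
)
```

Both paths feed the same folded tensor to the augmentation. The channels-last path just pays for a permute on the way in and another on the way out.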

iver56 commented 3 years ago

Another pro of channels first: It's consistent with waveforms having channels first

So I'm going to go ahead and make the call (at least in the short term): Support channels first in spectrogram transforms.

Since multichannel audio is already a first-class citizen in torch-audiomentations, I'll close this issue now