Support multichannel audio

asteroid-team / torch-audiomentations

Fast audio data augmentation in PyTorch. Inspired by audiomentations. Useful for deep learning.

MIT License

Support multichannel audio #2

iver56 closed this issue 3 years ago

iver56 commented 4 years ago

There are several ways of representing multichannel audio:

Channels last

shape like [batch_size, num_samples, num_channels]

Common when loading/saving audio from/to wav

Samples last

shape like [batch_size, num_channels, num_samples]

Convenient for convolution. E.g. for STFT, the time dimension is commonly the last dimension (in torchaudio, torch-stft and asteroid's STFT), according to @mpariente
"I always prefer channel first because in case one wants to apply an operation to each channel independently can do straight reshaping. Otherwise you have to transpose then reshape." - @popcornell
Librosa uses this convention

I would choose samples last based on this information

iver56 commented 4 years ago

How should we represent multichannel audio in spectrograms though? I would suggest channels last, since that is very common in computer vision, and is compatible with the way conv2d is implemented in ~~pytorch~~ and tensorflow

mpariente commented 4 years ago

For waveforms, I'd prefer (batch, channels, samples)

I would also suggest channel first for the multichannel spectrograms:

audio : (batch, channels, samples) -> spec-like : (batch, channels, freqs, time)

This would at least be consistent with Asteroid's filterbank API and the vision of TF transforms as 1D convolutions. CMIIW but Conv2d is channel first by default in pytorch.
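A minimal sketch of the mapping above, assuming `torch.stft` as a stand-in for the STFT (Asteroid's filterbank API is not used here). The sizes, `n_fft` and `hop_length` are arbitrary illustration values. It also shows that `torch.nn.Conv2d` accepts the resulting channels-first tensor directly.

```python
import torch

batch, channels, samples = 4, 2, 16000
audio = torch.randn(batch, channels, samples)  # (batch, channels, samples)

n_fft, hop_length = 512, 128  # illustrative values, not taken from the thread
window = torch.hann_window(n_fft)

# torch.stft takes 1D or 2D input, so fold the channels into the batch dimension.
flat = audio.reshape(batch * channels, samples)
spec = torch.stft(
    flat, n_fft=n_fft, hop_length=hop_length, window=window, return_complex=True
)  # (batch * channels, freqs, time)

# Unfold back to (batch, channels, freqs, time).
magnitude = spec.abs().reshape(batch, channels, spec.shape[-2], spec.shape[-1])

# Conv2d expects (batch, channels, height, width), i.e. channels first.
conv = torch.nn.Conv2d(in_channels=channels, out_channels=8, kernel_size=3, padding=1)
features = conv(magnitude)

print(magnitude.shape, features.shape)
```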

iver56 commented 4 years ago

Thank you for your insights :) I haven't actually done much computer vision in pytorch, so I was probably wrong in my previous comment. Btw, Keras supports both channels_first and channels_last when running/training computer vision models. It's a config parameter one can set.

mpariente commented 4 years ago

According to this docs page, this will also be the case for PyTorch. So it's a choice that we have to make.
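The linked page is not preserved here. Assuming it refers to PyTorch's `torch.channels_last` memory format, the sketch below shows what that option does: the logical shape stays (N, C, H, W) and only the underlying memory layout (the strides) changes.

```python
import torch

x = torch.randn(2, 3, 8, 8)  # logical shape (N, C, H, W), default contiguous layout
y = x.to(memory_format=torch.channels_last)

# The logical shape and the values are unchanged.
print(y.shape)    # still (2, 3, 8, 8)
# Only the strides differ: the channel dimension now has stride 1.
print(x.stride(), y.stride())
print(y.is_contiguous(memory_format=torch.channels_last))
```

Under that assumption, code that indexes tensors as (N, C, H, W) keeps working either way, so the memory format and the axis convention of the API are separate choices.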

iver56 commented 4 years ago

One advantage of channels last is that it's easy to write a 2-channel spectrogram to PNG, since it's already in a compatible format 😄

Example: [image: darthuizen_audio__wavs_2662890367_382eaf83bd_0]

iver56 commented 4 years ago

If we don't want to make this decision, I guess an "easy" way out would be to support both (have it as a parameter), and permute axes when needed.
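A rough sketch of what "support both" could look like. The helper name `apply_channels_first`, the `channels_last` parameter and the toy `invert_polarity` transform are all hypothetical and are not part of torch-audiomentations. The idea is that transforms are written for one layout only and axes are permuted at the boundary.

```python
import torch


def apply_channels_first(transform, audio, channels_last=False):
    """Hypothetical wrapper: run a channels-first transform on either layout.

    audio is (batch, channels, samples), or (batch, samples, channels)
    when channels_last=True.
    """
    if channels_last:
        audio = audio.permute(0, 2, 1)  # -> (batch, channels, samples)
    out = transform(audio)
    if channels_last:
        out = out.permute(0, 2, 1).contiguous()  # back to the caller's layout
    return out


def invert_polarity(x):
    # Toy transform standing in for a real augmentation.
    return -x


channels_last_audio = torch.randn(4, 16000, 2)
result = apply_channels_first(invert_polarity, channels_last_audio, channels_last=True)
print(result.shape)
```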

iver56 commented 4 years ago

Channels first pros:

This is typically what you get out of conv1d-based STFT implementations in pytorch
GPU (cuDNN) without Nvidia tensor cores support (e.g. Nvidia Pascal and earlier): channels first is faster

Channels last pros:

CPU with AVX/SSE: channels last is faster
GPU (cuDNN) with Nvidia tensor cores support (e.g. new GPU generations like Nvidia Turing, Volta and Ampere): channels last is faster
Same shape as typical image formats (e.g. in Pillow)

Some sources regarding performance:

Is there anything else that is worth mentioning here?

popcornell commented 4 years ago

Channel first pros: you only need to reshape into the batch dimension to apply an augmentation to each channel independently. With channel last, you have to permute and then reshape.
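To make the difference concrete, here is a small sketch. The `peak_normalize` function is a hypothetical per-signal operation that expects input of shape (n, samples); it is not a torch-audiomentations transform.

```python
import torch

batch, channels, samples = 4, 2, 1000
channels_first = torch.randn(batch, channels, samples)
channels_last = channels_first.permute(0, 2, 1).contiguous()  # (batch, samples, channels)


def peak_normalize(x):
    """Hypothetical per-signal op: scale each row of (n, samples) to peak 1."""
    return x / x.abs().amax(dim=-1, keepdim=True)


# Channels first: a plain reshape folds channels into the batch dimension.
flat = channels_first.reshape(batch * channels, samples)
out_first = peak_normalize(flat).reshape(batch, channels, samples)

# Channels last: permute first, then reshape, then undo both afterwards.
flat = channels_last.permute(0, 2, 1).reshape(batch * channels, samples)
out_last = peak_normalize(flat).reshape(batch, channels, samples).permute(0, 2, 1)

print(out_first.shape, out_last.shape)
```

Both paths give the same values. The channels-last path just needs the extra permutes, and the reshape after a permute forces a copy.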

iver56 commented 3 years ago

Another pro of channels first: It's consistent with waveforms having channels first

So I'm going to go ahead and call the shot (at least in the short term): Support channels first in spectrogram transforms.

Since multichannel audio is already a first-class citizen in torch-audiomentations, I'll close this issue now