I just looked at how torchaudio handles 1-channel audio: it keeps the channel dimension, so I think it would make a lot of sense for us to move toward that as well.
This is the test that I looked at for this:
```python
import torchaudio

# torchaudio keeps the channel dimension...
tst, sr = torchaudio.load("data/clips/data/.cache/sh-all_20-200/common_voice_en_18752090.mp3-7650a11c561d129f5a343da8dc89affb/0.wav")
print(tst.shape)
# ...while AudioItem (fastai_audio) currently drops it
tst2 = AudioItem.open("data/clips/data/.cache/sh-all_20-200/common_voice_en_18752090.mp3-7650a11c561d129f5a343da8dc89affb/0.wav")
print(tst2.data.shape)
```
Output:
```
torch.Size([1, 124800])
torch.Size([124800])
```
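For illustration, a minimal sketch of how the two shapes could be reconciled so that mono and multichannel signals share a (channels, frames) layout; `ensure_channel_dim` is a hypothetical helper, not an existing library function:

```python
import torch

def ensure_channel_dim(sig: torch.Tensor) -> torch.Tensor:
    # Normalize to (channels, frames): a bare (frames,) signal
    # becomes (1, frames); multichannel tensors pass through.
    return sig.unsqueeze(0) if sig.dim() == 1 else sig

assert ensure_channel_dim(torch.zeros(124800)).shape == (1, 124800)
assert ensure_channel_dim(torch.zeros(2, 124800)).shape == (2, 124800)
```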
Definitely worth working on. We compressed to one channel solely as a short-term fix, with plans to support multichannel in the future; it was just never a priority, since most datasets are mono and compressing stereo to mono doesn't seem to cause much loss. The ideal is to handle multichannel audio as you said, in multiple channels, with a smooth user experience. Ideally the same code would serve mono and non-mono datasets, and, since one of the goals of the library is making things as easy as possible for people who aren't experts, it should work well even if the user doesn't know the difference between multichannel and mono.
Please keep us posted on your progress, and I'll keep it in mind when looking through the codebase.
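As context for the downmixing mentioned above, a minimal sketch of compressing channels by averaging; `downmix_to_mono` is a hypothetical name and may not match the library's exact implementation:

```python
import torch

def downmix_to_mono(sig: torch.Tensor) -> torch.Tensor:
    # sig: (channels, frames). Averaging the channels is the usual
    # lossy compression to mono; keepdim preserves the channel axis.
    return sig.mean(dim=0, keepdim=True)

stereo = torch.randn(2, 124800)
mono = downmix_to_mono(stereo)  # shape (1, 124800)
```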
Now that PR #35 is added we should be able to add in multichannel functionality. Changes needed will be:

- a `mono` boolean on the AudioConfig, for downmixing to one channel for those that still want to do so
- updates to the transforms (`tfm_freq_mask`, `tfm_time_mask`, `tfm_sg_roll` and so on); see the sketch below

Is this something you want to work on @kevinbird15? If so I'm happy to help provide support.
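For illustration, a hedged sketch of what a channel-agnostic frequency mask could look like; `tfm_freq_mask_nc` and its parameters are hypothetical, not the library's actual transform:

```python
import torch

def tfm_freq_mask_nc(spectro: torch.Tensor, max_width: int = 20,
                     mask_val: float = 0.0) -> torch.Tensor:
    # spectro: (channels, n_mels, time). Indexing the frequency axis
    # from the end makes the same code work for any channel count.
    # Assumes max_width < n_mels.
    n_mels = spectro.shape[-2]
    width = int(torch.randint(1, max_width + 1, (1,)))
    start = int(torch.randint(0, n_mels - width + 1, (1,)))
    out = spectro.clone()
    out[..., start:start + width, :] = mask_val
    return out
```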
I think I'll shadow @kevinbird15 on this one to get my feet wet :)
I believe DownmixMono will still work for our task. I think the function would be better replaced in the torchaudio 0.3.0 upgrade PR, since that is where it actually breaks.
The mono flag already exists and is called downmix. We will make sure that it acts as expected.
The transforms are where a lot of the work will be for this. I'm working through these changes now.
A current problem I am dealing with is multi-channel audio files. The current method for handling this in fastai_audio is to compress them all into one channel, but I think it could be much more valuable to keep the channels separate, especially when multiple people are having a conversation and are being recorded on different channels. How should fastai handle this without compressing everything down into one channel?
My thought is that each channel would be its own tensor, so a single-channel file would have shape (1, # of frames), and adding a second channel would make it (2, # of frames). This would let us easily keep everything on the same timeline and generalizes to N channels. How many places would this need to be handled? I will start working through this if there is interest. My goal is to not have different code for 1-channel vs N-channel processors, so I would want both handled the same way if possible. Interested in getting feedback from other people who are familiar with the library.
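As a sketch of the single code path described above (assuming a recent torchaudio, where transforms operate on the last dimension and preserve leading ones), the same spectrogram code already handles any channel count when the input keeps a (channels, frames) layout:

```python
import torch
import torchaudio

# One code path for mono and multichannel: torchaudio transforms
# work on the last (time) dimension and keep leading dims intact.
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000)

mono = torch.randn(1, 124800)
stereo = torch.randn(2, 124800)
print(to_mel(mono).shape)    # (1, n_mels, time)
print(to_mel(stereo).shape)  # (2, n_mels, time)
```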