I just looked at how torchaudio handles 1-channel audio: it keeps the channel dimension, so I think it would make a lot of sense for us to move toward that as well.
This is the test that I looked at for this:
```python
import torchaudio

# torchaudio keeps the channel dimension...
tst, sr = torchaudio.load("data/clips/data/.cache/sh-all_20-200/common_voice_en_18752090.mp3-7650a11c561d129f5a343da8dc89affb/0.wav")
print(tst.shape)
# ...while AudioItem (fastai_audio) currently drops it
tst2 = AudioItem.open("data/clips/data/.cache/sh-all_20-200/common_voice_en_18752090.mp3-7650a11c561d129f5a343da8dc89affb/0.wav")
print(tst2.data.shape)
```
Output:
```
torch.Size([1, 124800])
torch.Size([124800])
```
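For illustration, a minimal sketch of how the two shapes could be reconciled so that mono and multichannel signals share a (channels, frames) layout; `ensure_channel_dim` is a hypothetical helper, not an existing library function:

```python
import torch

def ensure_channel_dim(sig: torch.Tensor) -> torch.Tensor:
    # Normalize to (channels, frames): a bare (frames,) signal
    # becomes (1, frames); multichannel tensors pass through.
    return sig.unsqueeze(0) if sig.dim() == 1 else sig

assert ensure_channel_dim(torch.zeros(124800)).shape == (1, 124800)
assert ensure_channel_dim(torch.zeros(2, 124800)).shape == (2, 124800)
```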
Definitely worth working on. We compressed to one channel solely as a short-term fix, with plans to support multichannel in the future; it was just never a priority, since most datasets are mono and compressing stereo to mono doesn't seem to cause much loss. The ideal is to handle multichannel audio as you said, in multiple channels, with a smooth user experience. Ideally the same code would serve mono and non-mono datasets, and, since one of the goals of the library is making things as easy as possible for people who aren't experts, it should work well even if the user doesn't know the difference between multichannel and mono.
Please keep us posted on your progress, and I'll keep it in mind when looking through the codebase.
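As context for the downmixing mentioned above, a minimal sketch of compressing channels by averaging; `downmix_to_mono` is a hypothetical name and may not match the library's exact implementation:

```python
import torch

def downmix_to_mono(sig: torch.Tensor) -> torch.Tensor:
    # sig: (channels, frames). Averaging the channels is the usual
    # lossy compression to mono; keepdim preserves the channel axis.
    return sig.mean(dim=0, keepdim=True)

stereo = torch.randn(2, 124800)
mono = downmix_to_mono(stereo)  # shape (1, 124800)
```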
Now that PR #35 is added we should be able to add in multichannel functionality. Changes needed will be:

- a `mono` boolean on the AudioConfig, for downmixing to one channel for those that still want to do so
- updates to the transforms (`tfm_freq_mask`, `tfm_time_mask`, `tfm_sg_roll` and so on); see the sketch below

Is this something you want to work on @kevinbird15? If so I'm happy to help provide support.
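For illustration, a hedged sketch of what a channel-agnostic frequency mask could look like; `tfm_freq_mask_nc` and its parameters are hypothetical, not the library's actual transform:

```python
import torch

def tfm_freq_mask_nc(spectro: torch.Tensor, max_width: int = 20,
                     mask_val: float = 0.0) -> torch.Tensor:
    # spectro: (channels, n_mels, time). Indexing the frequency axis
    # from the end makes the same code work for any channel count.
    # Assumes max_width < n_mels.
    n_mels = spectro.shape[-2]
    width = int(torch.randint(1, max_width + 1, (1,)))
    start = int(torch.randint(0, n_mels - width + 1, (1,)))
    out = spectro.clone()
    out[..., start:start + width, :] = mask_val
    return out
```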
I think I'll shadow @kevinbird15 on this one to get my feet wet :)
I believe DownmixMono will still work for our task. I think the function would be better replaced in the torchaudio 0.3.0 upgrade PR, since that is where it actually breaks.
The mono flag already exists and is called downmix. We will make sure that it acts as expected.
The transforms are where a lot of the work will be for this. I'm working through these changes now.
A current problem I am dealing with is multi-channel audio files. The current method for handling this in fastai_audio is to compress them all into one channel, but I think it could be much more valuable to keep the channels separate, especially when multiple people are having a conversation and are being recorded on different channels. How should fastai handle this without compressing everything down into one channel?
My thought is that each channel would be its own tensor, so a single-channel file would have shape (1, # of frames), and adding a second channel would make it (2, # of frames). This would let us easily keep everything on the same timeline and generalizes to N channels. How many places would this need to be handled? I will start working through this if there is interest. My goal is to not have different code for 1-channel vs N-channel processors, so I would want both handled the same way if possible. Interested in getting feedback from other people who are familiar with the library.
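As a sketch of the single code path described above (assuming a recent torchaudio, where transforms operate on the last dimension and preserve leading ones), the same spectrogram code already handles any channel count when the input keeps a (channels, frames) layout:

```python
import torch
import torchaudio

# One code path for mono and multichannel: torchaudio transforms
# work on the last (time) dimension and keep leading dims intact.
to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000)

mono = torch.randn(1, 124800)
stereo = torch.randn(2, 124800)
print(to_mel(mono).shape)    # (1, n_mels, time)
print(to_mel(stereo).shape)  # (2, n_mels, time)
```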