YuanGongND / ssast

Code for the AAAI 2022 paper "SSAST: Self-Supervised Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Stereo audio #18

Closed: matthiasanderer closed this issue 1 year ago

matthiasanderer commented 1 year ago

Would this also work for stereo (i.e. 2-channel) audio?

I wonder how best to adapt the code for this, especially since the timm parts have already been trimmed down from 3 channels to 1 channel.

YuanGongND commented 1 year ago

Hi,

I think it is doable, even with our pretrained model.

  1. These are the places where we select the first channel; you need to change them.

https://github.com/YuanGongND/ssast/blob/a1a3eecb94731e226308a6812f2fbf268d789caf/src/dataloader.py#L112

https://github.com/YuanGongND/ssast/blob/a1a3eecb94731e226308a6812f2fbf268d789caf/src/dataloader.py#L116

  2. You also need to rework the fbank extraction so that its output has two channels.

https://github.com/YuanGongND/ssast/blob/a1a3eecb94731e226308a6812f2fbf268d789caf/src/dataloader.py#L126

This introduces a new channel dimension, which was squeezed out for single-channel fbanks, so you also need to take care of the input preprocessing on the model side:

https://github.com/YuanGongND/ssast/blob/a1a3eecb94731e226308a6812f2fbf268d789caf/src/models/ast_models.py#L436

Note that we do this in multiple forward passes; the line above is just one of them.

  3. Then you need to change the model's input size to take two channels instead of one.

https://github.com/YuanGongND/ssast/blob/a1a3eecb94731e226308a6812f2fbf268d789caf/src/models/ast_models.py#L130
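For readers following along, the three steps above could look roughly like the sketch below. It is not the repository's code: `stack_channel_features`, `to_model_input`, and `widen_patch_embed` are hypothetical helper names, the `(time, freq)` feature layout and `(batch, channels, freq, time)` model input are assumptions, and the 768-dimensional embedding with 16x16 patches is only an illustrative setting. Copying the pretrained single-channel kernel to both input channels and dividing by the channel count is one possible initialization, not something prescribed here.

```python
import torch
import torch.nn as nn


def stack_channel_features(per_channel_fbanks):
    """Stack per-channel (time, freq) features into (channels, time, freq)."""
    return torch.stack(list(per_channel_fbanks), dim=0)


def to_model_input(x):
    """Convert a batch of features to (batch, channels, freq, time).

    Accepts (batch, time, freq) for mono input or
    (batch, channels, time, freq) for multi-channel input.
    """
    if x.dim() == 3:
        x = x.unsqueeze(1)  # add the channel dim that mono features lack
    return x.transpose(2, 3)


def widen_patch_embed(proj, in_chans=2):
    """Build a Conv2d with `in_chans` inputs from a 1-channel patch embedding.

    The pretrained kernel is repeated for every input channel and divided by
    `in_chans`, so identical channels reproduce the mono model's output.
    """
    new_proj = nn.Conv2d(
        in_chans,
        proj.out_channels,
        kernel_size=proj.kernel_size,
        stride=proj.stride,
        padding=proj.padding,
        bias=proj.bias is not None,
    )
    with torch.no_grad():
        new_proj.weight.copy_(proj.weight.repeat(1, in_chans, 1, 1) / in_chans)
        if proj.bias is not None:
            new_proj.bias.copy_(proj.bias)
    return new_proj


# Random data standing in for real filterbank features (assumed shapes).
torch.manual_seed(0)
left = torch.randn(100, 128)    # (time, freq) for the left channel
right = torch.randn(100, 128)   # (time, freq) for the right channel
stereo = stack_channel_features([left, right])   # (2, 100, 128)
batch = to_model_input(stereo.unsqueeze(0))      # (1, 2, 128, 100)

# Stand-in for a pretrained single-channel patch embedding (assumed sizes).
mono_proj = nn.Conv2d(1, 768, kernel_size=(16, 16), stride=(16, 16))
stereo_proj = widen_patch_embed(mono_proj, in_chans=2)
patches = stereo_proj(batch)                     # (1, 768, 8, 6)
```

With this initialization the widened layer starts out equivalent to the mono layer applied to the channel average, so fine-tuning can then learn to treat the two channels differently.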

In short, it needs some careful changes to the code, but it is doable. I am not sure about your purpose, but it would be easier if you could add the two channels together into a single channel.
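As a minimal sketch of that simpler route, the two channels can be merged before any feature extraction. `downmix_to_mono` is a hypothetical helper, not part of the repository, and it assumes a `(channels, samples)` waveform tensor. It averages rather than sums the channels, which is the same mix up to a scale factor and keeps the amplitude in the original range.

```python
import torch


def downmix_to_mono(waveform):
    """Average all channels of a (channels, samples) waveform into (1, samples)."""
    if waveform.dim() != 2:
        raise ValueError("expected a (channels, samples) tensor")
    return waveform.mean(dim=0, keepdim=True)


# Tiny hand-made stereo signal: row 0 is left, row 1 is right.
stereo = torch.tensor([[0.5, -0.5, 1.0],
                       [0.5,  0.5, 0.0]])
mono = downmix_to_mono(stereo)   # shape (1, 3), values 0.5, 0.0, 0.5
```

The result keeps the `(1, samples)` layout of a mono file, so the existing single-channel pipeline and pretrained model should work without further changes.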

-Yuan