YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

MixUp Waveform Length Matching #39

Closed: aishwaryajadhav closed this issue 3 years ago

aishwaryajadhav commented 3 years ago

When mixup > 0 is specified, the code loads two audio files and, if they differ in length, pads or cuts waveform2 to match the shape of waveform1. There is a minor bug in the code that does this:

if waveform1.shape[1] != waveform2.shape[1]:
    if waveform1.shape[1] > waveform2.shape[1]:
        # padding
        temp_wav = torch.zeros((1,waveform1.shape[1]))
        temp_wav[0, 0:waveform2.shape[1]] = waveform2
        waveform2 = temp_wav
    else:
        # cutting
        waveform2 = waveform2[0, 0:waveform1.shape[1]]

In the above snippet, lines 4, 5, and 9 do not work when the first dimension of the waveforms is greater than 1, that is, for multi-channel audio. The following minor tweaks should help:

if waveform1.shape[1] != waveform2.shape[1]:
    if waveform1.shape[1] > waveform2.shape[1]:
        # padding
        temp_wav = torch.zeros(waveform1.shape)
        temp_wav[:, 0:waveform2.shape[1]] = waveform2
        waveform2 = temp_wav
    else:
        # cutting
        waveform2 = waveform2[:, 0:waveform1.shape[1]]
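
For anyone who wants to try the tweak in isolation, here is a minimal sketch that wraps it in a function and runs it on dummy stereo tensors. The function name `match_length` and the test tensors are made up for illustration and are not part of the AST code; the sketch also assumes both waveforms are shaped (channels, samples) and have the same number of channels.

```python
import torch


def match_length(waveform1: torch.Tensor, waveform2: torch.Tensor) -> torch.Tensor:
    """Pad or cut waveform2 along the time axis so it matches waveform1.

    Assumption: both tensors are shaped (channels, samples) and share
    the same channel count. `match_length` is a hypothetical helper.
    """
    if waveform1.shape[1] > waveform2.shape[1]:
        # padding: zero-fill the tail while keeping every channel
        temp_wav = torch.zeros(waveform1.shape)
        temp_wav[:, 0:waveform2.shape[1]] = waveform2
        return temp_wav
    # cutting (this is a no-op when the lengths already match)
    return waveform2[:, 0:waveform1.shape[1]]


# Dummy stereo inputs, purely illustrative
stereo_long = torch.ones(2, 8)
stereo_short = torch.ones(2, 5)

padded = match_length(stereo_long, stereo_short)  # short one is padded
cut = match_length(stereo_short, stereo_long)     # long one is cut
print(padded.shape, cut.shape)
```

In both cases the channel dimension is preserved, so the result can be mixed with waveform1 without relying on broadcasting.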

YuanGongND commented 3 years ago

Hi there,

You are correct that these lines of code only work for single-channel audio, but that is intentional: we want a single spectrogram from each waveform, so the code implicitly takes the first channel of the audio and discards the other channels. I will add a comment to this code.

Thanks for your suggestion.

-Yuan
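
If the single-channel behavior described above is what you want, one option is to make the channel selection explicit before any length matching. The following is only a sketch under the assumption that waveforms are shaped (channels, samples); the helper name `first_channel` is hypothetical and does not exist in the AST repository.

```python
import torch


def first_channel(waveform: torch.Tensor) -> torch.Tensor:
    """Keep only channel 0, preserving the 2-D (1, samples) layout.

    Hypothetical helper; assumes the input is shaped (channels, samples).
    """
    # Slicing with 0:1 (rather than indexing with 0) keeps the channel axis.
    return waveform[0:1, :]


# Dummy stereo input, purely illustrative
stereo = torch.arange(12, dtype=torch.float32).reshape(2, 6)
mono = first_channel(stereo)
print(mono.shape)
```

With both waveforms reduced to a single channel this way, the padding and cutting code only ever sees tensors whose first dimension is 1.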