YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

[Question] Question about padding operation #67

Open Mountchicken opened 2 years ago

Mountchicken commented 2 years ago

Hi, for audio of different lengths, the padding in the dataset is applied to the fbank. Why not pad the waveform first and then convert it to fbank?

YuanGongND commented 2 years ago

Hi Qing,

Thanks for the question.

I think either way works. The reason that I chose to pad fbank rather than waveform was just to explicitly control the input shape (on the time dimension) of the network.

Do you think padding the waveform would have an advantage?

-Yuan
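To make the point about controlling the input shape concrete, here is a minimal sketch of padding (or truncating) an fbank along the time dimension. It is an illustration only, not the repository's code: it uses NumPy arrays in place of real features, and the name `pad_or_truncate`, the target length of 1024 frames, and the 128 mel bins are assumptions chosen for the example.

```python
import numpy as np

def pad_or_truncate(fbank, target_length):
    """Force the time axis (axis 0) of a (frames, mel_bins) array to target_length."""
    n_frames = fbank.shape[0]
    pad = target_length - n_frames
    if pad > 0:
        # Append zero frames at the end; the frequency axis is left untouched.
        fbank = np.pad(fbank, ((0, pad), (0, 0)), mode="constant")
    elif pad < 0:
        # Too long: keep only the first target_length frames.
        fbank = fbank[:target_length, :]
    return fbank

# Two hypothetical clips of different durations (stand-ins for real fbank features).
short_clip = np.ones((998, 128))
long_clip = np.ones((1200, 128))

batch = np.stack([pad_or_truncate(x, 1024) for x in (short_clip, long_clip)])
print(batch.shape)  # (2, 1024, 128)
```

Because the length is fixed after feature extraction, the number of frames the network sees does not depend on how the window and shift of the feature extractor divide the waveform length.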

Mountchicken commented 2 years ago

Hi @YuanGongND, thanks for the prompt reply.

YuanGongND commented 2 years ago

> In CV, we resize and pad images of different sizes to form a batch. In speech, an fbank is similar to an image because it is a 2D tensor, so padding it is reasonable. My concern is that the fbank is generated from the waveform. Padding the waveform is like continuing to record a bit more sound after the recording has finished, whereas padding the fbank directly seems less intuitive.

I agree that padding the waveform could be better for the last element of the tensor. But in practice, for a sequence of hundreds of elements, the impact is minor.
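The difference between the two padding orders can be sketched with a toy feature. The example below is a simplification under stated assumptions: it uses per-frame log energy instead of a real fbank, a 400-sample window with a 160-sample shift (25 ms and 10 ms at 16 kHz, values not given in this thread), and a random 1000-sample signal. The helper `log_energy_frames` is hypothetical.

```python
import numpy as np

FRAME_LEN = 400  # assumed: 25 ms window at 16 kHz
HOP = 160        # assumed: 10 ms shift at 16 kHz

def log_energy_frames(wave, eps=1e-10):
    """Per-frame log energy; an incomplete final frame is dropped."""
    if len(wave) < FRAME_LEN:
        return np.zeros(0)
    n_frames = 1 + (len(wave) - FRAME_LEN) // HOP
    starts = HOP * np.arange(n_frames)
    idx = starts[:, None] + np.arange(FRAME_LEN)[None, :]
    return np.log(np.sum(wave[idx] ** 2, axis=1) + eps)

rng = np.random.default_rng(0)
wave = rng.standard_normal(1000)
target_frames = 10

# Option A: extract features first, then zero-pad the feature sequence.
feat = log_energy_frames(wave)
pad_feature = np.pad(feat, (0, target_frames - len(feat)))

# Option B: zero-pad the waveform first, then extract features.
needed = FRAME_LEN + HOP * (target_frames - 1)
pad_wave = log_energy_frames(np.pad(wave, (0, needed - len(wave))))

print(len(feat))                                   # 4
print(np.allclose(pad_feature[:4], pad_wave[:4]))  # True
print(pad_feature[4:])  # all zeros: the padding value chosen in feature space
print(pad_wave[4:])     # frames that still overlap the signal, then log(eps)
```

In this sketch the two options agree on every frame that lies fully inside the recording. They differ only in the few frames that straddle the end of the signal and in the value that represents the padded region, which is consistent with the impact being small for sequences of hundreds of frames.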

> I have another naive question: does it make sense to resize the fbank to a target size, as is done in CV?

I do not believe so. On the frequency dimension, all samples should already have the same size, so there is no need to resize. On the time dimension, resizing means time warping, which is usually undesired.

-Yuan
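The time-warping effect of resizing can be illustrated with a small sketch. It assumes a synthetic feature matrix of 50 frames by 4 bins containing one 10-frame event, and compares linear interpolation along the time axis with zero padding. The helpers `resize_time`, `pad_time`, and `active_frames` are hypothetical names for this example, and linear interpolation is only one of several possible resizing methods.

```python
import numpy as np

def resize_time(fbank, target_length):
    """Stretch the time axis to target_length by linear interpolation per bin."""
    n_frames, n_bins = fbank.shape
    src = np.arange(n_frames)
    dst = np.linspace(0, n_frames - 1, target_length)
    return np.stack(
        [np.interp(dst, src, fbank[:, b]) for b in range(n_bins)], axis=1
    )

def pad_time(fbank, target_length):
    """Append zero frames until the time axis reaches target_length."""
    pad = target_length - fbank.shape[0]
    return np.pad(fbank, ((0, pad), (0, 0)))

def active_frames(x):
    """Count frames in which the first bin is above half of the event level."""
    return int(np.sum(x[:, 0] > 0.5))

# Synthetic "event": frames 20..29 are active, everything else is silent.
fbank = np.zeros((50, 4))
fbank[20:30, :] = 1.0

padded = pad_time(fbank, 100)
resized = resize_time(fbank, 100)

print(padded.shape, resized.shape)  # (100, 4) (100, 4)
print(active_frames(padded))        # 10: the event keeps its duration
print(active_frames(resized))       # about twice as many frames: the event is stretched
```

Both outputs have the same shape, but resizing changes how long the event appears to last, while padding leaves its duration and position intact.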