YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Where is variable input length reflected in ASTModel? #12

Closed: ooobsidian closed this issue 2 years ago

ooobsidian commented 2 years ago

Hello Yuan, thank you for your excellent code. In your paper you mention that the AST model can support variable-length inputs, but the following part of the code doesn't seem to support them: https://github.com/YuanGongND/ast/blob/6f4e1931ced642c23d0b7aa3196a45043dec3c8d/src/models/ast_models.py#L188

How is this handled?

--obsidian

YuanGongND commented 2 years ago

That is a great question.

What we mean in the paper is that AST works for tasks with variable input lengths (i.e., you don't need to change the AST architecture for tasks with different input lengths). For a specific task, however, you do need to set a fixed input length when you initialize the AST model, since input_tdim is a required parameter; we use 1024 for AudioSet, 512 for ESC-50, etc. This fixed length determines the creation (or adaptation, if using the ImageNet-pretrained model) of the positional embedding. For a new task, you can set input_tdim to the maximum length in your dataset (or a length sufficient for most inputs) and zero-pad shorter audio clips.
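A minimal sketch of the padding step described above. The helper name is hypothetical; it assumes the usual (time x frequency) filterbank layout that AST consumes, and input_tdim is the fixed length chosen at model initialization:

```python
import torch

def pad_or_truncate(fbank: torch.Tensor, input_tdim: int) -> torch.Tensor:
    """Fit a (time, freq) spectrogram to a fixed number of time frames."""
    t = fbank.shape[0]
    if t < input_tdim:
        # zero-pad the end of the time axis for short clips
        fbank = torch.nn.functional.pad(fbank, (0, 0, 0, input_tdim - t))
    else:
        # cut clips that exceed the fixed length
        fbank = fbank[:input_tdim, :]
    return fbank

# e.g. a 512-frame, 128-mel clip padded up to AudioSet's 1024 frames
x = pad_or_truncate(torch.randn(512, 128), 1024)
print(x.shape)  # torch.Size([1024, 128])
```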

-Yuan

ooobsidian commented 2 years ago

Thank you for your answer!

-obsidian

YuanGongND commented 2 years ago

That said, with some code modification (mainly cutting/padding the positional embedding according to the input length), AST can support training/evaluation on mini-batches of different lengths, just like other Transformer models. This is mainly an efficiency consideration: you can first group inputs of similar length into mini-batches (e.g., if you have a dataset of 4 samples with lengths 5, 5, 9, and 10, you can group the first two as a mini-batch of length 5 and the last two as a mini-batch of length 10) and then feed the mini-batches to AST. You still need to pad samples to the same length within each mini-batch, though. This would help if the inputs in your dataset have very different lengths.
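The bucketing idea above can be sketched as follows. This is an illustrative helper (the function name and grouping rule are not from the repo): sort samples by length, form mini-batches of neighbors, and zero-pad only up to the longest sample within each batch:

```python
import torch

def bucket_batches(samples, batch_size=2):
    """Group (time, freq) tensors of similar length into padded mini-batches."""
    # sort by time length so each mini-batch holds similar-length samples
    samples = sorted(samples, key=lambda s: s.shape[0])
    batches = []
    for i in range(0, len(samples), batch_size):
        group = samples[i:i + batch_size]
        t_max = max(s.shape[0] for s in group)
        # zero-pad each sample only up to the longest in its mini-batch
        padded = [torch.nn.functional.pad(s, (0, 0, 0, t_max - s.shape[0]))
                  for s in group]
        batches.append(torch.stack(padded))
    return batches

# the 4-sample example from above: lengths 5, 5, 9, 10 (128 mel bins each)
data = [torch.randn(t, 128) for t in (5, 5, 9, 10)]
batches = bucket_batches(data)
print([b.shape for b in batches])
# [torch.Size([2, 5, 128]), torch.Size([2, 10, 128])]
```

Only the second batch pays any padding cost (one frame for the length-9 sample), instead of padding everything to length 10.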

Due to the workload, we don't plan to add this feature to this repo, though. I hope that is understandable.

ooobsidian commented 2 years ago

I see, thank you for your patience.