Closed ooobsidian closed 3 years ago
It is a great question.
What we mean in the paper is that AST works for tasks with variable input lengths (i.e., you don't need to change the AST architecture for tasks with different input lengths), but for a specific task, you need to set a fixed input length when you initialize the AST model, as input_tdim
is a required parameter. We use 1024 for AudioSet, 512 for ESC-50, etc. That fixed length is used to create (or adapt, if using an ImageNet-pretrained model) the positional embedding. For a new task, you can set input_tdim to the maximum length (or a length that is sufficient for most inputs) in your dataset and zero-pad shorter audios.
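The zero-padding step can be sketched as follows; `pad_or_trim` and `target_tdim` are illustrative names, not part of the AST repo:

```python
import torch
import torch.nn.functional as F

def pad_or_trim(fbank: torch.Tensor, target_tdim: int) -> torch.Tensor:
    """Zero-pad a (time, freq) filterbank to target_tdim frames, or trim it.

    Illustrative helper, not part of the AST repo.
    """
    t = fbank.shape[0]
    if t < target_tdim:
        # pad tuple covers trailing dims: (freq_left, freq_right, time_top, time_bottom)
        fbank = F.pad(fbank, (0, 0, 0, target_tdim - t))
    else:
        fbank = fbank[:target_tdim]
    return fbank

# e.g. a 400-frame clip padded to the 512 frames used for ESC-50
x = pad_or_trim(torch.randn(400, 128), 512)
print(x.shape)  # torch.Size([512, 128])
```

All padded frames are zeros, matching the "pad zeros for short audios" suggestion above.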
-Yuan
Thank you for your answer!
-obsidian
But I think with some code modification (mainly cutting/padding the positional embedding according to the input length), AST can support training/evaluation with mini-batches of different lengths, just like other Transformer models. Mainly for efficiency, you can first group inputs of similar length into mini-batches (e.g., if you have a dataset with 4 samples of lengths 5, 5, 9, and 10, you can group the first two samples into a mini-batch of length 5 and the last two into a mini-batch of length 10) and then feed the mini-batches to AST. You do need to pad samples to the same length within each mini-batch, though. This would help if input lengths vary widely in your dataset.
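The grouping-by-length idea above can be sketched like this; `make_batches` is a hypothetical helper, not part of the AST repo, and it assumes (time, freq) spectrogram tensors:

```python
import torch
import torch.nn.functional as F

def make_batches(samples, batch_size):
    """Sort variable-length (time, freq) tensors by length, chunk them into
    mini-batches, and zero-pad each chunk to its longest sample.

    Illustrative helper, not part of the AST repo.
    """
    samples = sorted(samples, key=lambda s: s.shape[0])
    batches = []
    for i in range(0, len(samples), batch_size):
        chunk = samples[i:i + batch_size]
        max_t = max(s.shape[0] for s in chunk)
        # pad only along the time dimension so every sample in the chunk matches
        padded = [F.pad(s, (0, 0, 0, max_t - s.shape[0])) for s in chunk]
        batches.append(torch.stack(padded))
    return batches

# the 4-sample example from above: lengths 5, 5, 9, 10 with batch_size 2
samples = [torch.randn(t, 128) for t in (5, 10, 5, 9)]
for b in make_batches(samples, 2):
    print(b.shape)  # (2, 5, 128) then (2, 10, 128)
```

Sorting first keeps the padding within each mini-batch small (here the 9-frame sample is padded by only one frame); the positional embedding would still need to be cut/padded per batch as described above.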
Due to the workload, we don't plan to add this feature to this repo, though; I hope that is understandable.
I see, thank you for your patience.
Hello Yuan, thank you for your excellent code. In your paper you mentioned that the AST model can support variable-length inputs, but I noticed that the following part of the code doesn't seem to support variable-length input: https://github.com/YuanGongND/ast/blob/6f4e1931ced642c23d0b7aa3196a45043dec3c8d/src/models/ast_models.py#L188
How can the above problem be solved?
--obsidian