YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

seq2seq classification with AST #117

Open YSLCoat opened 7 months ago

YSLCoat commented 7 months ago

Hi!

Is it trivial to adapt the AST architecture to do sequence-to-sequence classification? My input data has a label for each audio sample, and my goal is to classify every sample in the data.

YuanGongND commented 6 months ago

Can you take a look at Figure 1 of this paper (https://arxiv.org/pdf/2305.10790.pdf)? It shows an example of mean-pooling over the frequency dimension to get a representation in temporal order. The code implementation is here:

https://github.com/YuanGongND/ltu/blob/c2d0723c9f31a54eb2c2b62c5cc030b25317dc6f/src/ltu/hf-dev/transformers-main/src/transformers/models/llama/modeling_llama.py#L668-L672

However, that code is for the "no-overlap" patch split; applying it to the "overlapped" patch split used in this repo requires some changes.
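
A minimal sketch of the idea, assuming the "no-overlap" patch split (e.g., 16x16 patches over a 128-mel spectrogram, giving 8 frequency patches per frame); the grid dimensions and token ordering below are assumptions, not values taken from this repo:

```python
import torch

def freq_mean_pool(patch_tokens: torch.Tensor, f_dim: int = 8, t_dim: int = 64) -> torch.Tensor:
    """Mean-pool AST patch embeddings over the frequency axis.

    patch_tokens: [B, f_dim * t_dim, D] patch embeddings (CLS/distillation tokens removed).
    Returns: [B, t_dim, D] embeddings in temporal order.
    """
    B, N, D = patch_tokens.shape
    assert N == f_dim * t_dim, "token count must match the assumed patch grid"
    # Assumes tokens are ordered row-major over a [frequency, time] grid
    # (time index varies fastest); verify this against your patch-embedding code.
    tokens = patch_tokens.reshape(B, f_dim, t_dim, D)
    return tokens.mean(dim=1)
```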

You can also check SSAST, which supports a naive temporal-order representation: https://github.com/YuanGongND/ssast

Once you have a temporal-order representation, you can do seq2seq tasks, e.g., add a CTC head on top of the temporal representations.
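
For example, a minimal sketch of a CTC head on top of the frame-level embeddings (the hidden size, class count, and blank index are placeholders):

```python
import torch
import torch.nn as nn

class CTCHead(nn.Module):
    """Project frame-level embeddings to class logits and compute the CTC loss."""

    def __init__(self, embed_dim: int = 768, num_classes: int = 10):
        super().__init__()
        # One extra output for the CTC blank symbol (index 0 here).
        self.proj = nn.Linear(embed_dim, num_classes + 1)
        self.ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

    def forward(self, frame_emb, targets, input_lengths, target_lengths):
        # frame_emb: [B, T, D] time-ordered embeddings (e.g., from freq_mean_pool above).
        # nn.CTCLoss expects log-probabilities of shape [T, B, C].
        log_probs = self.proj(frame_emb).log_softmax(dim=-1).transpose(0, 1)
        return self.ctc_loss(log_probs, targets, input_lengths, target_lengths)
```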

-Yuan

YSLCoat commented 6 months ago

I got the results I wanted by removing line 184 in https://github.com/YuanGongND/ast/blob/master/src/models/ast_models.py and setting t_stride = 1. I think that should give me a working seq2seq classification setup. I will take a look at the links you provided as well!
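
For reference, a minimal sketch of frame-wise classification on the un-pooled output, assuming the modified forward() now returns a time-ordered token sequence of shape [B, T, D]; all shapes and names below are illustrative placeholders, not values from the repo:

```python
import torch
import torch.nn as nn

# Placeholder shapes: batch size, number of frames, embedding dim, frame classes.
B, T, D, num_frame_classes = 2, 100, 768, 5
token_seq = torch.randn(B, T, D)                  # stand-in for the modified AST output
frame_labels = torch.randint(0, num_frame_classes, (B, T))

classifier = nn.Linear(D, num_frame_classes)
logits = classifier(token_seq)                    # [B, T, num_frame_classes]
loss = nn.functional.cross_entropy(
    logits.reshape(-1, num_frame_classes),        # treat every frame as one example
    frame_labels.reshape(-1),
)
```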