YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Temporal organization of tokens #46

Closed: MikeKras closed this issue 2 years ago

MikeKras commented 2 years ago

Hi!

To start with - great work with the model and thanks for sharing!

I have already run it for standard classification cases and it worked as expected. However, I now want to treat the network's outputs as a sequence organized along the time dimension. I have a few points/questions related to that:

  1. I noticed in your paper that you tried 128 x 2 input patches in your ablation studies. Do you have the weights saved, and would you be willing to share them? Despite the worse results, they might be useful in my case.
  2. You mentioned that the 128 x 2 model trained better when trained purely on AudioSet. Have you considered also pretraining on ImageNet with those parameters? Was the reason for not checking this computational complexity, or something else (e.g. you believe that it wouldn't train well on ordinary image data)?
  3. Do you see a way to use the majority of the network as it currently is (with 16 x 16 input patches) and add some layer (e.g. a Conv1D) on top to make it combine outputs corresponding to specific time frames? How would you approach this?

Thanks! Michał

YuanGongND commented 2 years ago

Hi Michał,

Thanks for the question (and for helping me answer another question).

First, regarding temporally ordered output: we have new work that uses audio self-supervised pretraining to replace ImageNet pretraining and supports temporally ordered output (in other words, it is ideal for frame-level audio representation extraction). We hope it can be released soon; I will keep you posted.

For your questions:

> I noticed in your paper that you tried 128 x 2 input patches in your ablation studies. Do you have the weights saved, and would you be willing to share them? Despite the worse results, they might be useful in my case.

Yes, but that model was trained only on the balanced (small) AudioSet and performs much worse than the square-patch-based AST. With the limited time I have, I plan to release the self-supervised AST mentioned above rather than this checkpoint (also, this checkpoint was trained with an old version of the code and cannot be used directly).

> You mentioned that the 128 x 2 model trained better when trained purely on AudioSet. Have you considered also pretraining on ImageNet with those parameters? Was the reason for not checking this computational complexity, or something else (e.g. you believe that it wouldn't train well on ordinary image data)?

The main reason is that ImageNet pretraining is quite expensive and complex.

> Do you see a way to use the majority of the network as it currently is (with 16 x 16 input patches) and add some layer (e.g. a Conv1D) on top to make it combine outputs corresponding to specific time frames? How would you approach this?

I think it is possible: you can simply re-organize the output, for example by averaging the patches that belong to the same time frame.

But the time resolution would be low. For example, if the original model uses 16x16 patches with an overlap of 6 in the time dimension, the patch stride is 16 - 6 = 10 frames; since the frame hop is 10ms, the time resolution after re-organizing the output is 10 x 10ms = 100ms. Again, we believe the self-supervised AST is a better choice, as it supports arbitrary patch sizes and shapes; the time resolution can be 10ms with a 1 x 128 patch.
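For concreteness, here is a minimal sketch (not from the repo) of the re-organization described above. It assumes the AST defaults of 128 mel bins, 1024 input frames, and 16x16 patches with a stride of 10 in both frequency and time, giving a 12 x 101 patch grid (1212 tokens); it also assumes you extract the per-patch transformer outputs before the model's mean pooling, with the leading [CLS]/distillation tokens removed, and that the patch grid is flattened frequency-major as in the Conv2d patch embedding. Verify the grid shape against your own configuration (the repo's `ASTModel.get_shape` computes it).

```python
import torch

def tokens_to_frame_sequence(patch_tokens, f_dim=12, t_dim=101):
    """Collapse AST patch tokens into a time-ordered sequence.

    patch_tokens: (batch, f_dim * t_dim, embed_dim) per-patch transformer
                  outputs with the [CLS]/distillation tokens already removed.
    Returns:      (batch, t_dim, embed_dim), one embedding per ~100ms step.
    """
    b, n, d = patch_tokens.shape
    assert n == f_dim * t_dim, "token count does not match the assumed patch grid"
    # Undo the frequency-major flattening of the (freq, time) patch grid,
    # then average the f_dim patches that share each time position.
    grid = patch_tokens.reshape(b, f_dim, t_dim, d)
    return grid.mean(dim=1)

# Hypothetical usage with random tokens standing in for model outputs:
tokens = torch.randn(1, 12 * 101, 768)
frame_seq = tokens_to_frame_sequence(tokens)  # shape: (1, 101, 768)
```

If plain averaging discards too much, the same (f_dim, t_dim) reshape is also the natural place to attach the Conv1D from question 3, applied along the time axis.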

-Yuan

MikeKras commented 2 years ago

Hi Yuan!

> First, regarding temporally ordered output: we have new work that uses audio self-supervised pretraining to replace ImageNet pretraining and supports temporally ordered output (in other words, it is ideal for frame-level audio representation extraction). We hope it can be released soon; I will keep you posted.

Sounds awesome! I'm definitely looking forward to it; it should be very useful. By the way, if there is something I could assist with, I'd be happy to.

> I think it is possible: you can simply re-organize the output, for example by averaging the patches that belong to the same time frame. But the time resolution would be low. For example, if the original model uses 16x16 patches with an overlap of 6 in the time dimension, the patch stride is 16 - 6 = 10 frames; since the frame hop is 10ms, the time resolution after re-organizing the output is 10 x 10ms = 100ms. Again, we believe the self-supervised AST is a better choice, as it supports arbitrary patch sizes and shapes; the time resolution can be 10ms with a 1 x 128 patch.

You're right, though I expect such a resolution to be sufficient for my current needs. I will consider postponing the implementation of this approach for now; maybe by the time I return to it, the SSAST will be out.

Thanks for the answers! Michał