YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Using pretrained model for embeddings extraction with audio input samples of different durations. #101

Open sreenivasaupadhyaya opened 1 year ago

sreenivasaupadhyaya commented 1 year ago

Hi @YuanGongND ,

Thank you for the great work. I am trying to use the AST model to extract embeddings for 1 s audio events. To begin with, I started playing around with the https://github.com/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb notebook and replaced the audio file with my own 1 s clip. I see that the model expects 1024 input_tdim; since my event is shorter, the rest of the input is zero-padded. Does this give optimal embeddings?
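For context, this is roughly how I prepare the input (a sketch following the demo notebook; the fbank parameters are copied from the repo's dataloader, `my_1s_clip.wav` is just a placeholder, and the demo's normalization is omitted for brevity):

```python
import torch
import torchaudio

# load a ~1 s mono 16 kHz clip (placeholder path) and compute a 128-bin log-mel filterbank
waveform, sr = torchaudio.load('my_1s_clip.wav')
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, htk_compat=True, sample_frequency=sr, use_energy=False,
    window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)

# zero-pad (or trim) along time to the 1024 frames the pretrained model expects
target_length = 1024
p = target_length - fbank.shape[0]            # roughly 900+ frames of padding for a 1 s clip
fbank = torch.nn.functional.pad(fbank, (0, 0, 0, p)) if p > 0 else fbank[:target_length]
print(fbank.shape)                            # torch.Size([1024, 128])
```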

Do you suggest any modifications? My goal is to do linear probing.

Thanks in advance. Regards, Srini

YuanGongND commented 1 year ago

hi there,

> ... started playing around with the https://github.com/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb notebook and replaced the audio file with my own 1 s clip.

Yes, this is recommended. The script also outputs the sound class; does it predict it correctly?

> I see that the model expects 1024 input_tdim; since my event is shorter, the rest of the input is zero-padded. Does this give optimal embeddings?

The AST model uses two [cls] tokens (following DeiT), so it is expected to ignore the padded segment. Taking the average of the two [cls] token outputs as the embedding should therefore be close to optimal.

https://github.com/YuanGongND/ast/blob/31088be8a3f6ef96416145c4b8d43c81f99eba7a/src/models/ast_models.py#L184-L186

Otherwise, if you want to trim the input to 1 s, that is also doable; please see our speechcommands recipe at https://github.com/YuanGongND/ast/tree/master/egs/speechcommands. Some effort is needed. If the Colab script predicts the sound class correctly, it is probably safe to just take the average of the two [cls] tokens as the embedding, which should be very easy. Just let the model return after this line:

https://github.com/YuanGongND/ast/blob/31088be8a3f6ef96416145c4b8d43c81f99eba7a/src/models/ast_models.py#L184
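If you prefer not to edit ast_models.py, a rough alternative (a sketch; it assumes the classification head is the `mlp_head` attribute, as in the current ast_models.py, and that forward() applies it right after the [cls] averaging) is to swap the head for an identity so forward() returns that averaged [cls] embedding directly:

```python
import torch
import torch.nn as nn
from src.models.ast_models import ASTModel   # run from the repo root; adjust the import to your setup

input_tdim = 1024
# audioset_pretrain=True downloads the AudioSet-pretrained checkpoint
model = ASTModel(label_dim=527, input_tdim=input_tdim,
                 imagenet_pretrain=True, audioset_pretrain=True)
model.mlp_head = nn.Identity()                # forward() now returns the pre-head embedding
model.eval()

with torch.no_grad():
    fbank = torch.zeros(1, input_tdim, 128)   # [batch, time_frames, mel_bins], zero-padded to 1024
    emb = model(fbank)
print(emb.shape)                              # torch.Size([1, 768]) for the base model
```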

-Yuan

sreenivasaupadhyaya commented 1 year ago

Thanks for the comments.

  1. In this framework, does it mean the output embedding size will be the same whether the event length is 1 s or 10 s?
  2. Is there a way to get frame-wise embeddings instead? For example, a 10 s clip would have 10x more frames than a 1 s clip. That way we could build a more efficient downstream classifier.

Regards, Srini

YuanGongND commented 1 year ago

> In this framework, does it mean the output embedding size will be the same whether the event length is 1 s or 10 s?

https://github.com/YuanGongND/ast/blob/31088be8a3f6ef96416145c4b8d43c81f99eba7a/src/models/ast_models.py#L184

Before this line, x has shape [batch_size, sequence_length, feature_dimension], where sequence_length is the number of patches. 1 s vs. 10 s only makes a difference for sequence_length. So if you use the [cls] tokens via x = (x[:, 0] + x[:, 1]) / 2, there is no difference between 1 s and 10 s. Also, if you choose to pad 1 s to 10 s, then there is no difference in sequence_length either.
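For illustration, a quick patch count (my own back-of-envelope numbers, assuming the default 16x16 patches with stride 10 on 128 mel bins; the 1212 figure for 1024 frames matches the paper's 10.24 s AudioSet setting):

```python
# count sliding 16x16 patches with stride 10 over a (mel_bins x frames) spectrogram
def num_patches(frames, mel_bins=128, patch=16, stride=10):
    f = (mel_bins - patch) // stride + 1   # patches along frequency
    t = (frames - patch) // stride + 1     # patches along time
    return f, t, f * t

print(num_patches(1024))   # (12, 101, 1212) -> 10.24 s padded input
print(num_patches(100))    # (12, 9, 108)    -> ~1 s input without padding
```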

> Is there a way to get frame-wise embeddings instead? For example, a 10 s clip would have 10x more frames than a 1 s clip. That way we could build a more efficient downstream classifier.

Instead of using the [cls] tokens, you can use the patch-wise tokens via x = x[:, 2:] (i.e., remove the two leading [cls] tokens). However, these are patch-wise, not frame-wise: you will still need to reshape them and mean-pool over the frequency axis to get frame-level embeddings, as sketched below.
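A sketch of what this could look like (my own illustration; it assumes the transformer output x has shape [batch, 2 + f_patches * t_patches, dim] and that patch_embed flattens the patch grid frequency-major, i.e. 12 frequency rows by 101 time columns in the default setting, so please verify the ordering against ast_models.py):

```python
import torch

def patch_tokens_to_frames(x, f_patches=12, t_patches=101):
    # drop the two leading [cls]/[dist] tokens, keep only the patch tokens
    patches = x[:, 2:]
    B, N, D = patches.shape
    assert N == f_patches * t_patches, "unexpected number of patch tokens"
    # recover the 2-D patch grid, then average over the frequency axis
    patches = patches.reshape(B, f_patches, t_patches, D)
    return patches.mean(dim=1)               # [B, t_patches, D] frame-level embeddings

# toy check with random "tokens" of ViT-base width
toy = torch.randn(2, 2 + 12 * 101, 768)
print(patch_tokens_to_frames(toy).shape)     # torch.Size([2, 101, 768])
```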

Note this applies only to AST; SSAST does not use [cls] tokens.

-Yuan

sreenivasaupadhyaya commented 12 months ago

Thanks for the clarification :)

sreenivasaupadhyaya commented 6 months ago

Hi @YuanGongND ,

My apologies for asking this question so late! I experimented based on your suggestion and things work well. I was curious about your statement:

"However, it is patch-wise, not frame-wise, you will still need to reshape and mean pool the frequency domain to get frame-level embedding." - I understand what you meant by reshaping and it comes from the breakdown of the input spectrogram into patches. But could you explain me in short what is "mean pool of frequency". For eg: if my spectrogram is split into 8 patches as shown in the AST paper. how will I mean pool the output embeddings.

Thanks in advance.