YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

One question regarding the linear projection of AST. #127

Closed by poult-lab 4 weeks ago

poult-lab commented 3 months ago

Dear Minister Gong,

I wanted to express my gratitude for your work on AST; it has truly been an inspiration to me. I can confidently say that AST has served as my enlightenment teacher. However, I do have a question regarding the linear projection aspect of AST.

In traditional methods such as ViT and DeiT, the image is divided into patches (typically 16x16), and each patch is then passed through a linear projection, often implemented as a fully connected layer (torch.nn.Linear). However, in AST's code (ast_models.py, line 29), I noticed that a convolutional layer (torch.nn.Conv2d) is used for the linear projection. All three models (ViT, DeiT, and AST) appear to refer to this step as the linear projection.

My question is: Does this imply that the linear projection can be either a CNN or a fully connected layer?

Please correct me if I have misunderstood anything.

YuanGongND commented 2 months ago

hi there,

https://github.com/YuanGongND/ast/blob/31088be8a3f6ef96416145c4b8d43c81f99eba7a/src/models/ast_models.py#L29

This is not just a linear projection: it splits the spectrogram into patches and then applies a linear projection to each patch (which is implemented using a convolutional layer).

-Yuan
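
For readers who want to check the point made in the reply, here is a minimal sketch (not taken from the AST repository) showing that a `torch.nn.Conv2d` whose kernel size equals its stride produces the same output as explicitly cutting the input into patches and applying a `torch.nn.Linear` to each flattened patch. The sizes (`patch`, `embed_dim`, the input shape) and all variable names are illustrative assumptions, and the linear layer's weights are copied from the convolution only to make the comparison exact.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative sizes (assumptions, not read from the AST code).
patch, in_chans, embed_dim = 16, 1, 64

# Patch embedding as a convolution: kernel size == stride, so patches do not overlap.
conv = nn.Conv2d(in_chans, embed_dim, kernel_size=patch, stride=patch)

# The same projection as a fully connected layer acting on flattened patches.
linear = nn.Linear(in_chans * patch * patch, embed_dim)
with torch.no_grad():
    # Conv weight has shape (D, C, P, P); flatten it to (D, C*P*P).
    linear.weight.copy_(conv.weight.reshape(embed_dim, -1))
    linear.bias.copy_(conv.bias)

# Dummy spectrogram-like input: (batch, channel, frequency, time).
x = torch.randn(2, in_chans, 128, 256)

# Path 1: the convolution splits and projects in one step -> (B, N, D).
out_conv = conv(x).flatten(2).transpose(1, 2)

# Path 2: split into patches explicitly, then project each one -> (B, N, D).
patches = F.unfold(x, kernel_size=patch, stride=patch).transpose(1, 2)  # (B, N, C*P*P)
out_linear = linear(patches)

print(out_conv.shape)
print(torch.allclose(out_conv, out_linear, atol=1e-4))
```

In this sketch the convolution is still a linear map applied to each patch; it simply fuses the patch-splitting step with the projection. The same comparison should also hold when the stride is smaller than the kernel size (overlapping patches), provided `F.unfold` is called with the same stride as the convolution.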