One question regarding the linear projection of AST.

Dear Minister Gong,

I wanted to express my gratitude for your work on AST; it has truly been an inspiration to me, and it was my first real introduction to this field. However, I do have a question regarding the linear projection step in AST.

In ViT and DeiT, the image is divided into patches (typically 16x16), and each patch is passed through a linear projection, often implemented as a fully connected layer (torch.nn.Linear). However, in AST's code (ast_models.py, line 29), I noticed that a convolutional layer (torch.nn.Conv2d) is used for this step. All three models (ViT, DeiT, and AST) nevertheless refer to it as the linear projection.
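To make the comparison concrete, here is a minimal sketch of the two variants I have in mind. It is my own illustration, not code taken from ast_models.py: the sizes (`patch`, `in_ch`, `embed_dim`) and the input shape are arbitrary assumptions. It copies the weights of a `torch.nn.Conv2d` whose kernel size equals its stride into a `torch.nn.Linear` and checks whether the two produce the same patch embeddings.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only (assumptions, not AST's actual configuration).
patch, in_ch, embed_dim = 16, 1, 8

# Variant 1: convolution whose kernel covers exactly one patch.
conv = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=patch)

# Variant 2: fully connected layer applied to each flattened patch.
linear = nn.Linear(in_ch * patch * patch, embed_dim)

# Share parameters: (embed_dim, in_ch, patch, patch) -> (embed_dim, in_ch*patch*patch)
with torch.no_grad():
    linear.weight.copy_(conv.weight.reshape(embed_dim, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(2, in_ch, 32, 48)  # batch of 2, giving a 2 x 3 grid of patches

# Conv path: (B, D, H/p, W/p) -> (B, N, D)
out_conv = conv(x).flatten(2).transpose(1, 2)

# Linear path: extract patches explicitly, flatten each one, then project.
patches = nn.functional.unfold(x, kernel_size=patch, stride=patch)  # (B, C*p*p, N)
out_linear = linear(patches.transpose(1, 2))                        # (B, N, D)

# True if both variants compute the same embeddings (up to float error).
same = torch.allclose(out_conv, out_linear, atol=1e-5)
print(out_conv.shape, same)
```

If I understand correctly, the convolution here applies the same affine map to every patch, so with shared weights the two paths should agree. The same reasoning should hold when the stride is smaller than the kernel size (overlapping patches), as long as `unfold` is given the same stride.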

My question is: does this imply that the linear projection can be implemented as either a convolutional layer or a fully connected layer?

Please correct me if I have misunderstood anything.

YuanGongND / ast
