Alibaba-MIIL / STAM

Official implementation of "An Image is Worth 16x16 Words, What is a Video Worth?" (2021 paper)
Apache License 2.0

Linear Projection #5

Closed ShuvenduRoy closed 3 years ago

ShuvenduRoy commented 3 years ago

Dear researchers,

Thank you for this great work!

I am confused about the linear projection. The paper states, "We design a convolution-free model that is fully based on self-attention blocks for the spatiotemporal domain." So I was expecting no Conv block in the implementation, but I see a Conv2D in the linear projection: https://github.com/Alibaba-MIIL/STAM/blob/master/src/models/transformer_model.py#L224

Can you provide some explanation on this?

ShuvenduRoy commented 3 years ago

Hi,

Just following up. I was hoping for an answer to the question above.

giladsharir commented 3 years ago

Hi, the conv operation in the patch embedding is equivalent to dividing the image into non-overlapping patches and applying a linear projection to each patch, since it is a convolution whose kernel size and stride both equal the patch size. Each patch is therefore touched exactly once, and no information is shared across patches, so no true convolutional (sliding-window) computation takes place.
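This equivalence can be checked numerically. The sketch below (not STAM's actual code; it assumes a 224x224 input and ViT-style patch size 16 for illustration) copies the Conv2d weights into an `nn.Linear` layer and shows that applying the linear layer to explicitly extracted patches reproduces the conv output:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
patch_size = 16   # assumed ViT-style patch size, for illustration
embed_dim = 768
in_chans = 3

# Patch embedding as a conv: kernel_size == stride == patch_size,
# so each patch is visited exactly once (no sliding-window overlap).
conv = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

# The same projection as a Linear layer over flattened patch pixels.
# Conv weight (embed_dim, in_chans, k, k) flattens to (embed_dim, in_chans*k*k).
linear = nn.Linear(in_chans * patch_size * patch_size, embed_dim)
with torch.no_grad():
    linear.weight.copy_(conv.weight.view(embed_dim, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(1, in_chans, 224, 224)

# Conv path: (1, 768, 14, 14) -> (1, 196, 768) sequence of patch tokens.
out_conv = conv(x).flatten(2).transpose(1, 2)

# Explicit path: cut the image into 16x16 patches, flatten, project.
patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# (1, 3, 14, 14, 16, 16) -> (1, 196, 3*16*16), channel-major like conv weights.
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, in_chans * patch_size * patch_size)
out_linear = linear(patches)

print(torch.allclose(out_conv, out_linear, atol=1e-5))
```

So the Conv2D is just an efficient implementation detail of "split into patches + linear projection", which is why the model can still be described as convolution-free.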