Closed ShuvenduRoy closed 3 years ago
Dear researchers,

Thank you for this great work!

I am confused about the linear projection. The paper states, "We design a convolution-free model that is fully based on self-attention blocks for the spatiotemporal domain," so I was expecting no Conv block in the implementation. However, I see a Conv2D in the linear projection: https://github.com/Alibaba-MIIL/STAM/blob/master/src/models/transformer_model.py#L224

Can you provide some explanation for this?

Hi,

I was still hoping for an answer on this.

Hi, the Conv operation in the patch embedding is equivalent to dividing the image into patches and applying a linear projection to each patch, since it is a convolution with stride equal to the patch size.
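The equivalence described above can be checked numerically. Below is a minimal PyTorch sketch (not the repo's code; the tensor names and sizes are illustrative): it copies the Conv2d weights into a Linear layer and shows that convolving with kernel size = stride = patch size gives the same result as cutting the image into patches, flattening each, and projecting linearly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
patch, dim = 4, 8
img = torch.randn(1, 3, 16, 16)  # (batch, channels, height, width)

# Conv2d with kernel_size = stride = patch size: each output position
# sees exactly one non-overlapping patch.
conv = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

# The same projection as an explicit Linear layer, reusing the conv weights.
linear = nn.Linear(3 * patch * patch, dim)
linear.weight.data = conv.weight.data.view(dim, -1)
linear.bias.data = conv.bias.data

# Path A: convolution, then flatten the spatial grid into a patch sequence.
out_conv = conv(img).flatten(2).transpose(1, 2)          # (1, 16, dim)

# Path B: cut the image into patches, flatten each, project linearly.
patches = img.unfold(2, patch, patch).unfold(3, patch, patch)  # (1, 3, 4, 4, p, p)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch * patch)
out_linear = linear(patches)                              # (1, 16, dim)

print(torch.allclose(out_conv, out_linear, atol=1e-6))    # True
```

So the Conv2D is only an efficient way to batch the per-patch linear projections; no patch ever mixes information with its neighbors, which is why the model can still be described as convolution-free in the sense the paper intends.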