ViTAE-Transformer / ViTAE-VSA

The official repo for [ECCV'22] "VSA: Learning Varied-Size Window Attention in Vision Transformers"
https://arxiv.org/abs/2204.08446

Is it possible the positional encoding rather than VSA works? #3

Open RebornForPower opened 2 years ago

RebornForPower commented 2 years ago

In your code file ViTAE-VSA\Image-Classification\vitaev2_vsa\NormalCell.py, L130: `self.pos = nn.Conv2d(dim, dim, window_size//2*2+1, 1, window_size//2, groups=dim, bias=True)`. Your window_size is 7, so the kernel of the self.pos convolution is 7×7 as well, which is larger than the kernels used by most positional encoding extractors.

So is it possible that the positional encoding, rather than VSA, is what is actually doing the work?
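For reference, a minimal sketch (PyTorch; `dim = 96` is just an assumed example value) comparing the kernel produced by the quoted line with a typical 3×3 depthwise-conv positional encoding such as the one used in CPVT-style conditional position encodings:

```python
import torch.nn as nn

dim, window_size = 96, 7  # dim is an assumed example; window_size = 7 as in the config

# line quoted from NormalCell.py L130
pos = nn.Conv2d(dim, dim, window_size // 2 * 2 + 1, 1,
                window_size // 2, groups=dim, bias=True)
print(pos.kernel_size)  # (7, 7) -- a 7x7 depthwise conv over the whole window

# a more common depthwise-conv positional encoding uses a 3x3 kernel
cpe = nn.Conv2d(dim, dim, 3, 1, 1, groups=dim, bias=True)
print(cpe.kernel_size)  # (3, 3)
```

With window_size = 7 the positional branch is a 7×7 depthwise convolution, so its receptive field already covers an entire attention window, which is the basis of the question above.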

RebornForPower commented 2 years ago

@Roger-QMZhang