In your code file `ViTAE-VSA\Image-Classification\vitaev2_vsa\NormalCell.py`, L130:

`self.pos = nn.Conv2d(dim, dim, window_size//2*2+1, 1, window_size//2, groups=dim, bias=True)`

Your `window_size` is 7, so the `self.pos` depth-wise convolution kernel is also 7, which is larger than what most positional-encoding extractors use. So is it possible that the positional encoding, rather than VSA, is doing the work?
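For reference, a minimal standalone sketch (not the authors' module; `dim = 64` and `window_size = 7` are assumed here) that reproduces only the quoted layer and checks what kernel size and padding it resolves to:

```python
import torch
import torch.nn as nn

dim, window_size = 64, 7  # window_size = 7 as assumed in the question

kernel = window_size // 2 * 2 + 1   # 7 // 2 * 2 + 1 = 7
padding = window_size // 2          # 7 // 2 = 3

# Depth-wise conv used as positional encoding in the quoted line
pos = nn.Conv2d(dim, dim, kernel, 1, padding, groups=dim, bias=True)

x = torch.randn(1, dim, 14, 14)
print(kernel, padding, pos(x).shape)  # 7 3 torch.Size([1, 64, 14, 14])
```

So with `window_size = 7` the positional-encoding conv has a 7×7 receptive field per token, which is the same spatial extent as the attention window itself.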