MASILab / 3DUX-Net


similarities between the weighted sum approach in self-attention and the convolution per-channel basis. #49

Open AN-AN-369 opened 11 months ago

AN-AN-369 commented 11 months ago

Your ideas are great! But I have a question. In the section "Volumetric Depth-wise Convolution with LKs", the paper states: "Inspired by the idea of depth-wise convolution, we have found similarities between the weighted sum approach in self-attention and the convolution per-channel basis." I did not find a clear explanation of this in the article. How should I understand this sentence?

leeh43 commented 10 months ago

Thank you for your interest in our work. Great question! In the Swin Transformer approach, the computation of self-attention within a window closely resembles convolution on a per-channel basis. For example, given a 7x7 window in Swin Transformer, the window is further divided into sub-windows to compute self-attention and capture finer-grained details. The results are then combined as a weighted sum, which parallels the depthwise convolution approach (performing convolution on each channel independently). That is the similarity between the Swin Transformer block and the convolution block.
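To make the analogy concrete, here is a minimal PyTorch sketch, not taken from the 3DUX-Net code (the tensor shapes, the 48-channel width, and the identity q/k/v projections are illustrative assumptions): a depthwise 3D convolution processes each channel independently via `groups=C`, while window self-attention produces each output token as a weighted sum of the value tokens in the window.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Depthwise 3D convolution: groups=C gives one kernel per channel,
# so every channel is convolved independently (per-channel basis).
x = torch.randn(1, 48, 8, 8, 8)                      # (B, C, D, H, W)
dwconv = nn.Conv3d(48, 48, kernel_size=7, padding=3, groups=48)
y_conv = dwconv(x)                                   # (1, 48, 8, 8, 8)

# Window self-attention (single head, one flattened 7x7x7 window):
# each output position is a weighted sum of the value tokens, with
# the weights produced dynamically from query-key similarity.
tokens = torch.randn(1, 343, 48)                     # (B, N, C), N = window size
q = k = v = tokens                                   # identity projections for brevity
attn = F.softmax(q @ k.transpose(-2, -1) / 48 ** 0.5, dim=-1)  # (1, N, N)
y_attn = attn @ v                                    # weighted sum over the window
```

The structural parallel is that both aggregate a local neighborhood per position; the main difference is that the depthwise kernel weights are static and shared across positions, whereas the attention weights are computed dynamically from the input.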