In the paper we only explored axial self-attention as a backbone, but one could certainly extend it beyond self-attention, e.g. to cross-attention that attends from one 2D map to another 2D map.
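To illustrate how axial attention can be used as a standalone attention mechanism (rather than a full backbone), here is a hypothetical minimal sketch in PyTorch: single-head attention applied along the height axis, then along the width axis. It omits the multi-head attention, positional encodings, and normalization used in the actual implementation.

```python
import torch
import torch.nn as nn

class AxialAttention1D(nn.Module):
    """Single-head self-attention along one axis (minimal sketch)."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, length, dim) -- one spatial axis flattened into `length`
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

class AxialBlock(nn.Module):
    """Axial attention: attend along height, then along width."""
    def __init__(self, dim):
        super().__init__()
        self.h_attn = AxialAttention1D(dim)
        self.w_attn = AxialAttention1D(dim)

    def forward(self, x):
        # x: (batch, height, width, dim)
        b, h, w, d = x.shape
        # Height axis: treat each column as a sequence of length h.
        x = self.h_attn(x.permute(0, 2, 1, 3).reshape(b * w, h, d))
        x = x.reshape(b, w, h, d).permute(0, 2, 1, 3)
        # Width axis: treat each row as a sequence of length w.
        x = self.w_attn(x.reshape(b * h, w, d)).reshape(b, h, w, d)
        return x

x = torch.randn(2, 8, 8, 16)
y = AxialBlock(16)(x)
print(y.shape)  # torch.Size([2, 8, 8, 16])
```

The same factorization would carry over to cross-attention by computing queries from one 2D map and keys/values from the other, axis by axis.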
The current code only supports global attention and requires the same train and eval resolution. In general, though, axial attention is not limited to a single input resolution: to handle different resolutions, one should use local axial attention with a fixed span (e.g. 65). In the paper, we used a span of 65x65 for the main panoptic segmentation results, which allowed us to do multi-scale inference at different input resolutions.
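The resolution-independence of local axial attention can be sketched as follows. This is a hypothetical single-head toy implementation (no positional encodings or multi-head logic, unlike the paper's model): each position attends only to a fixed-size window around it, so the same module handles sequences of any length.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAxialAttention1D(nn.Module):
    """Local self-attention along one axis with a fixed span (minimal sketch).

    The window size is independent of the input length, so train and
    eval resolutions may differ.
    """
    def __init__(self, dim, span=65):
        super().__init__()
        assert span % 2 == 1, "use an odd span so windows are centered"
        self.span = span
        self.qkv = nn.Linear(dim, dim * 3)
        self.scale = dim ** -0.5

    def forward(self, x):
        # x: (batch, length, dim); length may vary between train and eval
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        pad = self.span // 2
        # Gather a fixed-size window of keys/values around each position.
        k = F.pad(k, (0, 0, pad, pad)).unfold(1, self.span, 1)  # (b, n, d, span)
        v = F.pad(v, (0, 0, pad, pad)).unfold(1, self.span, 1)  # (b, n, d, span)
        attn = torch.softmax((q.unsqueeze(2) @ k).squeeze(2) * self.scale, dim=-1)  # (b, n, span)
        return (attn.unsqueeze(2) @ v.transpose(-2, -1)).squeeze(2)  # (b, n, d)

m = LocalAxialAttention1D(16, span=5)
out_a = m(torch.randn(1, 10, 16))
out_b = m(torch.randn(1, 33, 16))  # same module, different input length
print(out_a.shape, out_b.shape)
```

Applying this along height and then width gives local 2D axial attention; zero-padding at the borders is one simple choice for handling windows that extend past the feature map.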
Hello! I'd like to ask: if I use AxialAttention not as a backbone but only as an attention mechanism, how well does it work? Also, does the kernel size in AxialAttention depend on the size of the input feature map? If my input sizes differ between train and inference, so the feature map sizes also differ, will that cause problems? Hoping for an answer, thanks a lot!