VAN backbone performance?

Visual-Attention-Network / SegNeXt

Official Pytorch implementations for "SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation" (NeurIPS 2022)

Apache License 2.0

795 stars 85 forks source link

VAN backbone performance? #4

Closed iumyx2612 closed 2 years ago

iumyx2612 commented 2 years ago

VAN-Small + Light-Ham-D256 with 15.8GFlops and 13.8M Params achieves 45.2mIoU on ADE20K: here
MSCAN-S + Light-Ham-D512 with 16.0GFlops and 14.0M Params achieves 44.3mIoU on ADE20K

VAN-Base + Light-Ham with 34.4GFlops and 27.4M Params achieves 49.6mIoU on ADE20K: here
MSCAN-B + Light-Ham with 35.0GFlops and 28.0M Params achieves 48.5mIoU on ADE20K

What is the need of MSCAN backbone? The paper explain that "Though VAN has achieved great performance in image classification, it neglects the role of multi-scale feature aggregation during the network design, which is crucial for segmentation-like tasks", however that's vanilla VAN without Light-Ham decoder

MenghaoGuo commented 2 years ago

For Ham repo, it reports the multi-scale and flip testing result.

For fair comparison, it should compare 45.2 v.s. 45.8 and 49.6 v.s. 49.9. And for more models like Tiny and Large, SegNeXt also has an improvement.

If we test the single scale inference result, we can find SegNeXt has a more visible progress.

iumyx2612 commented 2 years ago

For Ham repo, it reports the multi-scale and flip testing result.

For fair comparison, it should compare 45.2 v.s. 45.8 and 49.6 v.s. 49.9. And for more models like Tiny and Large, SegNeXt also has an improvement.

If we test the single scale inference result, we can find SegNeXt has a more visible progress.

Thank you!
So MSCAN can leverage multi-scale feature through InceptionNet-like block while VAN can't right?

Please correct me if I'm wrong on this. I see LKA and MSCA is really different from normal proposed visual attention like SENet, BAM, CBAM,... LKA and MSCA multiply each single pixel in feature maps with a weight (a 3D feature maps (C, H, W) multiply with a 3D weight (C, H, W)) instead of normal 3D feature maps (C, H, W) multiply with 2D weight (C, 1, 1) or (1, H, W) like SENet, BAM, CBAM. Am I correct on this one?

MenghaoGuo commented 2 years ago

Yes.

VAN is a strong backbone for various vision tasks. I think VAN is the proudest work I have done so far.

About the differences between VAN and SegNeXt:

1) From MSCAN view, it introduces multi-scale information, which is important for dense prediction task. Besides, strip decomposition of large kernel convolution seems more friendly to classification and segmentation tasks.

2) From SegNeXt, it not only contains an encoder, but also introduces how to conduct a suitable decoder in detail. We have made a lot of attempts in the decoder stage to make SegNeXt effective and lightweight including the struture and algorithm (Figure 3 in SegNeXt). After some attempts, we choose current decoder, as you can see in the SegNeXt and Ham repo.

MenghaoGuo commented 2 years ago

Both LKA and MSCA have a 3D attention map like Residual Attention .

iumyx2612 commented 2 years ago

Both LKA and MSCA have a 3D attention map like Residual Attention .

Thank you for your fast response. I have never seen this idea before studying about Attention in CNN. I'll take a look into it!