Closed JiaquanYe closed 2 years ago
This is an interesting paper that gives a systematic understanding of why vision Transformers exhibit strong robustness against various corruptions. I have a question after reading it.
As far as I know, ConvNeXt is composed of ConvNeXt blocks, each of which includes a depthwise convolution and pointwise convolutions. This differs from the Vision Transformer architecture, which uses an MLP block to aggregate the information from the MHSA block. After reading the paper, my understanding is that FAN reweights channels in the MLP block.
So I am curious: how would we implement a FAN block in ConvNeXt?
Hi Jiaquan,
Thanks for your interesting question. You are right that CNNs typically handle spatial processing and channel aggregation within a single operator (unless you implement the convolution in terms of fold and unfold operators). In that case, the CSA module can be applied directly to the output of the ConvNeXt block, in a manner similar to SE attention.
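To make the suggestion above concrete, here is a minimal PyTorch sketch of gating the output of a ConvNeXt-style block with a channel attention module. It is only an illustration under stated assumptions: the gate is a plain SE-style module used as a stand-in, not the CSA module from the FAN code, and the names ToyConvNeXtBlock, ChannelGate, and GatedBlock are hypothetical and do not exist in either repository.

```python
import torch
import torch.nn as nn


class ToyConvNeXtBlock(nn.Module):
    """Simplified ConvNeXt-style block (hypothetical): depthwise 7x7 conv + pointwise MLP."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, 4 * dim)  # pointwise conv written as a Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)  # (B, C, H, W) -> (B, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)  # back to (B, C, H, W)
        return residual + x


class ChannelGate(nn.Module):
    """SE-style stand-in for a channel attention module (assumption, not FAN's CSA)."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        hidden = max(dim // reduction, 1)
        self.fc = nn.Sequential(
            nn.Linear(dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, dim),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Global average pool over H and W, then one weight per channel.
        weights = self.fc(x.mean(dim=(2, 3)))  # (B, C)
        return x * weights[:, :, None, None]


class GatedBlock(nn.Module):
    """Applies the channel gate to the output of the block."""

    def __init__(self, dim: int):
        super().__init__()
        self.block = ToyConvNeXtBlock(dim)
        self.gate = ChannelGate(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gate(self.block(x))


x = torch.randn(2, 16, 8, 8)
y = GatedBlock(16)(x)
print(y.shape)  # torch.Size([2, 16, 8, 8])
```

Here the gate is placed after the residual addition, which is a literal reading of "on the output of the ConvNeXt block". Gating only the residual branch, as SE-ResNet does, is another reasonable placement to try.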
Based on our observations, applying CSA to CNN-based models looks quite promising. If you try it and get some results, we would appreciate it if you could share them in this thread.
Best regards, Zhou Daquan
Hi, I have read the source code and noticed that you use an SE block in the MLP of FAN. ECA is mentioned in your paper but does not appear to be implemented in the source code. Does SE-MLP perform better than ECA-MLP?
Hi,
Thanks for your interest. Please note that the SEMlp block is only used when use_se is set to True. We only use it in the model named fan_small_12_p16_224_se_attn. For all other models, we use ECA as the default channel attention module, which is included in the TokenMixing class.
The detailed comparison between ECA and SE is shown in Table 6 of the paper, where ECA significantly outperforms SE attention. We believe this gain comes from the fact that ECA takes the spatial relationship into consideration.
I hope this answers your question.
fan_small_12_p16_224_se_attn
Yes, I have found it in the ChannelProcessing class.
You can refer to this link for the ChannelProcessing class: https://github.com/NVlabs/FAN/blob/6467085dbc6a410a616e17a36fdfa375d21770bc/models/fan.py#L432. No SE modules are used in that class.