Westlake-AI / MogaNet

[ICLR 2024] MogaNet: Efficient Multi-order Gated Aggregation Network
https://arxiv.org/abs/2211.03295
Apache License 2.0

What is "trivial interactions" mentioned in the paper? #8

Closed · iumyx2612 closed this issue 1 year ago

iumyx2612 commented 1 year ago

In the paper, the authors wrote "we propose FD(·) to dynamically exclude trivial interactions" and "By re-weighting the trivial interaction component Y − GAP(Y), FD(·) also increases feature diversities".

What exactly are these "trivial interactions"? And why does taking Y - GAP(Y) increase feature diversity?

Lupin1998 commented 1 year ago

Hi, @iumyx2612, thanks for your question and using MogaNet. Sorry for the late reply. Please refer to issue #6 for a detailed answer.

As for the first question, the trivial interactions are derived from the definition in Representation Bottleneck. Given an image of n patches, the m-order interactions are defined by Eq. 1 and Eq. 2 in Representation Bottleneck, where $0 \le m \le n-2$. There are two trivial conditions: (a) when the interaction pair i, j is the same patch, it degenerates to a 0-order interaction in which the patch interacts with itself, i.e., $\mathrm{Conv}_{1\times 1}(\cdot)$; (b) when all patches are global-average-pooled by GAP($\cdot$) into a single token, the result cannot be measured by the concept of m-order interaction (because the contextual set cannot be larger than $n-2$).
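
For reference, the multi-order interaction in Representation Bottleneck is, up to notation, defined as

$$\Delta f(i,j,S) = f(S \cup \{i,j\}) - f(S \cup \{i\}) - f(S \cup \{j\}) + f(S),$$

$$I^{(m)}(i,j) = \mathbb{E}_{S \subseteq N \setminus \{i,j\},\ |S|=m}\big[\Delta f(i,j,S)\big],$$

where $N$ denotes the set of all $n$ patches and $S$ is a contextual set of $m$ patches, which is why $m$ cannot exceed $n-2$.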

As for the second question, $Y-GAP(Y)$ denotes high-frequency components of the feature map $Y$. Therefore, the model can learn to reweight high frequencies to generate diverse features.
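
A minimal PyTorch sketch of this re-weighting idea (class and parameter names here are illustrative and simplified from the actual MogaNet code):

```python
import torch
import torch.nn as nn

class FeatureDecompose(nn.Module):
    """Illustrative sketch of FD(.): re-weight the complementary
    component Y - GAP(Y) with a learnable per-channel scale."""
    def __init__(self, channels: int):
        super().__init__()
        # gamma starts at zero, so the module is initially an identity
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        # GAP(Y): spatial mean, i.e., the DC component of each channel
        dc = y.mean(dim=(2, 3), keepdim=True)
        # Y + gamma * (Y - GAP(Y)): amplify or suppress the non-DC part
        return y + self.gamma * (y - dc)
```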

iumyx2612 commented 1 year ago

Thank you! I'll take a look at Representation Bottleneck for further understanding!

iumyx2612 commented 1 year ago

Hi, I understand the first answer. For the second answer, why does $Y - GAP(Y)$ denote high-frequency components?

Lupin1998 commented 1 year ago

Hello, @iumyx2612, sorry for the late reply. In an image or a feature map, the low-frequency components contribute around 90% of the energy in the spectrum, especially the direct-current (DC) component. Since $GAP(Y)$ represents the DC component of the feature map, $Y-GAP(Y)$ denotes the remaining frequency components. Because convolution operations are more likely to learn high-frequency features, we can regard $Y-GAP(Y)$ as the high-frequency components in comparison to the raw feature maps. I hope this explanation helps.
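
A quick numerical check of the DC-component claim (a toy sketch, not from the MogaNet codebase): subtracting the spatial mean zeroes out the zero-frequency bin of the 2-D FFT, leaving only the non-DC components.

```python
import torch

y = torch.randn(1, 8, 32, 32)              # (batch, channels, H, W)
gap = y.mean(dim=(2, 3), keepdim=True)     # GAP(Y): spatial mean per channel
residual = y - gap                         # Y - GAP(Y)

# The [0, 0] bin of the 2-D FFT is the DC term (sum over all pixels).
print(torch.fft.fft2(y)[..., 0, 0].abs().max())         # large for raw Y
print(torch.fft.fft2(residual)[..., 0, 0].abs().max())  # ~0 after subtraction
```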

iumyx2612 commented 1 year ago

Can you elaborate on the statement "the low-frequency components contribute around 90% of energy to the spectrum"?

To my understanding, whether a feature map contains more low-frequency or high-frequency components depends on the characteristics of the module that produced it:

Lupin1998 commented 1 year ago

Thanks for your question and suggestion. As you have pointed out, the feature maps of ConvNets tend to contain more high-frequency components as the network goes deeper (like Figure 2(a), ResNet, in How Do Vision Transformers Work). It is true that the low frequencies (especially the DC component) contribute around 90% of the spectrum energy for a raw image or an early layer, but this does not hold for the deep layers. Therefore, the explanation of the proposed $Y-GAP(Y)$ is only empirical and intended to aid comprehension.
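
To make the "90% of energy" statement concrete, here is a rough diagnostic (a sketch, not from the MogaNet codebase) that measures the fraction of 2-D spectral energy inside a small low-frequency band; for a natural image or an early-layer feature map this ratio is typically high, roughly consistent with the figure above, while deeper feature maps score lower.

```python
import torch

def low_freq_energy_ratio(y: torch.Tensor, radius: float = 0.1) -> torch.Tensor:
    """Fraction of 2-D spectral energy within `radius` (as a fraction
    of the Nyquist frequency) around the DC bin."""
    spec = torch.fft.fftshift(torch.fft.fft2(y), dim=(-2, -1))
    energy = spec.abs() ** 2
    h, w = y.shape[-2:]
    fy = torch.fft.fftshift(torch.fft.fftfreq(h))   # cycles/sample in [-0.5, 0.5)
    fx = torch.fft.fftshift(torch.fft.fftfreq(w))
    dist = (fy[:, None] ** 2 + fx[None, :] ** 2).sqrt()
    mask = dist <= radius * 0.5                      # low-frequency disk
    return (energy * mask).sum() / energy.sum()

# White noise has a flat spectrum, so its low-frequency ratio is small;
# a smooth natural image concentrates energy near DC and scores much higher.
print(low_freq_energy_ratio(torch.randn(1, 1, 64, 64)))
```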

iumyx2612 commented 1 year ago

I understand, thank you!!!