Westlake-AI / MogaNet

[ICLR 2024] MogaNet: Efficient Multi-order Gated Aggregation Network
https://arxiv.org/abs/2211.03295
Apache License 2.0

Why use Hadamard product in Spatial Aggregation? #18

Open jing-zhao9 opened 3 months ago

jing-zhao9 commented 3 months ago

[screenshot of the Spatial Aggregation block]

May I ask why concatenation is not used for feature aggregation in the Spatial Aggregation block?

Lupin1998 commented 3 months ago

Hi, @jing-zhao9, thanks for your question! This element-wise (Hadamard) product of two branches is one of the efficient designs first proposed in MogaNet, and it is also used in Mamba and its recently proposed variants. We call it gating, following GLU, and found it more powerful than alternatives such as addition or the concatenation you mentioned. You might find an intuitive explanation of why gating operations are effective and efficient in StarNet. Feel free to discuss if there are more questions.
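
For reference, here is a minimal sketch of the gating idea. It is not the exact MogaNet implementation; the module name, layer names, and kernel sizes are illustrative:

```python
import torch
import torch.nn as nn

class SimpleGatedAggregation(nn.Module):
    """Toy gated aggregation: a 1x1-conv gating branch modulates a
    depth-wise-conv context branch via an element-wise (Hadamard) product."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Conv2d(dim, dim, kernel_size=1)      # gating branch
        self.value = nn.Conv2d(dim, dim, kernel_size=5,
                               padding=2, groups=dim)       # context branch (DWConv)
        self.act = nn.SiLU()

    def forward(self, x):
        # The gate re-weights the context features per position and channel.
        # Concatenation would only stack channels and leave the mixing to a
        # later layer; the product performs the (multiplicative) interaction
        # directly, at no extra parameter cost.
        return self.act(self.gate(x)) * self.act(self.value(x))
```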

jing-zhao9 commented 3 months ago

Thank you for your careful explanation! I have another question: why do I encounter gradient explosion when I apply the Hadamard product proposed in MogaNet to my baseline model during training?

Lupin1998 commented 2 weeks ago

Sorry for the late reply. Gradient explosion can sometimes occur in MogaNet because of the gating branch in the Moga module. There are two possible remedies: (1) Check for NaN or Inf values during training; if gradient explosion occurs, resume training from the previous checkpoint. (2) Remove the SiLU in the branch with multiple DWConv layers. The two SiLU activation functions provide strong non-linearity with few extra parameters, but they increase the risk of instability, so you may need to trade off performance against training stability.
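
As a rough illustration of option (1), assuming a standard PyTorch training loop (`model`, `loss`, and `optimizer` below are placeholders for your own objects, not MogaNet APIs):

```python
import torch

def grads_are_finite(model: torch.nn.Module) -> bool:
    """Return False if any parameter gradient contains NaN or Inf."""
    for p in model.parameters():
        if p.grad is not None and not torch.isfinite(p.grad).all():
            return False
    return True

# Inside the training loop:
#   loss.backward()
#   if grads_are_finite(model):
#       optimizer.step()
#   else:
#       # Skip this update (or reload the previous checkpoint) so the
#       # exploded gradients never reach the weights.
#       optimizer.zero_grad()
```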