Closed LMMMEng closed 1 year ago
Thank you for your interest in our work!
Great question. In our experiments, we would not say that DCS is better than MLP. However, DCS can attain similar or slightly higher performance than MLP when the block depth is small (e.g., 2). As you can see from the code, we expand the channel size by 4 in DCS, which means there is still cross-channel communication during expansion, and we assume that redundant context may be extracted in each channel. MLP works slightly better in our experiments when we increase the block depth (e.g., 8 or 12), which is similar to what is seen in natural-image models (e.g., Swin Transformer, RepLKNet).
Therefore, to push further toward fewer model parameters while achieving similar performance, we prefer to use DCS.
Got it. Many thanks for your reply!
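To make the parameter argument concrete, here is a rough count for the two variants. This is only a sketch under assumptions: two 1x1 convolutions with a 4x expansion, bias enabled, and `groups=dim` for DCS versus `groups=1` for MLP. The helper names and the width `dim = 48` are hypothetical and not taken from the repository.

```python
def conv_params(in_ch, out_ch, groups=1, kernel_volume=1, bias=True):
    """Parameter count of a (grouped) convolution layer.

    The weight tensor has out_ch * (in_ch // groups) * kernel_volume entries,
    plus one bias term per output channel when bias is enabled.
    """
    assert in_ch % groups == 0 and out_ch % groups == 0
    weights = out_ch * (in_ch // groups) * kernel_volume
    return weights + (out_ch if bias else 0)


def block_params(dim, expansion=4, grouped=False):
    """Parameters of expand -> project, built from two 1x1 convolutions."""
    groups = dim if grouped else 1  # grouped=True mimics groups=dim (DCS-like)
    hidden = expansion * dim
    return conv_params(dim, hidden, groups) + conv_params(hidden, dim, groups)


dim = 48  # hypothetical channel width, chosen only for illustration
mlp = block_params(dim, grouped=False)
dcs = block_params(dim, grouped=True)
print(f"MLP-like: {mlp} parameters, DCS-like: {dcs} parameters")
```

Under these assumptions the dense variant grows as 8·dim² + 5·dim, while the grouped variant grows as 13·dim, so the gap widens with the channel width.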
I find DCS (or, as one could call it, stacked pointwise depthwise convolution) very surprising, because it does nothing even similar to an MLP. This kind of convolution is rarely used because of its limited computational function. It is just a simple single-element linear transformation that does not even need its own network layer (e.g., this kind of simple transformation can easily be folded into a normal convolution layer). I am also very interested in a comparison between "no-MLP with nl" and "DCS", keeping all the nonlinearity that the network has in the no-MLP condition. Could you present that? I think a lot of readers will have the same question as I do.
Thank you for your interest! Does the "nl" in "no-MLP with nl" refer to the non-linear activation? If so, I can definitely try this scenario and look into what is going on there.
Yes. I hypothesize that the performance gain of DCS over no-MLP mainly comes from the nonlinearity. If so, I guess the network parameters can be compressed further.
Got it, I will try this scenario afterwards and discuss the performance further here. Thank you very much for your suggestions on our work.
Thanks for your excellent work!
From the code, the difference between the UX-Net block and the ConvNeXt block is in the pwconv part, where the authors use a group convolution (group=dim). However, such an operation lacks cross-channel communication, so I don't understand why DCS is better than MLP when the number of channels is the same.
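To illustrate the point about channel mixing, here is a small standard-library sketch of a grouped 1x1 convolution applied at a single voxel. It assumes the layout described above (expand by 4 with group=dim, a GELU, then project back with group=dim). Biases are omitted, and the weights and function names are made up for the illustration rather than taken from the repository.

```python
import math


def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def grouped_pointwise(x, weight, groups):
    """1x1 grouped convolution at one voxel (no bias).

    x: list of in_ch values.
    weight: out_ch rows, each holding in_ch // groups coefficients.
    """
    in_ch, out_ch = len(x), len(weight)
    in_per, out_per = in_ch // groups, out_ch // groups
    y = []
    for o in range(out_ch):
        g = o // out_per  # group that this output channel belongs to
        xs = x[g * in_per:(g + 1) * in_per]  # only inputs from that group
        y.append(sum(w * v for w, v in zip(weight[o], xs)))
    return y


def dcs_like(x, w1, w2):
    """Expand -> GELU -> project, with groups equal to the channel count."""
    dim = len(x)
    hidden = [gelu(v) for v in grouped_pointwise(x, w1, groups=dim)]
    return grouped_pointwise(hidden, w2, groups=dim)


dim, expansion = 3, 4
w1 = [[0.1 * (o + 1)] for o in range(expansion * dim)]  # arbitrary weights
w2 = [[0.5, -0.25, 0.75, 1.0] for _ in range(dim)]      # arbitrary weights

base = dcs_like([1.0, 2.0, 3.0], w1, w2)
bumped = dcs_like([5.0, 2.0, 3.0], w1, w2)  # perturb channel 0 only
changed = [i for i in range(dim) if abs(base[i] - bumped[i]) > 1e-12]
print("output channels affected by changing input channel 0:", changed)
```

In this toy setup only output channel 0 responds when input channel 0 changes, so each channel passes through its own small nonlinear function and never sees the other channels. Whether the actual repository code behaves the same way depends on details not shown in this thread.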