MrBlankness / LightM-UNet

Pytorch implementation of "LightM-UNet: Mamba Assists in Lightweight UNet for Medical Image Segmentation"
https://arxiv.org/abs/2403.05246
Apache License 2.0

A question about the depthwise separable convolutional layer #2

Closed: dalaeyelids closed this issue 7 months ago

dalaeyelids commented 7 months ago

Excellent work! May I ask why a depthwise separable convolutional layer is used first in LightM-UNet?

dalaeyelids commented 7 months ago

[image attached]

MrBlankness commented 7 months ago

Thank you for your interest in and recognition of LightM-UNet. We employ depthwise convolution in LightM-UNet for several reasons. Existing visual theories suggest that the shallow layers of a network extract low-level image features, such as color and texture, which do not require complex global information to capture; CNNs are better suited to this task than architectures like Mamba or Transformers. However, as the network goes deeper and must extract higher-level features, such as semantic features, it needs to model long-range spatial dependencies, which CNNs, constrained by their limited receptive fields, struggle to do effectively.

To minimize the model's parameter count, we use depthwise convolution instead of standard (naive) convolution.
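
For intuition on the parameter savings, here is a minimal PyTorch sketch (illustrative only, not taken from the LightM-UNet source; the channel count and kernel size are assumed values) comparing a standard 3D convolution with a depthwise one obtained via the `groups` argument:

```python
import torch.nn as nn

in_channels, kernel_size = 64, 3  # assumed values for illustration

# Standard ("naive") convolution: every output channel mixes all input channels.
standard = nn.Conv3d(in_channels, in_channels, kernel_size, padding=1)

# Depthwise convolution: groups=in_channels gives each channel its own filter.
depthwise = nn.Conv3d(in_channels, in_channels, kernel_size, padding=1,
                      groups=in_channels)

def n_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

print(f"standard:  {n_params(standard):,} params")   # 64*64*27 + 64 = 110,656
print(f"depthwise: {n_params(depthwise):,} params")  # 64*1*27  + 64 =   1,792
```

With `groups=in_channels`, each channel is convolved by its own k³ filter, so the weight count drops from C·C·k³ to C·k³, i.e. by a factor of C.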

dalaeyelids commented 7 months ago

Thank you very much for your answer; it has been very helpful. May I ask whether there are any papers supporting the viewpoint that "the shallow layers of a network extract low-level image features, such as color and texture, which do not require complex global information to capture"?

MrBlankness commented 7 months ago

Perhaps you can refer to 'SwinIR: Image Restoration Using Swin Transformer' and related works.

dalaeyelids commented 7 months ago

Thank you very much for your reply.