

GhostNetV2: Enhance Cheap Operation with Long-Range Attention #36

Open 5g4s opened 1 year ago

5g4s commented 1 year ago

https://arxiv.org/abs/2211.12905

5g4s commented 1 year ago

The convolutional operation can only capture local information within a window region, which prevents performance from being further improved. Introducing self-attention into convolution can capture global information well, but it largely encumbers the actual inference speed.

5g4s commented 1 year ago

In this paper, we propose a hardware-friendly attention mechanism (dubbed DFC attention) and then present a new GhostNetV2 architecture for mobile applications.
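For intuition, here is a minimal PyTorch sketch of how a decoupled fully-connected (DFC) attention branch can be wired as a cheap gate over a feature map: long-range context is gathered separately along the horizontal and vertical axes with depthwise convolutions instead of a full attention map. The kernel size of 5, the 2x average-pool downsampling, and the class/argument names are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFCAttention(nn.Module):
    """Sketch of a hardware-friendly attention branch (assumed layout):
    horizontal and vertical depthwise convolutions aggregate long-range
    information, and the resulting map gates the main branch."""

    def __init__(self, channels, kernel_size=5):  # kernel_size=5 is an assumption
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            # "horizontal FC": 1 x K depthwise conv mixes information along the width
            nn.Conv2d(channels, channels, (1, kernel_size),
                      padding=(0, kernel_size // 2), groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            # "vertical FC": K x 1 depthwise conv mixes information along the height
            nn.Conv2d(channels, channels, (kernel_size, 1),
                      padding=(kernel_size // 2, 0), groups=channels, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        # compute the attention map on a 2x-downsampled copy to save latency,
        # then upsample it back and gate the input with a sigmoid
        att = F.avg_pool2d(x, kernel_size=2, stride=2)
        att = torch.sigmoid(self.gate(att))
        att = F.interpolate(att, size=x.shape[-2:], mode="nearest")
        return x * att

# usage: gate a 56x56 feature map with 64 channels
y = DFCAttention(64)(torch.randn(1, 64, 56, 56))
print(y.shape)  # torch.Size([1, 64, 56, 56])
```

The point of the decoupling is that each position only attends along its row and its column through depthwise convolutions, so the cost grows linearly with the feature map size rather than quadratically.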

5g4s commented 1 year ago

Problem

Convolution-based lightweight models are weak at modeling long-range dependencies, which limits further performance improvement.

Recently, transformer-like models have been introduced to computer vision, in which the self-attention module can capture global information. However, the typical self-attention module has quadratic complexity w.r.t. the feature map's size and is not computationally friendly. Moreover, plenty of feature splitting and reshaping operations are required to calculate the attention map. Though their theoretical complexity is negligible, these operations incur more memory usage and longer latency in practice.
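A small sketch of why the cost is quadratic: treating an H x W feature map as H*W tokens, the attention map alone has (H*W) x (H*W) entries. The single-head, projection-free attention below is a simplification for illustration, not any particular model's implementation.

```python
import torch

def naive_self_attention(x):
    """x: (B, H*W, C) tokens from an H x W feature map.
    The attention map is (H*W) x (H*W), so compute and memory grow
    quadratically with the spatial resolution."""
    q, k, v = x, x, x  # single head, no projections, for illustration only
    attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
    return attn @ v

for hw in (14 * 14, 28 * 28, 56 * 56):
    x = torch.randn(1, hw, 64)
    _ = naive_self_attention(x)
    # attention-map entries: 196^2 ~ 38k, 784^2 ~ 615k, 3136^2 ~ 9.8M
    print(hw, hw * hw)
```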

5g4s commented 1 year ago

A mainstream strategy for reducing attention's complexity is to split the image into multiple windows and perform the attention operation inside each window or across windows. For example, Swin Transformer [21] splits the original feature into multiple non-overlapping windows, and self-attention is calculated within each local window.
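The sketch below illustrates the window-partition idea (a simplified toy version, not Swin's actual implementation, which also merges windows back, shifts them between layers, and uses multi-head projections): attention is computed independently inside each window, so the cost becomes O(H*W * w^2 * C) instead of O((H*W)^2 * C).

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) feature map into non-overlapping windows of shape
    (num_windows*B, window_size*window_size, C). H and W are assumed to be
    divisible by window_size for simplicity."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

def windowed_self_attention(x, window_size=7):
    # attention is restricted to each window's window_size^2 tokens
    windows = window_partition(x, window_size)
    attn = torch.softmax(
        windows @ windows.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
    return attn @ windows  # still grouped per window in this toy version

x = torch.randn(1, 56, 56, 96)
print(windowed_self_attention(x).shape)  # (64, 49, 96): 64 windows of 7x7 tokens
```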

5g4s commented 1 year ago

For an intuitive understanding, we equip the GhostNet model with the self-attention used in MobileViT [23] and measure the latency on a Huawei P30 (Kirin 980 CPU) with the TFLite tool.