Pelee achieves 76.4% mAP (mean average precision) on PASCAL VOC2007 and 22.4 mAP on the MS COCO dataset at a speed of 23.6 FPS on iPhone 8 and 125 FPS on NVIDIA TX2.
The result on COCO outperforms YOLOv2 in terms of higher precision, 13.6 times lower computational cost, and 11.3 times smaller model size.
We propose a variant of the DenseNet (Huang et al., 2016a) architecture, called PeleeNet, for mobile devices.
Two-Way Dense Layer
Inspired by GoogLeNet.
Uses a 2-way dense layer to get different scales of receptive fields (sketched below).
One way: a single 3x3 kernel.
The other way: two stacked 3x3 convolutions, to learn visual patterns for large objects.
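A minimal PyTorch-style sketch of the idea, not the authors' code: both branches start from a 1x1 bottleneck, one branch applies a single 3x3 convolution and the other stacks two 3x3 convolutions, and the outputs are concatenated with the input (channel sizes are illustrative).

```python
import torch
import torch.nn as nn

class TwoWayDenseLayer(nn.Module):
    """Two-way dense layer sketch: branch 1 uses one 3x3 conv, branch 2 stacks
    two 3x3 convs for a larger receptive field (large objects). Outputs are
    concatenated with the input, DenseNet-style."""

    def __init__(self, in_channels, growth_rate=32, bottleneck_width=2):
        super().__init__()

        def conv_bn_relu(cin, cout, k, padding=0):
            # post-activation ordering: Conv -> BN -> ReLU
            return nn.Sequential(
                nn.Conv2d(cin, cout, k, padding=padding, bias=False),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )

        inter = growth_rate * bottleneck_width // 2   # bottleneck width per branch
        half_growth = growth_rate // 2                # each branch adds half the growth rate
        self.branch1 = nn.Sequential(
            conv_bn_relu(in_channels, inter, 1),
            conv_bn_relu(inter, half_growth, 3, padding=1),
        )
        self.branch2 = nn.Sequential(
            conv_bn_relu(in_channels, inter, 1),
            conv_bn_relu(inter, half_growth, 3, padding=1),
            conv_bn_relu(half_growth, half_growth, 3, padding=1),
        )

    def forward(self, x):
        # dense connectivity: keep the input and append both branch outputs
        return torch.cat([x, self.branch1(x), self.branch2(x)], dim=1)
```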
Stem Block
Inception-v4 & DSOD 에서 영감을 얻음.
첫번째 layer에서, 즉, dense layer전에
"we design a cost efficient stem block before the first dense layer."
Dynamic Number of Channels in Bottleneck Layer
In DenseNet, the bottleneck channels in the first few dense layers are larger than the number of input channels, which increases computational cost.
So the number of bottleneck channels is set according to the input shape so that it never exceeds the number of input channels.
This saves up to 28.5% of the computational cost compared to DenseNet.
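A one-line paraphrase of the rule in code (my own illustration, not the authors' implementation): the bottleneck width is capped by the number of input channels instead of being fixed at 4x the growth rate.

```python
def bottleneck_channels(in_channels, growth_rate, bottleneck_width=4):
    """Cap the 1x1 bottleneck width so it never exceeds the input channels.

    DenseNet fixes the bottleneck at 4 * growth_rate, which is wasteful in the
    first few dense layers where the input is still narrow.
    """
    return min(bottleneck_width * growth_rate, in_channels)

# With growth_rate=32 the fixed DenseNet bottleneck would be 128 channels,
# but with only 64 input channels the dynamic version keeps it at 64.
print(bottleneck_channels(64, 32))   # -> 64
print(bottleneck_channels(256, 32))  # -> 128
```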
Transition Layer without Compression
In our experiments, the compression factor used by DenseNet was found to actually hurt feature expression.
So the number of output channels in the transition layer is kept the same as the number of input channels.
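A minimal sketch of a transition layer with the compression factor fixed to 1, i.e., the 1x1 conv keeps as many output channels as it receives (names are illustrative).

```python
import torch.nn as nn

class TransitionLayer(nn.Module):
    """Transition layer without compression: output channels == input channels,
    unlike DenseNet's default compression factor of 0.5."""

    def __init__(self, in_channels):
        super().__init__()
        self.trans = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=2, stride=2),
        )

    def forward(self, x):
        return self.trans(x)
```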
Composite Function (i.e., the ordered sequence of operations, Conv / BN / ReLU, applied in each layer)
To improve speed, post-activation (Convolution - Batch Normalization (Ioffe & Szegedy, 2015) - ReLU) is used as the composite function instead of the pre-activation used in DenseNet.
For post-activation, all batch normalization layers can be merged with the convolution layer at the inference stage, which can greatly accelerate the speed. To compensate for the negative impact on accuracy caused by this change, we use a shallow and wide network structure. We also add a 1x1 convolution layer after the last dense block to get stronger representational ability.
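The speed argument can be made concrete with the standard Conv+BN folding math (a generic sketch, not code from the paper): with the Conv-BN-ReLU ordering, each BatchNorm can be folded into the weights and bias of the convolution right before it, so BN costs nothing at inference.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm parameters into the preceding convolution.

    y = gamma * (conv(x) - mean) / sqrt(var + eps) + beta
      = conv'(x)  with  W' = W * gamma / sqrt(var + eps)
                        b' = (b - mean) * gamma / sqrt(var + eps) + beta
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups,
                      bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused
```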
We optimize the network architecture of the Single Shot MultiBox Detector (SSD) (Liu et al., 2016) for speed acceleration and then combine it with PeleeNet.
Combination with SSD
Feature Map Selection
A selected set of 5 scales of feature maps is used (19 x 19, 10 x 10, 5 x 5, 3 x 3, and 1 x 1).
The 38 x 38 feature map is not used, in order to reduce computational cost.
Residual Prediction Block
The feature map at each scale is fed into a residual prediction block (ResBlock), and the output of the ResBlock is used for the actual prediction.
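A sketch of the ResBlock, assuming a 1x1 / 3x3 / 1x1 residual branch with a 1x1 projection on the skip path; the 128/256 channel widths follow my reading of the paper's figure and should be treated as illustrative.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual prediction block applied to each selected feature map before
    the classification / localization heads."""

    def __init__(self, in_channels, mid_channels=128, out_channels=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, 1),
        )
        # 1x1 projection so the skip connection matches out_channels
        self.shortcut = nn.Conv2d(in_channels, out_channels, 1)

    def forward(self, x):
        return self.body(x) + self.shortcut(x)
```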
Small Convolutional Kernel for Prediction
1x1 kernels are applied wherever possible to predict categories and box offsets; in experiments, accuracy is almost the same as with 3x3 kernels, while computational cost is reduced by 21.5%.
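A small follow-on sketch of the prediction heads with 1x1 kernels on top of the ResBlock output; num_anchors and num_classes are placeholders.

```python
import torch.nn as nn

def make_heads(in_channels=256, num_anchors=6, num_classes=21):
    """1x1 convolution heads for category scores and box offsets
    (instead of SSD's 3x3 prediction kernels)."""
    cls_head = nn.Conv2d(in_channels, num_anchors * num_classes, kernel_size=1)
    loc_head = nn.Conv2d(in_channels, num_anchors * 4, kernel_size=1)
    return cls_head, loc_head
```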
We provide a benchmark test for different efficient classification models and different one-stage object detection methods on the NVIDIA TX2 embedded platform and iPhone 8.
https://arxiv.org/abs/1804.06882