FOTS: Fast Oriented Text Spotting with a Unified Network

Abstract

Incidental scene text spotting (detecting and recognizing text that happens to appear in a scene, e.g. in casually captured camera photos) is considered one of the most difficult and valuable challenges.
Most existing methods treat detection and recognition as separate tasks.
This work proposes an end-to-end trainable Fast Oriented Text Spotting (FOTS) network.
- simultaneous detection and recognition, sharing computation and visual information among the two complementary tasks.
In particular, RoIRotate is introduced to share convolutional features between detection and recognition.
Advantages of the computation-sharing strategy:
- little computation overhead compared to the baseline text detection network
- joint training performs better than training detection and recognition separately.
Outperforms state-of-the-art methods on the ICDAR 2015, ICDAR 2017 MLT, and ICDAR 2013 datasets.
Fast speed: 22.6 fps

end-to-end trainable framework
- fast oriented text spotting > detects rotated text, and does so very quickly
- sharing convolutional features > detection and recognition > real-time speed & little computation overhead
Introduces RoIRotate
- a new differentiable operator for extracting oriented text regions from convolutional feature maps
- unifies detection and recognition into an end-to-end pipeline.
Outperforms prior methods in text detection

FOTS - end-to-end trainable framework
- four parts: shared convolutions, text detection, RoIRotate, text recognition
  Overall Architecture
shared convolutions
- ResNet-50 backbone
- U-Net-like structure (low-level and high-level feature maps are concatenated)
- the shared feature map is 1/4 the resolution of the input image, not 1/2.
The output of text detection is passed to RoIRotate.
- converts corresponding shared features into fixed-height representations while keeping the original region aspect ratio. (Presumably so the result can serve as CRNN-style input.)
Finally, the recognition stage is a CNN-LSTM-CTC stack, almost identical to CRNN.
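The recognition stage ends in CTC, so the per-frame label sequence still has to be collapsed into text. A minimal sketch of greedy CTC decoding follows; the function name and the choice of index 0 as the blank label are assumptions for illustration, not details taken from the paper.

```python
from itertools import groupby


def ctc_greedy_decode(frame_labels, blank=0):
    """Collapse a per-frame label sequence the CTC way.

    frame_labels: best label index for each time step (e.g. argmax of the
    LSTM output). blank=0 is an assumed convention.
    """
    # Step 1: merge runs of identical consecutive labels.
    collapsed = [label for label, _ in groupby(frame_labels)]
    # Step 2: drop the blank symbol.
    return [label for label in collapsed if label != blank]


if __name__ == "__main__":
    # Frames "_ a a _ a b b _" decode to "a a b": the blank between the
    # two runs of 'a' keeps them as separate characters.
    print(ctc_greedy_decode([0, 1, 1, 0, 1, 2, 2, 0]))
```

Because the sequence length follows the width of the RoIRotate output, longer words simply produce more frames.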

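Inspired by EAST and Deep Direct Regression for Multi-Oriented Scene Text Detection
- the two network architectures differ, but both are FCN-based
Natural scenes contain many small text instances, so the feature maps are upscaled from 1/32 back to 1/4 of the input size (mirroring the down-scaling path) > Fig.3
- in shared convolutions.
After that,
- first channel > dense per-pixel prediction > is the pixel text or not?
- next four channels > as in EAST, pixels inside a shrunk version of the text region are positive samples; for each one, the distances to the top, bottom, left and right sides of the bounding box are predicted
- last channel > orientation of the bounding box > how much the word-level text is tilted
- then the predictions are combined and NMS is applied
loss: text classification & bounding box regression
Apart from the network (which is also very similar), the prediction/loss of the text detection branch seems to be taken directly from EAST.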
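To make the per-pixel output concrete, here is a rough sketch of turning one positive pixel (score, four side distances, one angle) into a rotated box, before NMS. The function names, the corner order, the sign convention of the angle, the 0.5 threshold and the assumption that distances are expressed in input-image pixels are all illustrative choices, not taken from the paper.

```python
import math


def decode_box(x, y, top, bottom, left, right, theta):
    """Return the 4 corners of the rotated box predicted at pixel (x, y).

    Corners are built in the box's own frame (origin at the pixel, y axis
    pointing down), then rotated by theta (radians) and shifted to (x, y).
    """
    local = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    c, s = math.cos(theta), math.sin(theta)
    return [(x + c * lx - s * ly, y + s * lx + c * ly) for lx, ly in local]


def decode_predictions(score, dist, angle, score_thresh=0.5, stride=4):
    """Decode every pixel whose text score exceeds score_thresh.

    score, angle: 2-D lists (H x W); dist: 4 x H x W in the order
    top, bottom, left, right. stride=4 reflects the 1/4-resolution map.
    """
    boxes = []
    for row, score_row in enumerate(score):
        for col, prob in enumerate(score_row):
            if prob > score_thresh:
                t, b, l, r = (dist[k][row][col] for k in range(4))
                boxes.append(decode_box(col * stride, row * stride,
                                        t, b, l, r, angle[row][col]))
    return boxes
```

The boxes produced this way overlap heavily for neighbouring pixels of the same word, which is why NMS follows.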

The regions obtained from text detection, i.e. rotated text regions, are aligned (made horizontal) to obtain axis-aligned feature maps > Fig.4
- the output height is fixed while the aspect ratio is kept unchanged
Used for extracting features for regions of interest. > output values are computed with bilinear interpolation (sampling at fractional coordinates instead of rounding them) > avoids misalignments between the RoI and the extracted features
The process has two steps: affine transformation & interpolation > yields the final feature map
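The two steps can be sketched in NumPy as follows. This is a simplified illustration, not the paper's formulation: the box is parameterised by an assumed centre, width, height and angle, `out_h=8` is an assumed default, samples outside the map are clamped to the border, and the function names are hypothetical.

```python
import numpy as np


def bilinear_sample(feat, xs, ys):
    """Sample feat (C, H, W) at float coordinates xs, ys of equal shape."""
    _, height, width = feat.shape
    # Assumption: coordinates outside the map are clamped to the border.
    xs = np.clip(xs, 0, width - 1)
    ys = np.clip(ys, 0, height - 1)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, width - 1)
    y1 = np.minimum(y0 + 1, height - 1)
    wx, wy = xs - x0, ys - y0
    top = feat[:, y0, x0] * (1 - wx) + feat[:, y0, x1] * wx
    bottom = feat[:, y1, x0] * (1 - wx) + feat[:, y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def roi_rotate(feat, cx, cy, box_w, box_h, theta, out_h=8):
    """Extract an axis-aligned (C, out_h, out_w) patch from a rotated box."""
    # Fixed height, width chosen so the aspect ratio is preserved.
    out_w = max(1, int(round(out_h * box_w / box_h)))
    # Step 1 (affine transformation): map each output cell centre to a
    # source coordinate: scale to box size, rotate by theta, translate.
    u = (np.arange(out_w) + 0.5) / out_w * box_w - box_w / 2
    v = (np.arange(out_h) + 0.5) / out_h * box_h - box_h / 2
    lx, ly = np.meshgrid(u, v)  # both of shape (out_h, out_w)
    c, s = np.cos(theta), np.sin(theta)
    src_x = cx + c * lx - s * ly
    src_y = cy + s * lx + c * ly
    # Step 2 (interpolation): read the feature map at those coordinates.
    return bilinear_sample(feat, src_x, src_y)
```

Since every operation is a weighted sum of feature values, gradients can flow back from recognition into the shared convolutions, which is what makes joint training possible.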

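Why is the detection performance so high?
- the network and algorithm do not seem to differ much from previous work
- the training data? transfer learning & differences in the training sets ??
- the next section may give a hint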

Uses a model pre-trained on ImageNet
training process includes two steps
- first, train on the Synth800k dataset
- then fine-tune on real data. *From this, it is clear that two rounds of transfer learning (fine-tuning) are used:
- ImageNet > Synth800k > real target data
Data augmentation > real data
- 1st) longer sides of images are resized to between 640 and 2560 pixels
- 2nd) images are rotated randomly in the range [−10°, 10°]
- 3rd) the heights of images are rescaled with ratio from 0.8 to 1.2 while their widths are kept unchanged.
- 4th) 640×640 random samples are cropped from the transformed images.
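The four augmentation steps can be read as sampling one set of random parameters per image. The sketch below only draws those parameters and leaves the actual image warping to whatever library is in use; the function name, the uniform distributions, the returned keys, and the simplification that rotation keeps the canvas size are assumptions.

```python
import random


def sample_augmentation(height, width, rng=random, crop=640):
    """Draw one set of augmentation parameters for an image of given size."""
    # 1) resize so the longer side lands somewhere in [640, 2560]
    long_side = rng.uniform(640, 2560)
    scale = long_side / max(height, width)
    # 2) random rotation angle in degrees
    angle_deg = rng.uniform(-10.0, 10.0)
    # 3) rescale the height only; the width is left unchanged
    height_ratio = rng.uniform(0.8, 1.2)
    new_w = int(round(width * scale))
    new_h = int(round(height * scale * height_ratio))
    # 4) top-left corner of a crop x crop window; if a side is shorter
    #    than the crop, start at 0 and let the caller pad.
    x0 = rng.randint(0, max(0, new_w - crop))
    y0 = rng.randint(0, max(0, new_h - crop))
    return {"scale": scale, "angle_deg": angle_deg,
            "height_ratio": height_ratio,
            "size": (new_h, new_w), "crop_origin": (y0, x0)}


if __name__ == "__main__":
    print(sample_augmentation(720, 1280, rng=random.Random(0)))
```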
Training uses OHEM (online hard example mining)
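The notes do not record how OHEM is configured, so the following is only a generic sketch of the idea: keep every positive pixel and only the negatives with the highest loss. The function name and the 3:1 negative-to-positive ratio are assumptions.

```python
def ohem_select(pixel_losses, is_positive, neg_ratio=3):
    """Return sorted indices of the pixels that contribute to the loss.

    pixel_losses: per-pixel loss values (flat list)
    is_positive:  per-pixel booleans, True for text pixels
    neg_ratio:    assumed number of hard negatives kept per positive
    """
    pos = [i for i, p in enumerate(is_positive) if p]
    neg = [i for i, p in enumerate(is_positive) if not p]
    # Keep at least one negative even when there are no positives.
    num_neg = min(len(neg), max(1, neg_ratio * len(pos)))
    # Hardest negatives are the ones the network currently gets most wrong.
    hard_neg = sorted(neg, key=lambda i: pixel_losses[i], reverse=True)[:num_neg]
    return sorted(pos + hard_neg)


if __name__ == "__main__":
    losses = [0.1, 0.9, 0.5, 0.2, 0.8, 0.05, 0.3]
    labels = [True, False, False, False, False, False, False]
    print(ohem_select(losses, labels, neg_ratio=2))
```

Averaging the loss over the selected indices only keeps the many easy background pixels from dominating the gradient.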