PixelLink: Detecting Scene Text via Instance Segmentation

abstract

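Text detection is solved as a form of semantic segmentation.
- However, text instances in natural scenes often lie so close together that semantic segmentation alone cannot separate them. > instance segmentation
The paper approaches this problem with PixelLink, an instance segmentation method.
- It is basically an FCN, so it feels closer to semantic segmentation than to instance segmentation, but since it separates text at the word level it counts as instance segmentation in that sense.
Advantages: many fewer training iterations and less training data.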

Motivation

To solve the problem above (instance segmentation)
- Algorithms such as EAST have almost the same base network but add box regression at the end > PixelLink has no such layers and finds text locations directly.
- It extracts text locations directly from an instance segmentation result, instead of from bounding box regression.
Two main parts
- text/non-text prediction
- link prediction

Detecting Text via Instance Segmentation

backbone - VGG16
- widely used in earlier similar work - SSD and SegLink
Two settings of feature fusion layers are implemented:
- PixelLink+VGG16 2s : {conv2_2, conv3_3, conv4_3, conv5_3, fc_7},
- PixelLink+VGG16 4s : {conv3_3, conv4_3, conv5_3, fc_7},

Linking Pixels Together

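Predictions for pixels and links are given.
- Their exact form is not described well. > Are the two kept as separate maps? A precise figure would make this easier to understand.
- Each is thresholded separately, with its own threshold.
Positive pixels (those that represent text) are grouped by links > instance segmentation is achieved.
Given two neighboring positive pixels, the link between them is predicted by both pixels, and the two are connected when one or both link predictions are positive.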

Given predictions on pixels and links, two different thresholds can be applied on them separately.
Positive pixels are then grouped together using positive links, resulting in a collection of CCs,
each representing a detected text instance.
Thus instance segmentation is achieved. It is worth noting that, given two neighboring
positive pixels, their link are predicted by both of them, and they should be connected
when one or both of the two link predictions are positive.
This linking process can be implemented using disjoint-set data structure.
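
The linking step described above can be sketched with a small disjoint-set (union-find) in pure Python. This is only an illustration, not the authors' code: the function names, the `NEIGHBORS` ordering, the nested-list input layout, and the 0.5 default thresholds are all assumptions made for the example.

```python
# Assumed (dy, dx) order of the 8 neighbors; the paper's channel order may differ.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def find(parent, x):
    """Return the root of x, halving the path on the way up."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def union(parent, a, b):
    """Merge the sets that contain a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def link_pixels(pixel_scores, link_scores, pixel_thr=0.5, link_thr=0.5):
    """Group positive pixels into connected components through positive links.

    pixel_scores: H x W nested list of text probabilities.
    link_scores:  H x W x 8 nested list of link probabilities (NEIGHBORS order).
    Returns an H x W label map: 0 is background, 1..K are text instances.
    """
    h, w = len(pixel_scores), len(pixel_scores[0])
    # One disjoint-set node per positive pixel.
    parent = {(y, x): (y, x)
              for y in range(h) for x in range(w)
              if pixel_scores[y][x] >= pixel_thr}
    for (y, x) in list(parent):
        for k, (dy, dx) in enumerate(NEIGHBORS):
            neighbor = (y + dy, x + dx)
            # Each pixel checks its own outgoing link; the neighbor checks the
            # reverse one on its turn, so one positive prediction is enough.
            if neighbor in parent and link_scores[y][x][k] >= link_thr:
                union(parent, (y, x), neighbor)
    labels = [[0] * w for _ in range(h)]
    roots = {}
    for (y, x) in parent:
        root = find(parent, (y, x))
        labels[y][x] = roots.setdefault(root, len(roots) + 1)
    return labels
```

Two adjacent positive pixels with no positive link between them end up with different labels, which is what lets close text instances be separated.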

Extraction of Bounding Boxes

Boxes are extracted with a method like minAreaRect in OpenCV > applied to each component after connected component labeling?

Post Filtering after Segmentation

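Since pixels are grouped through links, some noisy predictions are inevitable, so detected boxes are post-filtered.
- The thresholds seem to be set differently according to the characteristics of each benchmark set.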
  - a detected box is abandoned if its shorter side is less than 10 pixels or if its area is smaller than 300. The 10 and 300 are statistical results on the training data of IC15.
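
The IC15 rule quoted above amounts to a simple geometric filter. A minimal sketch follows; `keep_box` and its parameter names are made up for illustration, and the defaults are the 10 and 300 quoted above.

```python
def keep_box(width, height, min_short_side=10, min_area=300):
    """Return True if a detected box survives post filtering.

    A box is dropped when its shorter side is below min_short_side pixels
    or its area is below min_area (IC15 values by default).
    """
    short_side = min(width, height)
    area = width * height
    return short_side >= min_short_side and area >= min_area


# Usage: (width, height) pairs, e.g. taken from rotated rectangles.
boxes = [(40, 12), (100, 8), (15, 15)]
kept = [box for box in boxes if keep_box(*box)]
```

For another benchmark the two thresholds would presumably be re-estimated from that benchmark's training data.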

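Overall process (Fig.)

Two pixel-wise predictions are made
- text/non-text
- link
As a rule, positive pixels are grouped by links.
The bounding box is computed with the minAreaRect algorithm -> rotated boxes are possible.
Link prediction seems to take the form of heatmaps for 8 directions.. (Do I need to look at the loss to see how these are obtained? Judging from the loss, a rectangular form seems right. > It is unclear whether they are kept separate..)
- Looking at the figure, combining all of them can separate different text instances that lie close to each other - a suitable threshold seems necessary.
- The 8 directions are determined by how many neighboring pixels are considered.
  - For example, with 8, the neighbors of the current center pixel are upper-left, up, upper-right, left, right, lower-left, lower-right, and down, i.e. the pixels at a 1-pixel shift, and the link loss seems to be computed for each. So 8 heatmaps like those above appear to be generated and combined for the final detection.
What exactly pixel & link are is not spelled out. My guess is that it is simply the whole word-level rectangle (judging from the figure above..) - since this is a segmentation approach, the label image itself is probably a word-level b/w image: white for word, black for background.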
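
One way to make the 8-direction links concrete is to build link labels from a word-level instance map. The sketch below assumes a link is positive when a text pixel and its neighbor carry the same instance id, which is how I read the paper's ground-truth description. The function name, the `NEIGHBORS` ordering, and the use of 0 for background are assumptions.

```python
# Assumed (dy, dx) order: upper-left, up, upper-right, left, right,
# lower-left, down, lower-right.
NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]


def link_labels(instance_map):
    """Build H x W x 8 link labels from an H x W instance-id map.

    instance_map: nested list, 0 for background, 1..K for word instances.
    links[y][x][k] is 1 when pixel (y, x) is text and its k-th neighbor
    belongs to the same instance, otherwise 0.
    """
    h, w = len(instance_map), len(instance_map[0])
    links = [[[0] * 8 for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            inst = instance_map[y][x]
            if inst == 0:
                continue  # background pixels carry no positive links
            for k, (dy, dx) in enumerate(NEIGHBORS):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and instance_map[ny][nx] == inst:
                    links[y][x][k] = 1
    return links
```

Under this reading the pixel label is the word mask itself, and the 8 link maps only differ from it along instance borders, where links toward a different word or the background are negative.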

Overall network structure

It looks like U-Net-style FCN segmentation.. but the term U-Net never appears in the paper.
It also looks very similar to EAST.

Loss Function

Loss on Pixels
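- Positive pixels > without distinguishing instances, every word pixel would get the same weight.
- That is the baseline, but instance size is taken into account. > (I do not think it would hurt much.. but it matters if you want even 0.1% more performance.)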
  - it’s unfair to instances with small areas, and may hurt the performance.
- Instance-Balanced Cross-Entropy Loss
- Online Hard Example Mining (OHEM) is applied - only the hard examples are used..
  - The actual implementation uses the term "Online Hard Negative Mining".
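
The instance-balanced weighting for positive pixels can be sketched as follows. This reflects my reading of the paper: with N instances covering S positive pixels in total, every instance gets the same total weight B = S / N, so a pixel in an instance of area S_i gets weight B / S_i. The function name and input layout are hypothetical, and the weighting of negative pixels selected by OHEM is left out.

```python
from collections import Counter


def instance_balanced_weights(instance_map):
    """Per-pixel weights for positive pixels (0.0 for background).

    instance_map: nested list, 0 for background, 1..K for word instances.
    Every instance receives the same total weight B = S / N, spread evenly
    over its own pixels, so small instances are not drowned out.
    """
    areas = Counter(v for row in instance_map for v in row if v != 0)
    if not areas:
        return [[0.0 for _ in row] for row in instance_map]
    total_area = sum(areas.values())          # S
    per_instance = total_area / len(areas)    # B = S / N
    return [[per_instance / areas[v] if v != 0 else 0.0 for v in row]
            for row in instance_map]
```

The weights still sum to S, so the overall scale of the positive loss stays the same while each word contributes equally.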
Loss on Links
- losses for positive and negative links are computed separately
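
A rough sketch of keeping positive and negative link losses apart is given below. It assumes, from my reading of the paper, that each link inherits the weight of its pixel and that the positive and negative parts are each normalized by their own weight sum before being added. The function name and the nested-list inputs are hypothetical, and the per-link cross-entropy values are taken as given.

```python
def link_loss(link_ce, link_gt, pixel_weights):
    """Combine per-link cross-entropy into one loss value.

    link_ce:       H x W x 8 per-link cross-entropy values.
    link_gt:       H x W x 8 link labels (1 positive, 0 negative).
    pixel_weights: H x W weights; 0.0 means the pixel is ignored.
    """
    pos_sum = pos_w = neg_sum = neg_w = 0.0
    for ce_row, gt_row, w_row in zip(link_ce, link_gt, pixel_weights):
        for ce_px, gt_px, w in zip(ce_row, gt_row, w_row):
            for ce, gt in zip(ce_px, gt_px):
                if gt == 1:
                    pos_sum += w * ce
                    pos_w += w
                else:
                    neg_sum += w * ce
                    neg_w += w
    # Normalize each part by its own weight sum; skip a part that is empty.
    loss = 0.0
    if pos_w > 0:
        loss += pos_sum / pos_w
    if neg_w > 0:
        loss += neg_sum / neg_w
    return loss
```

Normalizing the two parts separately keeps the usually more numerous kind of link from dominating the other.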

Experiments

Results on each benchmark set
It works without using much training data - what I wonder is how much better it would get if all of it were used??
