STVIR / PMTD

Pyramid Mask Text Detector designed by SenseTime Video Intelligence Research team.

Question about Algorithm 1 Plane Clustering #3

Status: Open. bluekingsong opened this issue 5 years ago.

bluekingsong commented 5 years ago

Q1. Why does Algorithm 1 take the whole segmentation result of an image (H*W points) as input, while its output is only a single text bounding box (4 planes)?

Q2. What are the details of the INITPLANES function? What are the parameters (A, B, D) after calling the function? I can't find this in the paper.

Thanks!

JingChaoLiu commented 5 years ago

Q1. Why does Algorithm 1 take the whole segmentation result of an image (H*W points) as input, while its output is only a single text bounding box (4 planes)?

A1: The plane clustering algorithm aims to rebuild the pyramid of a single text region from its text mask (text_mask = Tensor[C=1, H=28, W=28]). So the input of this algorithm is actually one text mask, not the whole image. Since an image normally contains around a dozen text regions, plane clustering rebuilds one pyramid for each text mask separately.
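To make the per-region input concrete, here is a minimal NumPy sketch (not the PMTD code) that turns one 28x28 text mask into a point list P of shape (point_num, 3) and repeats that for every region in an image. The function names `mask_to_points` and `points_per_region`, the `threshold` used to decide which points are "positive", and the choice of pixel-center coordinates (index + 0.5, so the grid spans 0 to 28 like the corner points in A2) are all assumptions made for illustration.

```python
import numpy as np

def mask_to_points(text_mask, threshold=0.0):
    """Hypothetical helper: one pyramid mask -> point list P of shape (N, 3).

    Each row is (x, y, z): x/y are pixel-center coordinates (an assumption,
    index + 0.5) and z is the predicted mask value at that pixel.
    `threshold` is an assumed cut-off for "positive" points.
    """
    mask = np.asarray(text_mask, dtype=np.float64)
    if mask.ndim == 3:  # accept the [C=1, H, W] layout mentioned in the thread
        mask = mask[0]
    ys, xs = np.nonzero(mask > threshold)  # row-major order
    zs = mask[ys, xs]
    return np.stack([xs + 0.5, ys + 0.5, zs], axis=1)

def points_per_region(text_masks, threshold=0.0):
    """Hypothetical helper: one point list per text region in an image.

    Each list would be clustered into its own pyramid independently.
    """
    return [mask_to_points(m, threshold) for m in text_masks]
```

Each element of the returned list is what a single run of plane clustering would consume, which is why Algorithm 1 outputs one box per call.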

Q2. What are the details of the INITPLANES function? What are the parameters (A, B, D) after calling the function? I can't find this in the paper.

A2: Given the positive point list P = Tensor[point_num, Channel={x, y, z}], INIT_PLANES does the following:

  1. Calculate the approximate apex of the pyramid, denoted E:
    E.x, E.y = mean(P[:, :2])
    E.z = 1
  2. Produce the initial planes. We simply link the apex E to the corner points {(0, 0, 0), (0, 28, 0), (28, 28, 0), (28, 0, 0)} to form the four inclined planes. Note: (0, 0, 0) means x=0, y=0, z=0.

    Another thing worth mentioning: during clustering, this algorithm only cares about the four inclined planes as independent planes. In other words, we only require the four inclined planes to be independent, without the constraint of sharing a common apex. This is why both a pyramid and a square frustum can be rebuilt from the text mask.
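The two INIT_PLANES steps can be sketched in NumPy as below. This is an illustrative reading of the description above, not the repository's implementation: the names `CORNERS`, `plane_through` and `init_planes` are hypothetical, and the plane form z = A*x + B*y + D is an assumption inferred from the (A, B, D) naming in the question. Each initial plane passes through E and two adjacent corners, so its coefficients come from solving a 3x3 linear system.

```python
import numpy as np

# Corner points from the thread: the four corners of the 28x28 mask at z = 0,
# listed so that consecutive entries are adjacent corners.
CORNERS = np.array([[0, 0, 0], [0, 28, 0], [28, 28, 0], [28, 0, 0]], dtype=np.float64)

def plane_through(p0, p1, p2):
    """Hypothetical helper: (A, B, D) of the plane z = A*x + B*y + D.

    The z = A*x + B*y + D form is an assumption. np.linalg.solve raises
    LinAlgError if the three points define a vertical (degenerate) plane.
    """
    pts = np.array([p0, p1, p2], dtype=np.float64)
    lhs = np.column_stack([pts[:, 0], pts[:, 1], np.ones(3)])
    a, b, d = np.linalg.solve(lhs, pts[:, 2])
    return a, b, d

def init_planes(P):
    """Hypothetical sketch of INIT_PLANES for a point list P of shape (N, 3)."""
    P = np.asarray(P, dtype=np.float64)
    # Step 1: approximate apex E = (mean x, mean y, 1).
    ex, ey = P[:, :2].mean(axis=0)
    apex = np.array([ex, ey, 1.0])
    # Step 2: each inclined plane links E to one edge (two adjacent corners).
    planes = [plane_through(apex, CORNERS[i], CORNERS[(i + 1) % 4]) for i in range(4)]
    return apex, np.array(planes)  # planes has shape (4, 3): rows of (A, B, D)
```

For points centred on (14, 14), the four initial planes come out as (1/14, 0, 0), (0, -1/14, 2), (-1/14, 0, 2) and (0, 1/14, 0). Because the clustering only requires the four planes to be independent, these shared-apex planes are just a starting point; each one can then be refined on its own, which is what allows a frustum as well as a pyramid.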