STVIR / PMTD

Pyramid Mask Text Detector designed by SenseTime Video Intelligence Research team.

Question about crop step in Data augmentation #5

Open HXACA opened 5 years ago

HXACA commented 5 years ago

Because the gt area is not pure text, I get many wrong regions when I randomly crop the resized image. Are there any tricks for this step?

JingChaoLiu commented 5 years ago

Q: Because the gt area is not pure text, I get many wrong regions when I randomly crop the resized image. Are there any tricks for this step?

No, we don't apply any tricks in the cropping procedure. But you may need to pay attention to some details of cropping images and generating pyramid labels.


The steps of cropping images and generating pyramid labels are as follows:

[Figure: the correct cropped bbox]

  1. For training speed, we keep the mask as a list of points, not an H*W image, until the samples are forwarded to the mask branch.

  2. Crop the original text mask:

    cropped_text_mask = crop_region ∩ origin_text_mask
                     = Polygon[cropped_points_num, {x,y}]

    note: cropped_points_num may vary from 3 to 8.

  3. Get the bounding box by wrapping the cropped mask with a new bounding box, rather than by cropping the original bounding box. As illustrated in the image above, the cropped original bounding box may be larger than the correct cropped bounding box.

  4. Generate the pyramid label for the corresponding predicted bounding box. In our setting, pyramid label generation is deferred to the stage where the mask loss is calculated.

    predicted_bounding_box = (left, top, bottom, right)
    mask_label = cropped_text_mask ∩ predicted_bounding_box
              = Tensor[Channel=1, H=28, W=28] # pyramid label or binary label

    note: though the points_num of the cropped_text_mask varies from 3 to 8, the pyramid label can still handle this variance.
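
Steps 2 and 3 above can be sketched in plain Python. This is only an illustration: the thread does not say how the intersection is computed, so Sutherland-Hodgman clipping is an assumption here, and clip_polygon and wrap_bbox are hypothetical helper names. Coordinates stay in the original image frame; shifting them to the crop origin is left out.

```python
def clip_polygon(points, left, top, right, bottom):
    """Clip a convex polygon [(x, y), ...] to an axis-aligned crop region.

    Uses Sutherland-Hodgman clipping (an assumed choice, not from the thread).
    """

    def clip_edge(poly, inside, intersect):
        # Walk the polygon edges and keep the part on the inner side.
        out = []
        for i, cur in enumerate(poly):
            prev = poly[i - 1]
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    def cross_x(x):
        # Intersection of segment a-b with the vertical line at x.
        def f(a, b):
            t = (x - a[0]) / (b[0] - a[0])
            return (float(x), a[1] + t * (b[1] - a[1]))
        return f

    def cross_y(y):
        # Intersection of segment a-b with the horizontal line at y.
        def f(a, b):
            t = (y - a[1]) / (b[1] - a[1])
            return (a[0] + t * (b[0] - a[0]), float(y))
        return f

    poly = [(float(x), float(y)) for x, y in points]
    borders = (
        (lambda p: p[0] >= left, cross_x(left)),
        (lambda p: p[0] <= right, cross_x(right)),
        (lambda p: p[1] >= top, cross_y(top)),
        (lambda p: p[1] <= bottom, cross_y(bottom)),
    )
    for inside, intersect in borders:
        if not poly:
            break  # the polygon lies entirely outside the crop region
        poly = clip_edge(poly, inside, intersect)
    return poly


def wrap_bbox(points):
    """Tight (left, top, right, bottom) box around a list of points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# A slanted text quadrilateral, cropped to the left half of the image.
quad = [(0, 40), (100, 0), (100, 20), (0, 60)]
cropped_text_mask = clip_polygon(quad, left=0, top=0, right=50, bottom=100)
print(cropped_text_mask, wrap_bbox(cropped_text_mask))
```

In this example, cropping the original bounding box (0, 0, 100, 60) would give (0, 0, 50, 60), while wrapping the clipped polygon gives the tighter box (0, 20, 50, 60).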

[Figure: pyramid_label]

HXACA commented 5 years ago

@JingChaoLiu Thanks for your response

soldierofhell commented 5 years ago

Hi, my questions are actually about pyramid label generation, not cropping, but I'll use quotes from this issue :)

  1. For training speed, we keep the mask as a list of points, not an H*W image, until the samples are forwarded to the mask branch.

You mean you keep them in the form of vertices, not interior points, right? So in terms of maskrcnn_benchmark, they are PolygonInstances?

  4. Generate the pyramid label for the corresponding predicted bounding box. In our setting, pyramid label generation is deferred to the stage where the mask loss is calculated.

So they're calculated on a 28x28 grid? Something like:

    for p in grid_28x28:
        for v in vertices:
            [alpha, beta] = A^-1 * b
            if alpha >= 0 and beta >= 0:
                score(p) = max(1 - (alpha + beta), 0)

JingChaoLiu commented 5 years ago

You mean you keep them in the form of vertices, not interior points, right? So in terms of maskrcnn_benchmark, they are PolygonInstances?

Yes

So they're calculated on 28x28 grid? Something like: ...

Denote the ground-truth mask point list as P=Tensor[points_num, {x,y}] and the predicted bounding box as pred_box = {pred_top, pred_bottom, pred_left, pred_right}. Furthermore, define pred_h = pred_bottom - pred_top and pred_w = pred_right - pred_left. We have tried two schemas:

  1. generate a mask_label within {pred_top, pred_bottom, pred_left, pred_right} based on P, then resize this mask_label from the scale of [pred_h, pred_w] to the scale of [28, 28]

  2. map pred_box from {pred_top, pred_bottom, pred_left, pred_right} to {0, 28, 0, 28} and apply the same map to the points list P, i.e. resized_P = (P - (pred_left, pred_top)) * (28/pred_w, 28/pred_h) since P is stored as {x,y}; finally, generate a mask_label within {0, 28, 0, 28} based on resized_P

The schema you mentioned may be schema 2. In our experiments, schema 2 scores 0.3% lower in F-measure than schema 1, but it is much more efficient in both memory and computation: training with schema 2 takes two-thirds of the time of schema 1.
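
The coordinate map in schema 2 can be sketched as follows. This is a minimal illustration, not the project's code: resize_points_to_mask_grid is a hypothetical name, and the sketch assumes P stores {x, y}, so x is scaled by 28/pred_w and y by 28/pred_h.

```python
import numpy as np


def resize_points_to_mask_grid(P, pred_top, pred_bottom, pred_left, pred_right, size=28):
    """Map polygon points from image coordinates into a size x size mask grid.

    P has shape [points_num, {x, y}]. Points outside pred_box map outside [0, size].
    """
    pred_h = pred_bottom - pred_top
    pred_w = pred_right - pred_left
    offset = np.array([pred_left, pred_top], dtype=np.float64)         # (x, y) order
    scale = np.array([size / pred_w, size / pred_h], dtype=np.float64)  # (x, y) order
    return (np.asarray(P, dtype=np.float64) - offset) * scale


# The corners of pred_box itself land on the corners of the 28x28 grid.
P = [[20, 10], [160, 10], [160, 66], [20, 66]]
resized_P = resize_points_to_mask_grid(P, pred_top=10, pred_bottom=66,
                                       pred_left=20, pred_right=160)
print(resized_P)
```

The resulting resized_P could then be rasterized directly at 28x28, for example with the generate_pyramid_label(28, 28, resized_P) function shared later in this thread.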

soldierofhell commented 5 years ago

Thank you @JingChaoLiu for your valuable analysis. It seems like current maskrcnn-benchmark approach is closer to 2., because there're basically three steps:

I don't get why they're not using RoIAlign here for efficiency.

By the way, is matrix inversion really necessary for calculating the target? This pyramid function seems very "regular", and I'm surprised there's no "analytic" formula. If not, maybe some other form, like a "stepwise" pyramid, would be better for training efficiency? Actually, I guess the polygon approach is a more refined version of the idea from EAST, where the gt "mass" was uniformly concentrated in the center.

Regards,
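
On the matrix inversion question: a 2x2 system has a closed-form solution by Cramer's rule, so the coefficients can be written without an explicit inverse. The notation below is assumed for illustration and is not from the thread: q is the offset of grid point p from the polygon center, and v_i, v_{i+1} are the offsets of two adjacent vertices.

```latex
\begin{align*}
q &= \alpha\, v_i + \beta\, v_{i+1}, \qquad a \times b := a_x b_y - a_y b_x \\
\alpha &= \frac{q \times v_{i+1}}{v_i \times v_{i+1}}, \qquad
\beta = \frac{v_i \times q}{v_i \times v_{i+1}} \\
\operatorname{score}(p) &= \max\bigl(1 - (\alpha + \beta),\, 0\bigr)
\quad \text{if } \alpha \ge 0 \text{ and } \beta \ge 0
\end{align*}
```

The denominator is zero only when v_i and v_{i+1} are collinear with the center, which is the degenerate case in which the 2x2 matrix has no inverse either.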

donglin8506 commented 5 years ago

Could you share the code of generating Pyramid label?

JingChaoLiu commented 5 years ago

Here is a simplified version. Adjust this code as needed. @donglin8506

import cv2
import numpy as np

def generate_pyramid_label(H, W, corner_points):
    """

    :param int H: image_H
    :param int W: image_W
    :param np.ndarray corner_points: dtype=np.float32, shape=[point_num, {x,y}] 3 <= point_num <= 8
    :return: np.ndarray ans: dtype=np.float32, shape=[H, W]

    generate a pyramid label from corner_points 
      within the bounding box {box_top=0, box_bottom=H, box_left=0, box_right=W}
    """
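    point_num = len(corner_points)
    # The mean of the corner points is the pyramid apex (label value 1).
    center = corner_points.mean(axis=0)
    vectors = corner_points - center
    # For each pair of adjacent corners, invert the 2x2 basis [v_i, v_{i+1}] so that
    # a pixel offset q can be written as q = alpha * v_i + beta * v_{i+1}.
    matrices = np.empty((point_num, 2, 2), dtype=np.float32)
    for i in range(point_num):
        m = vectors[[i, (i + 1) % point_num]].T
        matrices[i] = np.linalg.pinv(m)
    # Pixel coordinates relative to the center.
    points = np.empty((H, W, 2), dtype=np.float32)  # H, W, {x, y}
    points[:, :, 0] = np.arange(W)
    points[:, :, 1] = np.arange(H)[..., None]
    points -= center
    # (alpha, beta) of every pixel for every triangle: shape [point_num, H, W, 2, 1]
    ans: np.ndarray = np.matmul(matrices[:, None, None, ...], points[..., None])
    ans = ans.squeeze(-1)  # drop only the trailing axis, so H == 1 or W == 1 still works
    # Keep alpha + beta only where the pixel lies in the cone of triangle i.
    ans = (ans >= 0).all(axis=-1) * ans.sum(axis=-1)
    ans = np.max(ans, axis=0)
    # alpha + beta is 0 at the center and 1 on the polygon edge; clip the outside to 0.
    ans = np.maximum(1 - ans, 0)
    return ans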

def main():
    H, W = 150, 224
    corner_points = np.array([
        187, 0,
        224, 80,
        30, 150,
        0, 65
    ], dtype=np.float32).reshape(-1, 2)

    ans = generate_pyramid_label(H, W, corner_points)

    cv2.imshow('image', ans)
    cv2.waitKey(0)

if __name__ == '__main__':
    main()

donglin8506 commented 5 years ago

@JingChaoLiu Thank you very much, this helps a lot. Best regards!

insightcs commented 5 years ago

@JingChaoLiu Thank you for your great work. I have a question about generating pyramid labels: I generate the pyramid mask your way, but it also has a few white dots, as shown in the figures. Does this affect model training? Thanks for your help.

[image] [image]

JingChaoLiu commented 5 years ago

@insightcs It's OK. This won't hurt model training. The phenomenon is caused by the numerical instability of the matrix inversion in matrices[i] = np.linalg.pinv(m).
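
For readers who want to avoid the dots anyway, here is one possible variant. In the posted code, a pixel that fails the alpha, beta >= 0 test for every triangle ends up with label 1. The sketch below assumes that is where the white dots come from. It allows a small tolerance eps in the sign test and gives label 0 to pixels that no triangle claims. pyramid_label_with_tolerance and eps are hypothetical names and this is not the authors' code.

```python
import numpy as np


def pyramid_label_with_tolerance(H, W, corner_points, eps=1e-6):
    """Pyramid label of shape [H, W] with a tolerant inside-the-cone test."""
    pts = np.asarray(corner_points, dtype=np.float64)
    center = pts.mean(axis=0)
    vectors = pts - center                       # v_i, shape [N, 2]
    nxt = np.roll(vectors, -1, axis=0)           # v_{i+1}, wrapping around
    bases = np.stack([vectors, nxt], axis=-1)    # [N, 2, 2], columns v_i and v_{i+1}
    inv = np.linalg.pinv(bases)                  # pinv works on stacked matrices

    ys, xs = np.mgrid[0:H, 0:W]
    offsets = np.stack([xs, ys], axis=-1) - center       # [H, W, {x, y}]
    # coeffs[n, h, w] = (alpha, beta) of pixel (w, h) in triangle n
    coeffs = np.einsum('nij,hwj->nhwi', inv, offsets)

    # Accept coefficients that are negative only by rounding error.
    inside = (coeffs >= -eps).all(axis=-1)
    # Pixels claimed by no triangle get +inf here and therefore label 0.
    dist = np.where(inside, coeffs.sum(axis=-1), np.inf).min(axis=0)
    return np.clip(1.0 - dist, 0.0, 1.0)


corners = np.array([[187, 0], [224, 80], [30, 150], [0, 65]], dtype=np.float32)
label = pyramid_label_with_tolerance(150, 224, corners)
print(label.shape, label.min(), label.max())
```

The float64 arithmetic and the value of eps are illustrative choices; whether removing the dots changes training results is not something this thread reports.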

xxlxx1 commented 5 years ago

@insightcs Hi, if I want to use this soft mask label, do I need to add this code to the project myself? I can't find anything about the soft mask label in the project.