MichalBusta / E2E-MLT

E2E-MLT - an Unconstrained End-to-End Method for Multi-Language Scene Text
MIT License

Affine grid parameters to crop predicted Bounding Boxes without transformation #79

Open AniketGurav opened 2 years ago

AniketGurav commented 2 years ago

I am having difficulty understanding the parameters of affine_grid.

The corresponding lines are 233-234 in train.py.

As I understand it, the following happens in the function process_boxes, to which the above lines belong:

  1. The localization part of the network has already predicted all the bounding boxes (BB) of scene text.
  2. While iterating through all predicted BBs, an STN (Spatial Transformer Network) is used to crop the specific text word from the entire image.
  3. The cropped images are passed through the OCR branch.
  4. The OCR loss is backpropagated. The affine_grid, which is part of the STN, takes the parameter theta (line 233 in train.py).

This theta is a 2x3 matrix whose last column is the center coordinate of the predicted crop, while the first two columns encode transformations such as rotation and scaling.
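For reference, here is a minimal sketch of how an axis-aligned crop maps to theta in PyTorch's affine_grid, which works in normalized [-1, 1] coordinates (the image size, box center, and crop size below are made-up example values, not taken from train.py):

```python
import torch
import torch.nn.functional as F

def crop_theta(cx, cy, w, h, W, H):
    """Theta for an axis-aligned crop of size (w, h) centred at pixel
    (cx, cy) in a W x H image. affine_grid uses normalized [-1, 1]
    coordinates, so both the scale and the translation are normalized."""
    return torch.tensor([[w / W, 0.0, 2.0 * cx / W - 1.0],
                         [0.0, h / H, 2.0 * cy / H - 1.0]])

# Hypothetical example: crop a 32x32 patch centred at pixel (48, 16)
# from a 1x3x64x96 image.
img = torch.arange(1 * 3 * 64 * 96, dtype=torch.float32).reshape(1, 3, 64, 96)
theta = crop_theta(cx=48, cy=16, w=32, h=32, W=96, H=64).unsqueeze(0)
grid = F.affine_grid(theta, size=(1, 3, 32, 32), align_corners=False)
crop = F.grid_sample(img, grid, align_corners=False)
print(crop.shape)  # torch.Size([1, 3, 32, 32])
```

With the diagonal set to crop_size / image_size instead of 1, the sampling window covers only the box rather than the whole image, so no stretching occurs.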

When this part is used, I found that the cropped image gets distorted by the affine_grid transformation, and this may affect the OCR output.

What I want is only the cropped text image, without any transformation, using the STN (affine_grid). I have tried the following values for the theta matrix:

[ 1  0  predX
  0  1  predY ]

where predX and predY are the centers of the predicted bounding boxes.

Even after applying this, the crops are sometimes unrecognizable or look significantly different from the original regions.
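A guess at why this theta misbehaves (I have not run it against the repo): with a diagonal of 1 the sampling window spans the whole image, and a pixel-space predX/predY pushes the grid far outside the [-1, 1] range that affine_grid expects. A small check with hypothetical pixel values:

```python
import torch
import torch.nn.functional as F

# Hypothetical numbers, not from train.py: a 64x96 image and a
# predicted box centre at pixel (48, 16).
W, H, cx, cy = 96, 64, 48.0, 16.0

# Pixel-space translation: grid coordinates land far outside [-1, 1],
# so grid_sample would mostly sample padding.
bad_theta = torch.tensor([[[1.0, 0.0, cx], [0.0, 1.0, cy]]])
bad_grid = F.affine_grid(bad_theta, size=(1, 1, 32, 32), align_corners=False)

# Normalized translation keeps the grid near the image.
tx, ty = 2 * cx / W - 1, 2 * cy / H - 1
ok_theta = torch.tensor([[[1.0, 0.0, tx], [0.0, 1.0, ty]]])
ok_grid = F.affine_grid(ok_theta, size=(1, 1, 32, 32), align_corners=False)

print(bad_grid.abs().max().item(), ok_grid.abs().max().item())
```

Even with the translation normalized, a diagonal of 1 still resamples an image-sized window into the crop, so the scale terms also need to be crop_size / image_size for a pure crop.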

So, in short, can you suggest theta parameters such that it only crops the BB predicted by the network, without any transformation?