Canjie-Luo / MORAN_v2

MORAN: A Multi-Object Rectified Attention Network for Scene Text Recognition
MIT License

Are there any rectification methods that can process arbitrary-length samples in a batch without a fixed input size (e.g. 32x100)? #102

Closed: liangzimei closed this issue 5 years ago

liangzimei commented 5 years ago

Thanks for your great work; it works well. In practice, however, to handle very long text at inference time (the training set has no such long samples), we often train a model on inputs that are resized with the aspect ratio preserved and then padded, rather than resized to a fixed size (e.g. 32x100).
Without a rectification module, this works. With a rectification module, keeping the aspect ratio and padding becomes difficult. Are there any rectification methods that can process arbitrary-length samples in a batch without a fixed input size (e.g. 32x100)? Thanks in advance.
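For readers unfamiliar with the keep-ratio-and-pad batching described above, here is a minimal sketch. It is not taken from MORAN or any particular codebase: the function names `resize_keep_ratio` and `pad_batch` are hypothetical, images are assumed to be 2-D grayscale NumPy arrays, and nearest-neighbour lookup stands in for whatever interpolation a real pipeline would use.

```python
import numpy as np

def resize_keep_ratio(img, target_h=32):
    """Resize a (H, W) array to height target_h, preserving aspect ratio.

    Nearest-neighbour lookup is used only to keep the sketch dependency-free.
    """
    h, w = img.shape
    target_w = max(1, int(round(w * target_h / h)))
    rows = np.arange(target_h) * h // target_h   # source row for each output row
    cols = np.arange(target_w) * w // target_w   # source column for each output column
    return img[rows[:, None], cols[None, :]]

def pad_batch(images, pad_value=0):
    """Right-pad same-height (H, W_i) arrays to the widest W_i in the batch.

    Returns the padded (N, H, max_W) batch and the original widths, which a
    recognizer could use to ignore the padded columns.
    """
    max_w = max(im.shape[1] for im in images)
    height = images[0].shape[0]
    batch = np.full((len(images), height, max_w), pad_value, dtype=images[0].dtype)
    widths = []
    for i, im in enumerate(images):
        batch[i, :, :im.shape[1]] = im
        widths.append(im.shape[1])
    return batch, widths

# Usage: two images with different aspect ratios end up in one batch.
samples = [np.ones((64, 200)), np.ones((32, 487))]
resized = [resize_keep_ratio(s) for s in samples]
batch, widths = pad_batch(resized)
```

The difficulty raised in this issue is that a rectification module placed in front of the recognizer also has to cope with the padded, variable-width batch.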

Canjie-Luo commented 5 years ago

Thanks for your support!

Actually, I proposed MORAN to address small-range deformation of text. Since your irregular text is very long, I am afraid that text laid out in a semicircle would be too difficult to rectify. You may need a curved-text detector.

The output size of MORAN's rectification network is not fixed (unlike ASTER, which fixes the number of points). In theory, MORAN's rectification network can be trained with fixed-size input and still generalize well to text of variable length.
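To illustrate why an offset-based rectifier need not be tied to one input size, here is a rough sketch. It is not the MORAN implementation: it assumes a rectifier that predicts a coarse map of (dx, dy) offsets in normalized [-1, 1] coordinates, and the function name `rectify_with_offsets`, the offset convention, and the nearest-neighbour sampling are all simplifications chosen for the example.

```python
import numpy as np

def rectify_with_offsets(img, coarse_offsets):
    """Warp a (H, W) image with a coarse (h, w, 2) map of (dx, dy) offsets.

    The coarse map is upsampled to the image size, added to a regular
    sampling grid, and the image is sampled at the shifted positions.
    """
    H, W = img.shape
    h, w, _ = coarse_offsets.shape
    # Upsample the coarse offsets to (H, W, 2) by nearest-neighbour lookup.
    rows = np.arange(H) * h // H
    cols = np.arange(W) * w // W
    offsets = coarse_offsets[rows[:, None], cols[None, :]]
    # Regular grid in normalized coordinates; it adapts to any H and W.
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    sample_x = np.clip(xs + offsets[..., 0], -1, 1)
    sample_y = np.clip(ys + offsets[..., 1], -1, 1)
    # Convert normalized coordinates back to pixel indices.
    px = np.rint((sample_x + 1) / 2 * (W - 1)).astype(int)
    py = np.rint((sample_y + 1) / 2 * (H - 1)).astype(int)
    return img[py, px]

# Usage: the same coarse offset map is applied to a 32x100 and a 32x487 image.
zero_offsets = np.zeros((3, 11, 2))
short_img = np.arange(32 * 100).reshape(32, 100)
long_img = np.arange(32 * 487).reshape(32, 487)
short_out = rectify_with_offsets(short_img, zero_offsets)
long_out = rectify_with_offsets(long_img, zero_offsets)
```

Because both the grid and the upsampled offsets follow the input shape, the output always has the same size as the input. Whether offsets learned on 32x100 crops remain meaningful on much wider images is a separate, empirical question.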

For long text, a CRNN-based recognizer trained with CTC loss usually performs better. (Several papers report that attention mechanisms perform well only on short text.)
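As background on the CTC side, here is a minimal sketch of greedy CTC decoding. It is a generic illustration, not code from CRNN or MORAN: the function name `ctc_greedy_decode`, the choice of index 0 as the blank label, and the toy alphabet are assumptions. A CTC model emits one label per feature-map column, so the number of output frames grows with the image width, and decoding simply collapses repeats and drops blanks.

```python
def ctc_greedy_decode(frame_ids, alphabet, blank=0):
    """Collapse per-frame argmax labels into a string.

    frame_ids: one label index per time step (per feature-map column).
    alphabet:  characters for indices 1..len(alphabet); index `blank` is the CTC blank.
    """
    chars = []
    prev = None
    for idx in frame_ids:
        # Emit a character only when the label changes and is not the blank.
        if idx != blank and idx != prev:
            chars.append(alphabet[idx - 1])
        prev = idx
    return "".join(chars)

# Usage: repeated labels merge unless a blank separates them.
decoded = ctc_greedy_decode([1, 1, 0, 1, 2, 2, 0], "ab")
```

Here the blank between the two runs of label 1 is what allows a doubled character to survive the collapse.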

liangzimei commented 5 years ago

Thanks for your reply. I will try a curved-text detector and use its outline to rectify the text.

liangzimei commented 4 years ago

@Canjie-Luo Hello, sorry to bother you. Could you share links to the papers reporting that attention mechanisms perform well only on short text? Thanks.

Canjie-Luo commented 4 years ago

[ICDAR 2019] A Comparative Study of Attention-based Encoder-Decoder Approaches to Natural Scene Text Recognition.pdf

jake221 commented 4 years ago

Quoting Canjie-Luo: "Theoretically, the rectification network of MORAN is able to be trained with fix input, and generalize well on the text with variable length."

Thanks for your great work and for sharing it. I am trying to use your code to recognize images of variable width, such as this picture of size 32x487: [image]

I modified demo.py (using your pretrained model) as follows:

    if torch.cuda.is_available():
        cuda_flag = True
        MORAN = MORAN(1, len(alphabet.split(':')), 256, 32, 800, BidirDecoder=True, CUDA=cuda_flag)
        MORAN = MORAN.cuda()
    else:
        MORAN = MORAN(1, len(alphabet.split(':')), 256, 32, 800, BidirDecoder=True, inputDataType='torch.FloatTensor', CUDA=cuda_flag)

    # resize image
    image = Image.open(img_path).convert('L')
    scale = image.size[1] * 1.0 / 32
    w = int(image.size[0] / scale)

    converter = utils.strLabelConverterForAttention(alphabet, ':')
    transformer = dataset.resizeNormalize((w, 32))
    image = transformer(image)

    if cuda_flag:
        image = image.cuda()
    image = image.view(1, *image.size())
    image = Variable(image)
    text = torch.LongTensor(1 * 50)
    length = torch.IntTensor(1 * 5)
    text = Variable(text)
    length = Variable(length)

    max_iter = 100
    t, l = converter.encode('0'*max_iter)
    utils.loadData(text, t)
    utils.loadData(length, l)
    output = MORAN(image, length, text, text, test=True, debug=True)

However, I still get poor output, such as:

Left to Right: ronaltherlyth

Could you tell me what I am doing wrong?

Canjie-Luo commented 4 years ago

Can you share the rectified image?