PaddlePaddle / PaddleOCR

Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
https://paddlepaddle.github.io/PaddleOCR/
Apache License 2.0

Bug when doing CTC/SAR MultiLabel encode with Arabic #10838

Open Hegelim opened 1 year ago

Hegelim commented 1 year ago

Problem

First off, could someone please reopen issue #10806? I accidentally closed it, then the bot closed it, and I don't seem to be able to reopen it myself. This issue is directly related to the problem I described in #10806.

When you iterate over Arabic text character by character, for example with a for loop, Python visits the characters in their stored order, which for Arabic corresponds to right to left on screen. So if you want to train an Arabic recognition model and your ground-truth labels are written left to right, as an English speaker would write them, you need to be very careful when reading the characters with a for loop.

Whenever "MultiLabelEncode" appears in the config.yaml file used for training, the code at https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/ppocr/data/imaug/label_ops.py#L153 runs into this problem.
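To make the iteration order concrete, here is a small standard-library sketch. It is not taken from PaddleOCR; the sample word and variable names are illustrative assumptions. It shows that a for loop yields the stored first character, which a renderer draws at the right edge of an Arabic word.

```python
import unicodedata

# "marhaba" (hello), written with escapes so the stored order is unambiguous.
text = "\u0645\u0631\u062d\u0628\u0627"

# A for loop visits code points in stored (logical) order.
logical_names = [unicodedata.name(ch) for ch in text]

# Every character here has bidi class "AL" (Arabic letter), so a renderer
# lays the run out right to left: the first stored character is drawn rightmost.
bidi_classes = {unicodedata.bidirectional(ch) for ch in text}

print(logical_names[0])  # the character that appears rightmost on screen
print(bidi_classes)
```

If the labels were typed so that they look correct when read left to right, the loop order and the intended order will disagree, which is the mismatch described above.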

Fix

One way to fix this is to use the python-bidi package:

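from bidi.algorithm import get_display
# text is the label string; iterate over it in visual left-to-right order
for char in get_display(text, base_dir="L"):
    ...  # existing per-character encoding goes here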

The base_dir="L" argument is essential: with it, the loop iterates over the characters in left-to-right order.
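As a self-contained sketch of that idea, the helper below wraps the call so it can be tried outside PaddleOCR. The function name visual_ltr_chars and the sample label are hypothetical. The fallback branch is only a crude stand-in for when python-bidi is not installed, and it is valid only for a single run of Arabic letters without diacritics, digits, or Latin text. Whether this ordering is the right one for a given dataset depends on how its labels were written.

```python
try:
    from bidi.algorithm import get_display  # pip install python-bidi
except ImportError:
    get_display = None


def visual_ltr_chars(text):
    """Return the characters of text in visual left-to-right order."""
    if get_display is not None:
        # base_dir="L" forces a left-to-right paragraph direction.
        return list(get_display(text, base_dir="L"))
    # Assumption: pure run of RTL letters, so visual order is the reverse.
    return list(reversed(text))


label = "\u0645\u0631\u062d\u0628\u0627"  # stored (logical) order
chars = visual_ltr_chars(label)

for ch in chars:
    print(f"U+{ord(ch):04X}")
```

For mixed-direction labels (Arabic with numbers or Latin words), only the python-bidi branch gives a meaningful result, since simple reversal would also flip the left-to-right parts.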

connorourke commented 4 months ago

Any update on this and whether the fix is correct @Hegelim?

Struggling to get sensible recognition from Paddle for Arabic.