Awesome multilingual OCR toolkits based on PaddlePaddle (practical ultra lightweight OCR system, support 80+ languages recognition, provide data annotation and synthesis tools, support training and deployment among server, mobile, embedded and IoT devices)
Problem
First off, could someone please help reopen issue #10806? I accidentally closed it, then the bot closed it, and there seems to be no way for me to reopen it myself.
This issue is directly related to the problem I described in #10806. When you read Arabic text character by character, for example with a for loop, Python walks the string in its logical (storage) order, which for Arabic runs from right to left on the rendered line. So if you want to train an Arabic recognition model and your ground-truth labels are meant to be read left to right, as an English speaker would write them, you need to be very careful when using a for loop to read the characters.
So whenever "MultiLabelEncode" appears in the config.yaml file used for training, the code at https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.7/ppocr/data/imaug/label_ops.py#L153 causes a problem in this scenario.
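To make the ordering issue concrete, here is a small sketch that uses only the standard library. The sample word and the variable names are my own choices for illustration and are not taken from PaddleOCR. It shows that a for loop visits an Arabic string in logical order, and that for a run made up only of Arabic letters the left-to-right visual order is that order reversed.

```python
import unicodedata

# The word "سلام" written with escapes so the order in the source is unambiguous:
# SEEN, LAM, ALEF, MEEM in logical (storage) order.
text = "\u0633\u0644\u0627\u0645"

# A for loop walks the string in logical order, first code point first.
logical = [unicodedata.name(ch) for ch in text]

# Every character here has the bidi class "AL" (Arabic letter),
# so a renderer lays this run out from right to left.
bidi_classes = {unicodedata.bidirectional(ch) for ch in text}

# For a run made up only of such letters, reading the rendered line
# from left to right visits the characters in reversed logical order.
visual_ltr = [unicodedata.name(ch) for ch in reversed(text)]

print(logical)
print(visual_ltr)
```

The first character the loop sees (SEEN) is the one drawn at the far right of the rendered word, which is why a label built by plain iteration may not match a left-to-right reading of the image.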
Fix
One way to fix this is to use the python-bidi package:
from bidi.algorithm import get_display
for char in get_display(text, base_dir="L"):
The argument base_dir="L" is very important here: it sets the base direction to left-to-right, so the loop iterates through the characters in left-to-right order.
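As a rough sketch of how the fix could sit inside a label encoder: the names `encode_label`, `char_to_id`, and `to_visual` below are hypothetical and are not the actual PaddleOCR code. The only real library call is `get_display` from `bidi.algorithm`, used as shown above. Skipping characters that are missing from the dictionary is an assumption of this sketch.

```python
def encode_label(text, char_to_id, to_visual=None):
    """Map a label to dictionary indices, iterating in left-to-right visual order.

    `to_visual` is a hypothetical hook: any callable that reorders a string.
    When omitted, python-bidi's get_display with base_dir="L" is used.
    """
    if to_visual is None:
        # Imported lazily so the function can be tried without python-bidi installed.
        from bidi.algorithm import get_display

        def to_visual(s):
            return get_display(s, base_dir="L")

    # Assumption of this sketch: characters not in the dictionary are skipped.
    return [char_to_id[ch] for ch in to_visual(text) if ch in char_to_id]
```

Calling `encode_label(label, char_to_id)` relies on python-bidi for the reordering. Passing a different `to_visual` callable makes it easy to compare the encoded output with and without reordering on a few sample labels.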