QuantumLiu commented 3 years ago

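I used the provided pretrained model to recognize Arabic, and almost no word was recognized correctly. Arabic and Uyghur are both written from right to left, and their letters are joined together. Was this not taken into account during training? Predicted image: a billboard. Results: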

[2021/03/17 04:08:00] root INFO: داددتلا, 0.884
[2021/03/17 04:08:00] root INFO: جذوهن, 0.915
[2021/03/17 04:08:00] root INFO: يذاكسلا, 0.858
[2021/03/17 04:08:00] root INFO: ءلااوددتسا, 0.876
[2021/03/17 04:08:00] root INFO: ةبحصلا, 0.979
[2021/03/17 04:08:00] root INFO: ةباعرلا, 0.965
[2021/03/17 04:08:00] root INFO: جمارب, 0.900
[2021/03/17 04:08:00] root INFO: حمذاء, 0.849
[2021/03/17 04:08:00] root INFO: اومسراو, 0.968
[2021/03/17 04:08:00] root INFO: لجأر, 0.738
[2021/03/17 04:08:00] root INFO: عتجذا, 0.746
[2021/03/17 04:08:00] root INFO: نه, 0.883
[2021/03/17 04:08:00] root INFO: يبردلا, 0.969
[2021/03/17 04:08:00] root INFO: إياء, 0.832
[2021/03/17 04:08:00] root INFO: شكأاوذرءا, 0.877
[2021/03/17 04:08:00] root INFO: مكردراب, 0.840
[2021/03/17 04:08:00] root INFO: اوكفك, 0.886
[2021/03/17 04:08:00] root INFO: المساووايااهو, 0.625
[2021/03/17 04:08:00] root INFO: احمعنى, 0.563
[2021/03/17 04:08:00] root INFO: مكابقتسم, 0.938
[2021/03/17 04:08:00] root INFO: حمدام, 0.969
[2021/03/17 04:08:00] root INFO: 2020CENSUS GOV ar, 0.931
[2021/03/17 04:08:00] root INFO: حمحه, 0.600
[2021/03/17 04:08:00] root INFO: مانهه, 0.765
[2021/03/17 04:08:00] root INFO: نم, 0.968
[2021/03/17 04:08:00] root INFO: وقدبا, 0.864
[2021/03/17 04:08:00] root INFO: لره, 0.644
[2021/03/17 04:08:00] root INFO: اتمعفاا, 0.538
[2021/03/17 04:08:00] root INFO: تققر, 0.706
[2021/03/17 04:08:00] root INFO: وهزهوع, 0.712

$ python3 tools/infer/predict_system.py --image_dir="ar1.jpeg" --det_model_dir="./inference/ch_ppocr_server_v2.0_det_infer/" --rec_model_dir="./inference/ar_mobile_v2.0_rec_infer" --cls_model_dir="./inference/ch_ppocr_mobile_v2.0_cls_infer/" --use_angle_cls=True --use_space_char=True --use_gpu=False --rec_char_type="ar" --rec_char_dict_path="ppocr/utils/dict/ar_dict.txt"


tink2123 commented 3 years ago

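Arabic and Uyghur are both written from right to left, and their letters are joined together. Was this not taken into account during training?

We are not very familiar with the rules of Arabic writing, so we did not account for this during training. Other users have raised related issues, and we are working on improving this model. We expect to release an update in April. Apart from writing direction, is there anything else we should pay attention to?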

QuantumLiu commented 3 years ago

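Arabic and Uyghur are both written from right to left, and their letters are joined together. Was this not taken into account during training?

We are not very familiar with the rules of Arabic writing, so we did not account for this during training. Other users have raised related issues, and we are working on improving this model. We expect to release an update in April. Apart from writing direction, is there anything else we should pay attention to?

I look forward to seeing your new model. Apart from writing direction, each letter has 4 to 6 written forms, depending on its position in the joined sequence and on the preceding letter. I am not sure whether your annotation and data-generation pipeline handles this correctly. Also, when non-Arabic and Arabic characters are mixed, computer typesetting tends to get messy. Take this sentence: الإقتصادية الرئيسية الخميس 11 أبريل 2019 (Chief Economist, Thursday, April 11, 2019). Arabic characters are read from right to left, and in string order the characters أبريل come before 2019, but the non-Arabic string 2019 is itself read from left to right. So if the model recognizes text from right to left, it is likely to learn the features of a mirrored 2019 and recognize the mirror image of 9 as 2.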
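The two points above (contextual letter forms and mixed-direction text) can be illustrated with a minimal sketch that uses only Python's standard `unicodedata` and `re` modules. It is not PaddleOCR's label pipeline. The helper names (`contextual_form_names`, `to_base_letters`, `bidi_classes`, `naive_visual_order`) are hypothetical, and `naive_visual_order` is a deliberately rough approximation of display order, not the full Unicode bidirectional algorithm.

```python
import re
import unicodedata

# The four presentation forms of ARABIC LETTER BEH (U+0628), taken from the
# "Arabic Presentation Forms-B" Unicode block: isolated, final, initial, medial.
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def contextual_form_names(forms):
    """Return the Unicode names of the given presentation-form characters."""
    return [unicodedata.name(ch) for ch in forms]


def to_base_letters(text):
    """Fold presentation forms back to base letters via NFKC normalization.

    Assumption: a recognizer's dictionary uses one label per base letter,
    so every contextual shape should map to the same label.
    """
    return unicodedata.normalize("NFKC", text)


def bidi_classes(text):
    """Return the Unicode bidirectional class of each character."""
    return [unicodedata.bidirectional(ch) for ch in text]


def naive_visual_order(logical):
    """Roughly approximate left-to-right display order of right-to-left text.

    The whole string is reversed, then each ASCII digit run is reversed back,
    because numbers stay left-to-right inside right-to-left text. This is a
    simplification for illustration only.
    """
    reversed_text = logical[::-1]
    return re.sub(r"[0-9]+", lambda m: m.group(0)[::-1], reversed_text)


if __name__ == "__main__":
    for name in contextual_form_names(BEH_FORMS):
        print(name)
    # Arabic letter, space, European digit
    print(bidi_classes("\u0628 2"))
    # Escaped code points are used so the source itself is not reordered
    # by a bidi-aware editor.
    print(naive_visual_order("\u0623\u0628 2019").encode("unicode_escape"))
```

Under these assumptions, a model that scans a line strictly from right to left meets the digits of 2019 in the order 9, 1, 0, 2, so digit runs would need to be reversed back before comparing against labels stored in logical order.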

tink2123 commented 3 years ago

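Thank you very much for the suggestions. Based on the points you raised above, we have released an updated Arabic model. Please try it out and send us your feedback: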

pip install paddleocr==2.0.6
paddleocr --image_dir {your/img/path} --lang=ar

paddle-bot-old[bot] commented 2 years ago

Since you haven't replied for more than 3 months, we have closed this issue/PR. If the problem is not solved or there is a follow-up one, please reopen it at any time and we will continue to follow up. It is recommended to pull and try the latest code first.

wuxuedaifu commented 6 months ago

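Thank you very much for the suggestions. Based on the points you raised above, we have released an updated Arabic model. Please try it out and send us your feedback: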
pip install paddleocr==2.0.6
paddleocr --image_dir {your/img/path} --lang=ar

Could you share the Arabic training dataset?

PaddlePaddle / PaddleOCR

Arabic recognition results are very poor #2270

2260 https://github.com/PaddlePaddle/PaddleOCR/issues/1879