Closed Hegelim closed 6 months ago
Your data is not enough to fine-tune and get good results on an RTL language. The default Paddle model for Arabic is not perfect, so you need much more data: more fonts, more symbols (the Paddle default is not good on symbols), more numbers, and more real data. Also try to optimize the hyperparameters of your training process.
If you are a native Arabic, Persian, or Urdu speaker, you know these are RTL languages. These recognition models predict LTR, so the real labels must be reversed in the training phase. Example: wrong label: مسعود | correct label: دوعسم
Optimize the learning rate. Use a much lower lr as the number of epochs increases.
The rec model (SVTR) is good enough for our use case.
Do not touch the BGR setting.
Also, some characters are missing from arabic_dict, like ) * etc. Fix it yourself to prevent wrong gradients.
At inference time: the reversing in the PaddleOCR main code is also wrong. Try to edit it and use bidi-reshaper.
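As a rough illustration of the label reversal described above, here is a minimal sketch, not PaddleOCR's own code. It assumes label files use one "image_path<TAB>label" entry per line, and it reuses the character class from the pred_reverse snippet quoted later in this thread to keep Latin letters and digits in reading order. The function names are hypothetical.

```python
import re

# Characters treated as left-to-right (Latin letters, digits, a few symbols).
# Assumption: same character class as the pred_reverse snippet in this thread.
LTR_CHARS = re.compile(r"[a-zA-Z0-9 :*./%+-]")


def reverse_rtl_label(label: str) -> str:
    """Reverse an RTL label while keeping LTR runs (e.g. numbers) intact."""
    pieces = []
    run = ""
    for ch in label:
        if LTR_CHARS.match(ch):
            run += ch  # extend the current LTR run
        else:
            if run:
                pieces.append(run)
                run = ""
            pieces.append(ch)  # each RTL character is its own piece
    if run:
        pieces.append(run)
    return "".join(reversed(pieces))


def reverse_label_line(line: str) -> str:
    """Reverse the label part of one 'image_path<TAB>label' line (assumed format)."""
    path, label = line.rstrip("\n").split("\t", 1)
    return f"{path}\t{reverse_rtl_label(label)}"
```

Applied to the example above, reverse_rtl_label turns مسعود into دوعسم, while a purely Latin or numeric label is returned unchanged.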
Thanks for your reply.
Btw, I read #7623 and it looks like the direction issue is fixed, so should I still worry about it?
@andyjpaddle could you please help? I need to confirm with developers who designed the arabic recognition model. For the model training, are the texts parsed from left to right? Do I need to worry about the ordering of it?
Thanks for your reply.
- How much data do I need generally? Is 100k enough? or 200k?
- How would I reverse the label during training exactly? Which file should I modify and do I also need to use bidi-reshaper?
- for lr, currently I have set it to 1e-5 after 100 epochs, is this low enough?
- I am a bit confused about the last part you mentioned, which part is wrong and what is bidi-reshaper?
Hi, sorry for the late response.
This is my new arabic_dict.txt file:
!
#
$
%
&
'
(
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
?
@
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
_
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
É
é
ء
آ
أ
ؤ
إ
ئ
ا
ب
ة
ت
ث
ج
ح
خ
د
ذ
ر
ز
س
ش
ص
ض
ط
ظ
ع
غ
ف
ق
ك
ل
م
ن
ه
و
ى
ي
ً
ٌ
ٍ
َ
ُ
ِ
ّ
ْ
ٓ
ٔ
ٰ
ٱ
ٹ
پ
چ
ڈ
ڑ
ژ
ک
ڭ
گ
ں
ھ
ۀ
ہ
ۂ
ۃ
ۆ
ۇ
ۈ
ۋ
ی
ې
ے
ۓ
ە
١
٢
٣
٤
٥
٦
٧
٨
٩
)
*
{
}
»
«
؛
،
|
٠
؟
=
;
<
>
[
]
~
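To check whether a dictionary like the one above covers every character in the training labels, a small sketch such as the following may help. It assumes the dictionary file holds one character per line, as in the listing, and that spaces are handled separately by the training config, so they are ignored by default. The helper names are hypothetical.

```python
def load_dict(path):
    """Read a dictionary file with one character per line (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.rstrip("\n")]


def missing_characters(labels, dict_chars, ignore=" "):
    """Return the sorted characters used in labels that the dictionary lacks."""
    known = set(dict_chars) | set(ignore)
    used = set()
    for label in labels:
        used.update(label)  # add every character of the label
    return sorted(used - known)
```

For example, missing_characters(["a)b", "c*"], ["a", "b", "c"]) reports that ) and * are absent, which is the kind of gap mentioned earlier in the thread.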
Thank you for sharing! I looked into your PR. It looks like PaddleOCR recognizes text letter by letter, and because the letters are not joined in cursive form, the output looks wrong. However, if you copy and paste the result into any text file, it automatically takes the right cursive form. I am not sure whether it is necessary to use bidi-reshaper in this case.
So I just ran an experiment using this image, which means hello in Arabic. I modified the code to be:
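import re
from bidi.algorithm import get_display  # assumed to come from the python-bidi package

def pred_reverse(self, pred):
    # Split the prediction into single RTL characters and runs of LTR characters.
    pred_re = []
    c_current = ''
    for c in pred:
        if not bool(re.search('[a-zA-Z0-9 :*./%+-]', c)):
            # Non-LTR character: flush any pending LTR run, then add the character.
            if c_current != '':
                pred_re.append(c_current)
            pred_re.append(c)
            c_current = ''
        else:
            c_current += c
    if c_current != '':
        pred_re.append(c_current)
    # Debug output: compare the built-in reversal with python-bidi's result.
    print(f"after: {''.join(pred_re[::-1])}")
    print(f"use bidi: {get_display(pred)}")
    return ''.join(pred_re[::-1])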
The two outputs look exactly the same on the terminal. I can't copy and paste them here because, if I do, both of them will appear in the correct, cursive format in this text editor.
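When two strings look identical on screen, comparing their code points in logical (stored) order avoids relying on how the terminal or editor renders them. The sketch below uses only the standard library; describe is a hypothetical helper name.

```python
import unicodedata


def describe(text):
    """List each character as (code point, Unicode name) in stored order."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]
```

Printing describe(a) and describe(b) for the two outputs shows whether they really contain the same characters in the same order, regardless of how the display shapes or reorders them.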
Another question that I need to solve urgently - because Arabic letters change their shapes completely depending on their locations in the words, does that mean I need to include all possible different shapes in the dictionary? Or how can the model learn to recognize different shapes of the same letter?
If you need the results to display in their normal form, you can refer to https://github.com/mpcabd/python-arabic-reshaper and install the arabic_reshaper package in your Python environment. If you copy the recognition results to a text editor, the editor's plugin may automatically correct the word order. The issues shown do not affect model training.
In version 2.5, because of the special variation of Arabic characters, I used a single font to generate word data for training, and the perfect pair rate of recognizing words on PC was 95%. But even with this best result, the model does not transfer well to training on long sentences.
Did you change any code in ppocr when you trained the model? Or did you leave it as is?
I have a very similar problem but have not received any response yet: 11031
Please have a look.
Hi @masoudMZB, I am also training on an RTL language, though not Arabic. Before training, is there any Paddle code we need to fix to deal with the RTL problem?
@masoudMZB - have you managed to fine-tune the Arabic model? Can you share it if so?
This issue has not been updated for a long time. This issue is temporarily closed and can be reopened if necessary.
No, that Arabic dict is not accurate. The Arabic language is much more complicated: when a letter is at a different position, its shape and meaning change. I am giving you a more accurate dict.
ا ا (Isolated) ـا (Medial) ـا (Final)
ب بـ (Initial) ـبـ (Medial) ـب (Final)
ت تـ (Initial) ـتـ (Medial) ـت (Final)
ث ثـ (Initial) ـثـ (Medial) ـث (Final)
ج جـ (Initial) ـجـ (Medial) ـج (Final)
ح حـ (Initial) ـحـ (Medial) ـح (Final)
خ خـ (Initial) ـخـ (Medial) ـخ (Final)
د ـد (Medial) ـد (Final)
ذ ـذ (Medial) ـذ (Final)
ر ـر (Medial) ـر (Final)
ز ـز (Medial) ـز (Final)
س سـ (Initial) ـسـ (Medial) ـس (Final)
ش شـ (Initial) ـشـ (Medial) ـش (Final)
ص صـ (Initial) ـصـ (Medial) ـص (Final)
ض ضـ (Initial) ـضـ (Medial) ـض (Final)
ط طـ (Initial) ـطـ (Medial) ـط (Final)
ظ ظـ (Initial) ـظـ (Medial) ـظ (Final)
ع عـ (Initial) ـعـ (Medial) ـع (Final)
غ غـ (Initial) ـغـ (Medial) ـغ (Final)
ف فـ (Initial) ـفـ (Medial) ـف (Final)
ق قـ (Initial) ـقـ (Medial) ـق (Final)
ك كـ (Initial) ـكـ (Medial) ـك (Final)
ل لـ (Initial) ـلـ (Medial) ـل (Final)
م مـ (Initial) ـمـ (Medial) ـم (Final)
ن نـ (Initial) ـنـ (Medial) ـن (Final)
ه هـ (Initial) ـهـ (Medial) ـه (Final)
و ـو (Medial) ـو (Final)
ي يـ (Initial) ـيـ (Medial) ـي (Final)
ة ـة (Final)
ى ـى (Final)
أ ـأ (Medial) ـأ (Final)
إ ـإ (Medial) ـإ (Final)
ؤ ـؤ (Medial) ـؤ (Final)
ئ ئـ (Initial) ـئـ (Medial) ـئ (Final)
٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹
، ؛ ؟ « » ٪ ÷ ۞ ؆ ؇ ؈
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
a b c d e f g h i j k l m n o p q r s t u v w x y z
0 1 2 3 4 5 6 7 8 9
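One way to check whether positional shapes need their own dictionary entries is to look at how the labels are encoded. In Unicode, each Arabic letter normally has a single code point and the renderer chooses the isolated, initial, medial, or final shape. Separate presentation-form code points (for example U+FE8F to U+FE92 for the letter beh) exist mainly for compatibility. The sketch below, which assumes labels might contain such presentation forms or the tatweel joiner ـ (U+0640), maps them back to base letters using the standard library. to_base_letters is a hypothetical helper name.

```python
import unicodedata

TATWEEL = "\u0640"  # the elongation character "ـ" used in the list above


def to_base_letters(text):
    """Map Arabic presentation forms to base letters and drop tatweel."""
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.replace(TATWEEL, "")
```

With this, the isolated, initial, medial, and final forms of beh all map to the single letter ب, so a dictionary of base letters would cover them if the labels are stored that way.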
Hey, I have a question: for Arabic, is the model trained to recognize individual letters or whole words, since it is a cursive language? In English, for the word "ram", it will recognize r, a, m separately. Does it work the same way for Arabic?
Issue
I carefully followed the tutorials (such as this) and fine-tuned the arabic_PP-OCRv3_rec model here for Arabic text recognition. In terms of data, I used 50k synthetically generated Arabic samples, formatted in the ppocr data format (shown here). After I fine-tuned for 300 epochs, the accuracy started at 0 and ended at 81%, which is pretty mediocre.
yml file
Concerns/Questions
img_mode is "BGR" by default. Should I be concerned about this? Should I change it to "RGB"?
System Environment:
Ubuntu 18.04
Version: Paddle: PaddleOCR:
Related components:
paddlepaddle-gpu: 2.4.2 paddleocr: 2.6.1.0
Command Code:
./arabic-recognition-training is the dir that I created myself.