Hegelim commented 1 year ago

Issue

I closely followed the tutorials (such as this) and fine-tuned the arabic_PP-OCRv3_rec model here for Arabic text recognition. For data, I used 50k synthetically generated Arabic samples, formatted in the ppocr data format (shown here). After fine-tuning for 300 epochs, the accuracy started at 0 and ended at 81%, which is pretty mediocre.

yml file

Global:
  debug: false
  use_gpu: true
  epoch_num: 300
  log_smooth_window: 20
  print_batch_step: 10
  save_model_dir: ./arabic-recog-model
  save_epoch_step: 2
  eval_batch_step: [0, 2000]
  cal_metric_during_train: true
  pretrained_model: ./arabic-recog-model/latest.pdparams
  checkpoints: 
  save_inference_dir:
  use_visualdl: false
  infer_img: 
  character_dict_path: ./arabic-recognition-training/arabic_dict.txt
  max_text_length: &max_text_length 25
  infer_mode: false
  use_space_char: true
  distributed: true
  save_res_path: ./arabic-recog-model/rec/predicts_ppocrv3_arabic.txt

Optimizer:
  name: Adam
  beta1: 0.9
  beta2: 0.999
  lr:
    name: Piecewise
    decay_epochs: [50, 100]
    values: [0.0001, 0.00005, 0.00001]
    warmup_epoch: 5
  regularizer:
    name: L2
    factor: 3.0e-05

Architecture:
  model_type: rec
  algorithm: SVTR
  Transform:
  Backbone:
    name: MobileNetV1Enhance
    scale: 0.5
    last_conv_stride: [1, 2]
    last_pool_type: avg
  Head:
    name: MultiHead
    head_list:
      - CTCHead:
          Neck:
            name: svtr
            dims: 64
            depth: 2
            hidden_dims: 120
            use_guide: True
          Head:
            fc_decay: 0.00001
      - SARHead:
          enc_dim: 512
          max_text_length: *max_text_length

Loss:
  name: MultiLoss
  loss_config_list:
    - CTCLoss:
    - SARLoss:

PostProcess:  
  name: CTCLabelDecode

Metric:
  name: RecMetric
  main_indicator: acc
  ignore_space: False

Train:
  dataset:
    name: SimpleDataSet
    data_dir: ./arabic-training-50k
    ext_op_transform_idx: 1
    label_file_list:
    - ./arabic-training-50k/gt_train.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - RecConAug:
        prob: 0
        ext_data_num: 2
        image_shape: [48, 320, 3]
    - RecAug:
        prob: 0
    - MultiLabelEncode:
    - RecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_sar
        - length
        - valid_ratio
  loader:
    shuffle: true
    batch_size_per_card: 64
    drop_last: true
    num_workers: 0
Eval:
  dataset:
    name: SimpleDataSet
    data_dir: ./arabic-training-50k
    label_file_list:
    - ./arabic-training-50k/gt_test.txt
    transforms:
    - DecodeImage:
        img_mode: BGR
        channel_first: false
    - MultiLabelEncode:
    - RecResizeImg:
        image_shape: [3, 48, 320]
    - KeepKeys:
        keep_keys:
        - image
        - label_ctc
        - label_sar
        - length
        - valid_ratio
  loader:
    shuffle: false
    drop_last: false
    batch_size_per_card: 64
    num_workers: 0

Concerns/Questions

Any recommendations to further boost the accuracy to over 95%? Do I need more data? More epochs? A different model?
I realized my base model arabic_PP-OCRv3_rec is only 9.6M, and I am concerned about whether it has the capacity to recognize Arabic text at 95% accuracy. Should I choose a larger model? If so, which one?
Should I fine-tune or should I train from scratch?
In the yml file, one field is img_mode, by default it is "BGR". Should I be concerned about this? Should I change it to “RGB”?

System Environment:

Ubuntu 18.04

Version (Paddle, PaddleOCR) / Related components:

paddlepaddle-gpu: 2.4.2 paddleocr: 2.6.1.0

Command Code:

python tools/train.py -c ./arabic-recognition-training/arabic_PP-OCRv3_rec.yml

./arabic-recognition-training is the dir that I created myself.

masoudMZB commented 1 year ago

Your data is not enough to fine-tune and ace an RTL language. The default Paddle model for Arabic is not perfect, so you need much more data: more fonts, more symbols (the default Paddle model is weak on symbols), more numbers, and more real data. Also try to optimize the hyperparameters of your training process.

If you are a native Arabic, Persian, or Urdu speaker, you know these are RTL languages. These recognition models predict LTR, so the real labels must be reversed in the training phase. Example: wrong label: مسعود | correct label: دوعسم

Optimize the learning rate. Use a much lower lr as the epoch count increases.

The rec model (SVTR) is good enough for our use case.

Do not touch BGR.

Also, some characters are missing from arabic_dict, such as ) and *. Fix it yourself to prevent wrong gradients.

For inference: the reversing in the PaddleOCR main code is also wrong. Try editing it to use bidi-reshaper.
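If training labels do need to be stored in visual (left-to-right) order, as suggested above, the reversal can be sketched with the standard library alone. This is a minimal sketch, not PaddleOCR's code: `to_visual_order` and `LTR_RUN` are hypothetical names, and it assumes that runs of Latin letters, ASCII digits and a few symbols keep their internal order while every other character is reversed one by one. It is meant for RTL-dominant lines; a line with several Latin words would have its word order flipped, and the python-bidi package mentioned in this thread is the more robust option for full Unicode bidi handling.

```python
import re

# Assumption: these characters form left-to-right runs that must not be
# reversed internally (Latin letters, ASCII digits, a few symbols).
LTR_RUN = re.compile(r"[A-Za-z0-9:*./%+-]+")


def to_visual_order(label: str) -> str:
    """Reverse a logical-order RTL label into left-to-right visual order,
    keeping each LTR run (e.g. "2023" or "OCR") readable."""
    pieces = []
    pos = 0
    for match in LTR_RUN.finditer(label):
        # Characters before the LTR run are added one by one.
        pieces.extend(label[pos:match.start()])
        # The LTR run is kept as a single, unreversed piece.
        pieces.append(match.group())
        pos = match.end()
    pieces.extend(label[pos:])
    # Reversing the list of pieces flips the line but not the LTR runs.
    return "".join(reversed(pieces))


if __name__ == "__main__":
    print(to_visual_order("مسعود"))       # the example label from the thread
    print(to_visual_order("مسعود 2023"))  # the digits keep their order
```

Applied to the example above, the function turns the logical-order label into the reversed form given as the "correct label".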

Hegelim commented 1 year ago

Thanks for your reply.

How much data do I need generally? Is 100k enough? or 200k?
How would I reverse the label during training exactly? Which file should I modify and do I also need to use bidi-reshaper?
for lr, currently I have set it to 1e-5 after 100 epochs, is this low enough?
I am a bit confused about the last part you mentioned, which part is wrong and what is bidi-reshaper?

Btw I read #7623 and looks like the direction stuff is fixed, so should I still worry about it??
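To make the learning-rate question concrete, the schedule in the yml file above (Piecewise, decay_epochs [50, 100], values [0.0001, 0.00005, 0.00001], warmup_epoch 5) can be approximated per epoch. This is a simplified sketch, not PaddleOCR's scheduler: `lr_at_epoch` is a hypothetical helper, it works at epoch granularity, and it assumes a linear warmup from 0 with decay boundaries counted from epoch 0. The real scheduler steps per iteration and may place the boundaries slightly differently.

```python
def lr_at_epoch(epoch, decay_epochs=(50, 100),
                values=(1e-4, 5e-5, 1e-5), warmup_epoch=5):
    """Approximate learning rate at a given epoch for a piecewise
    schedule with linear warmup (epoch-level simplification)."""
    if epoch < warmup_epoch:
        # Linear ramp from 0 up to the first value.
        return values[0] * epoch / warmup_epoch
    # Count how many decay boundaries have already been passed.
    stage = sum(epoch >= boundary for boundary in decay_epochs)
    return values[stage]


if __name__ == "__main__":
    for epoch in (0, 5, 49, 50, 100, 299):
        print(epoch, lr_at_epoch(epoch))
```

Under these assumptions, every epoch from 100 to 299 runs at 1e-5, so two thirds of the 300-epoch run already use the lowest value in the config.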

Hegelim commented 1 year ago

@andyjpaddle could you please help? I need to confirm with developers who designed the arabic recognition model. For the model training, are the texts parsed from left to right? Do I need to worry about the ordering of it?

masoudMZB commented 1 year ago

Hi, sorry for the late response:

1-) At least 100k is needed. For top accuracy you need more data, so try to generate as much as possible.
2-) About reversing: Persian and Arabic are RTL languages, meaning you write and read them right to left, unlike English and others. The models predict from left to right, so the output of the model is reversed. If you use bidi-reshaper, it will put the output back in the proper order.
3-) Check here: https://github.com/PaddlePaddle/PaddleOCR/blob/release/2.6/doc/doc_en/finetune_en.md#33-training-hyperparameter
4-) python-bidi is OK. The Paddle code that reverses labels for RTL languages is wrong; use https://pypi.org/project/python-bidi/
- Check here: https://github.com/PaddlePaddle/PaddleOCR/pull/10418

masoudMZB commented 1 year ago

and this is my new arabic_dict.txt file

!
#
$
%
&
'
(
+
,
-
.
/
0
1
2
3
4
5
6
7
8
9
:
?
@
A
B
C
D
E
F
G
H
I
J
K
L
M
N
O
P
Q
R
S
T
U
V
W
X
Y
Z
_
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
É
é
ء
آ
أ
ؤ
إ
ئ
ا
ب
ة
ت
ث
ج
ح
خ
د
ذ
ر
ز
س
ش
ص
ض
ط
ظ
ع
غ
ف
ق
ك
ل
م
ن
ه
و
ى
ي
ً
ٌ
ٍ
َ
ُ
ِ
ّ
ْ
ٓ
ٔ
ٰ
ٱ
ٹ
پ
چ
ڈ
ڑ
ژ
ک
ڭ
گ
ں
ھ
ۀ
ہ
ۂ
ۃ
ۆ
ۇ
ۈ
ۋ
ی
ې
ے
ۓ
ە
١
٢
٣
٤
٥
٦
٧
٨
٩
)
*
{
}
»
«
؛
،
|
٠
؟
=
;
<
>
[
]
~
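A quick way to check whether a dictionary like the one above actually covers a label file is to audit the labels before training. The sketch below is an illustration under stated assumptions: `load_dict` and `audit_labels` are hypothetical helpers, each label line is assumed to be `image_path<TAB>label`, and the defaults mirror `max_text_length: 25` and `use_space_char: true` from the yml file earlier in the thread.

```python
from collections import Counter


def load_dict(lines):
    """Read a character dictionary with one character per line."""
    return {line.rstrip("\r\n") for line in lines if line.rstrip("\r\n")}


def audit_labels(label_lines, charset, max_text_length=25, use_space_char=True):
    """Return (characters missing from the dictionary, labels that are too long).

    Each label line is assumed to look like "image_path<TAB>label".
    """
    allowed = set(charset)
    if use_space_char:
        allowed.add(" ")
    missing = Counter()
    too_long = []
    for line in label_lines:
        line = line.rstrip("\r\n")
        if "\t" not in line:
            continue  # skip malformed lines
        _, label = line.split("\t", 1)
        # Count every character the dictionary cannot represent.
        missing.update(ch for ch in label if ch not in allowed)
        if len(label) > max_text_length:
            too_long.append(label)
    return missing, too_long


if __name__ == "__main__":
    charset = load_dict(["a\n", "b\n", ")\n"])
    missing, too_long = audit_labels(["img/1.jpg\tab)*\n"], charset)
    print(missing, too_long)
```

In practice the two inputs would be opened with `encoding="utf-8"` from the paths in the config, for example `./arabic-recognition-training/arabic_dict.txt` and `./arabic-training-50k/gt_train.txt`. Any character reported as missing cannot be predicted by the model, and labels longer than `max_text_length` may be dropped or truncated during training.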

Hegelim commented 1 year ago

Thank you for sharing! I looked into your PR. It looks like PaddleOCR recognizes text letter by letter, and because the letters are not joined in cursive form, the output looks wrong. However, if you copy and paste the result into any text file, it automatically takes the right cursive form. I am not sure whether it is necessary to use bidi-reshaper in this case?

Hegelim commented 1 year ago

So I just ran an experiment using this image, which means "hello" in Arabic. I modified the code to be:

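    # Module-level imports this method relies on.
    import re
    from bidi.algorithm import get_display  # provided by python-bidi

    def pred_reverse(self, pred):
        # Split the prediction into pieces: each non-LTR character becomes
        # its own piece, while runs of Latin letters, digits and symbols are
        # kept together so that they are not reversed internally.
        pred_re = []
        c_current = ''
        for c in pred:
            if not bool(re.search('[a-zA-Z0-9 :*./%+-]', c)):
                if c_current != '':
                    pred_re.append(c_current)
                pred_re.append(c)
                c_current = ''
            else:
                c_current += c
        if c_current != '':
            pred_re.append(c_current)

        # Debug output: compare the manual reversal with python-bidi.
        reversed_pred = ''.join(pred_re[::-1])
        print(f"after: {reversed_pred}")
        print(f"use bidi: {get_display(pred)}")
        return reversed_pred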

The output on the terminal looks exactly the same. I can't copy and paste it here, because if I do, both will appear in the correct cursive form in this text editor.

Hegelim commented 1 year ago

Another question that I need to solve urgently - because Arabic letters change their shapes completely depending on their locations in the words, does that mean I need to include all possible different shapes in the dictionary? Or how can the model learn to recognize different shapes of the same letter?

ruogedol commented 1 year ago

If you need the results displayed normally, you can refer to https://github.com/mpcabd/python-arabic-reshaper and install the arabic_reshaper package in your Python environment. Alternatively, copy the recognition results into a text editor, and the editor's plugin may automatically correct the word order. The issues shown do not affect model training.

In version 2.5, because of the special variation of Arabic characters, I used a single font to generate word data for training, and the exact-match rate for recognizing words on PC was 95%. But even with this best result, the model did not transfer well to training on long sentences.
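On the question of contextual letter shapes: Unicode text normally stores each Arabic letter as one base code point, and the initial, medial and final shapes are chosen by the renderer. Separate code points for the shapes do exist (the Arabic Presentation Forms blocks), and synthetic data generators or reshaping tools can emit them. The sketch below shows one way to fold such forms back to base letters so that labels and dictionary agree. `normalize_label` is a hypothetical helper, and whether a given dataset contains presentation forms at all is an assumption to verify.

```python
import unicodedata

TATWEEL = "\u0640"  # elongation stroke, carries no letter identity


def normalize_label(text: str) -> str:
    """Map Arabic presentation forms (U+FB50-U+FDFF, U+FE70-U+FEFF)
    to their base letters and drop tatweel."""
    # NFKC applies the compatibility decompositions defined by Unicode,
    # e.g. BEH INITIAL FORM (U+FE91) becomes BEH (U+0628).
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.replace(TATWEEL, "")


if __name__ == "__main__":
    shaped = "\ufe91\ufe8e\ufe8f"  # BEH initial, ALEF final, BEH isolated
    print([hex(ord(ch)) for ch in normalize_label(shaped)])
```

Note that NFKC also rewrites unrelated compatibility characters (full-width letters, some ligatures), so it is worth checking that this is acceptable for the rest of the label text.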

Hegelim commented 1 year ago

Did you change any code in ppocr when you trained the model? Or you just leave it as is?

IbrarBabar009 commented 1 year ago

I have a very similar problem but have not received any response yet: 11031

Please have a look

Shewket commented 10 months ago

Hi @masoudMZB, I am also training on an RTL language, though not Arabic. Before training, is there any Paddle code we need to fix to deal with the RTL problem?

connorourke commented 10 months ago

@masoudMZB - have you managed to fine-tune the Arabic model? Can you share it if so?

UserWangZz commented 6 months ago

This issue has not been updated for a long time. This issue is temporarily closed and can be reopened if necessary.

omumarvaishya005 commented 3 months ago

No, that Arabic dict is not accurate. The Arabic language is much more complicated: when a letter is in a different position, its shape and meaning change. I am giving you a more accurate dict.

Arabic Letters with Variations

Alif

ا ا (Isolated) ـا (Medial) ـا (Final)

Ba

ب بـ (Initial) ـبـ (Medial) ـب (Final)

Ta

ت تـ (Initial) ـتـ (Medial) ـت (Final)

Tha

ث ثـ (Initial) ـثـ (Medial) ـث (Final)

Jeem

ج جـ (Initial) ـجـ (Medial) ـج (Final)

Ha

ح حـ (Initial) ـحـ (Medial) ـح (Final)

Kha

خ خـ (Initial) ـخـ (Medial) ـخ (Final)

Dal

د ـد (Medial) ـد (Final)

Thal

ذ ـذ (Medial) ـذ (Final)

Ra

ر ـر (Medial) ـر (Final)

Zay

ز ـز (Medial) ـز (Final)

Seen

س سـ (Initial) ـسـ (Medial) ـس (Final)

Sheen

ش شـ (Initial) ـشـ (Medial) ـش (Final)

Sad

ص صـ (Initial) ـصـ (Medial) ـص (Final)

Dad

ض ضـ (Initial) ـضـ (Medial) ـض (Final)

Ta

ط طـ (Initial) ـطـ (Medial) ـط (Final)

Tha

ظ ظـ (Initial) ـظـ (Medial) ـظ (Final)

Ain

ع عـ (Initial) ـعـ (Medial) ـع (Final)

Ghain

غ غـ (Initial) ـغـ (Medial) ـغ (Final)

Fa

ف فـ (Initial) ـفـ (Medial) ـف (Final)

Qaf

ق قـ (Initial) ـقـ (Medial) ـق (Final)

Kaf

ك كـ (Initial) ـكـ (Medial) ـك (Final)

Lam

ل لـ (Initial) ـلـ (Medial) ـل (Final)

Meem

م مـ (Initial) ـمـ (Medial) ـم (Final)

Noon

ن نـ (Initial) ـنـ (Medial) ـن (Final)

Ha

ه هـ (Initial) ـهـ (Medial) ـه (Final)

Waw

و ـو (Medial) ـو (Final)

Ya

ي يـ (Initial) ـيـ (Medial) ـي (Final)

Taa Marboota

ة ـة (Final)

Alif Maqsura

ى ـى (Final)

Alif Hamzah

أ ـأ (Medial) ـأ (Final)

Alif Hamzah Below

إ ـإ (Medial) ـإ (Final)

Waw Hamzah

ؤ ـؤ (Medial) ـؤ (Final)

Ya Hamzah

ئ ئـ (Initial) ـئـ (Medial) ـئ (Final)

Arabic Numerals

٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹

Punctuation Marks and Special Characters

، ؛ ؟ « » ٪ ÷ ۞ ؆ ؇ ؈

English Letters

Uppercase

A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Lowercase

a b c d e f g h i j k l m n o p q r s t u v w x y z

English Numerals

0 1 2 3 4 5 6 7 8 9

omumarvaishya005 commented 3 months ago

Hey, I have a question: for Arabic, is the model trained to recognize individual letters or whole words, since it is a cursive script? For example, in English the word "ram" is recognized as r, a, m separately. Does it work the same way for Arabic?
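For context on this question, the config at the top of the thread uses `CTCLabelDecode`, and CTC decoding works on characters from the dictionary, not on whole words. The following is a minimal sketch of greedy CTC decoding in general, not PaddleOCR's implementation: `ctc_greedy_decode` is a hypothetical function, and it assumes index 0 is the CTC blank so that dictionary character `i` has class index `i + 1`.

```python
BLANK = 0  # assumption: class index 0 is the CTC blank


def ctc_greedy_decode(step_indices, charset):
    """Collapse repeated indices, drop blanks, map indices to characters.

    `step_indices` holds the most likely class index at each time step.
    """
    chars = []
    previous = None
    for index in step_indices:
        # Emit a character only when it is not blank and not a repeat
        # of the immediately preceding time step.
        if index != BLANK and index != previous:
            chars.append(charset[index - 1])
        previous = index
    return "".join(chars)


if __name__ == "__main__":
    charset = ["r", "a", "m"]
    print(ctc_greedy_decode([1, 1, 0, 2, 0, 3, 3], charset))
```

The model therefore emits one character class per time step along the image width, for Arabic just as for English; the joined cursive appearance only comes back when the decoded string is rendered.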

PaddlePaddle / PaddleOCR

How to get to over 95% accuracy on fine-tuning Arabic recognition #10358
