Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
GNU General Public License v3.0

Which version of Calamari should I use to correctly recognize Arabic datasets? #230

Open Tailor2019 opened 3 years ago

Tailor2019 commented 3 years ago

Hello @ChWick! I'm using Calamari 1.0 to recognize Arabic datasets, but it doesn't give me any usable result (error ≈ 100%). Is this caused by the version I'm using, or are there parameters I need to change to get the expected result? Thanks in advance.

ChWick commented 3 years ago

Hi, can you provide some more information about the error (log output/console output)? Also include the command you used.

Tailor2019 commented 3 years ago

Hi, Thanks for your reply.

!calamari-train --files *.png --weights '/..../4.ckpt'  --output_dir '/outputfolder' --checkpoint_frequency 1

Here 4.ckpt is one of the Arabic models published with your Calamari models. (I also tried the other Arabic models, but they give the same results.)

Resolving input files
Found 6472 files in the dataset
Preloading dataset type DataSetMode.TRAIN with size 6472
Preloading data: 100% 6472/6472 [38:53<00:00,  2.77it/s]
Computing codec: 100% 6472/6472 [00:00<00:00, 101946.26it/s]
Checkpoint version 2 is up-to-date.
Codec changes: 68 deletions, 8 appends
CODEC: ['', ' ', '!', "'", '(', ')', '*', '+', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', '<', '>', 'L', '[', ']', 'a', 'e', 'o', 'r', 's', 'x', '،', '؛', '؟', 'ء', 'آ', 'أ', 'ؤ', 'إ', 'ئ', 'ا', 'ب', 'ة', 'ت', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر', 'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ـ', 'ف', 'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ى', 'ي', 'ً', 'ٌ', 'ٍ', 'َ', 'ُ', 'ِ', 'ّ', 'ْ', '%', ';', '=', '?', '\\', 'w', '×', '–']
Checkpoint version 2 is up-to-date.
2021-05-04 06:18:17.939771: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-05-04 06:18:17.957687: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2021-05-04 06:18:17.957793: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (d2cf6aaade94): /proc/driver/nvidia/version does not exist
2021-05-04 06:18:17.958555: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2021-05-04 06:18:17.998594: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2299995000 Hz
2021-05-04 06:18:17.999018: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x557caec8dc00 executing computations on platform Host. Devices:
2021-05-04 06:18:17.999173: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
Creating initial network configuration as current best.
Storing checkpoint to '/content/drive/MyDrive/finetunemodel3/best.ckpt'
2021-05-04 06:18:27.540572: W tensorflow/core/grappler/optimizers/implementation_selector.cc:310] Skipping optimization due to error while loading function libraries: Invalid argument: Functions '__inference___backward_standard_lstm_7835_8318' and '__inference___backward_standard_lstm_7835_8318_specialized_for_StatefulPartitionedCall_at___inference_distributed_function_8986' both implement 'lstm_a336ff9d-aac9-4c7a-9927-3c038ff38c4b' but their signatures do not match.
2021-05-04 06:20:11.112456: W tensorflow/core/grappler/optimizers/implementation_selector.cc:310] Skipping optimization due to error while loading function libraries: Invalid argument: Functions '__inference_standard_lstm_840_specialized_for_bidirectional_backward_lstm_StatefulPartitionedCall_at___inference_keras_scratch_graph_3706' and '__inference_standard_lstm_840' both implement 'lstm_b5cc6707-5dd7-40a5-b213-41aa3a47d5e5' but their signatures do not match.
#00000100: loss=701.42822296 ler=0.71428573 dt=1.07047738s
  PRED: '‫ةلا ف عم ووو نلا يف مل ا عم يلا ةلا‬'
  TRUE: '‫القيمة التي مع نظامى لم في اليجن ووجود معبد في الحسينة .‬'
#00000200: loss=549.87046524 ler=0.73214287 dt=0.98782739s
  PRED: '‫ب\اي ايي ل ناا ن احب ا\ب اميا ل ا ن‬'
  TRUE: '‫نسب نفس الغاز في هواء الشهيق نتيجة التفاعلات الكيميائية.‬'
#00000300: loss=304.49510586 ler=0.82142858 dt=0.96837899s
  PRED: '‪‬'
  TRUE: '‫مما نتج عنه هـ تكون هذا الوقود .‬'
#00000400: loss=226.66267292 ler=0.86607143 dt=0.93520158s
  PRED: '‪‬'
  TRUE: '‫قراءتها . واشتكى ديفيد بادييل '' وكأنها معادلة حسابية '' ،''فهم يحاولون أن‬'
#00000500: loss=207.57152979 ler=0.89285715 dt=0.94284931s
  PRED: '‪‬'
  TRUE: '‫ربما يكون مرده إلى تغيرات حدثت في أنشطة الشمس، وليس فقط فيما يسمى‬'
#00000600: loss=217.19940674 ler=0.91071429 dt=0.98446298s
  PRED: '‪‬'
  TRUE: '‫والعمل على تلافي تواحي القصور في الهياكل الاساسية . الخدمات لاساسية. وتنمية المناطق‬'
#00000700: loss=214.12136238 ler=0.92346939 dt=1.06572494s
  PRED: '‪‬'
  TRUE: '‫قديماَ جداً'' في تاريخ بلاد العرب القديمة‬'
#00000800: loss=216.17247231 ler=0.93303572 dt=1.02804190s
  PRED: '‪‬'
  TRUE: '‫الشفهي، والقراءة، والكتابة والحساب ،والإستقلال التدريجي في سلوكهم في بيئتهم المحيطة بهم‬'
#00000900: loss=223.00798111 ler=0.94047619 dt=1.06393582s
  PRED: '‪‬'
  TRUE: '‫تعد الكتله الحيويه والوقود الحيوى من مصادد الطاقه المتجدده تساهم بشكل‬'
#00001000: loss=202.56568769 ler=0.94642857 dt=0.94814075s
  PRED: '‪‬'

When I run the evaluation command !calamari-eval --gt *.gt.txt, the error is approximately 100%:

Resolving files
Loading GT: 100% 1424/1424 [05:53<00:00,  4.02it/s]
Loading Prediction: 100% 1424/1424 [00:01<00:00, 990.92it/s]
Evaluation: 100% 1424/1424 [00:00<00:00, 19612.81it/s]
Evaluation result
=================

Got mean normalized label error rate of 100.00% (87988 errs, 87991 total chars, 88020 sync errs)
GT       PRED     COUNT    PERCENT   
{ال هحور هب يجنملا لمعلا يف دهتجم هرخآلا بلاط كلذكو} {}              1      0.06%
{لب . ةقيوازتل حفصتلا هتياغ نوكت الأ اذه انباتك يف رظانلل يغبني دقو} {}              1      0.07%
{جزتمت ،لبجلا ةيدوأ يفو ،لبجلا فورج ىلع دباعملا رشتنت .ةعئارلا ءاقرزلا ءامسلاب} {}              1      0.09%
{نأ نودقتعي مهف ،رخآ يأر مهل اهءانيأ نأ ريغ ،يذوبلا وابيشت دبعم} {}              1      0.07%
{اديدحتو ،لعفلاب ةعبس زونك نم اهمسا تذخأ زونك ةعبسلا ةدلبلا هذه} {}              1      0.07%
{ةيبرعلا ىلإ اهتمجرت نكمي ىاليف لاثمت يه} {}              1      0.04%
{ةيقافتا فدهت ، رحصتلا نم ةبرتلاو يضارألا ةيامح ىلإ ةفاضإلابو} {}              1      0.07%
{ةمدقتملا نادلبلا مزتلتو . رقفلا ةبراحم ىلإ ًاضيأ رحصتلا ةحفاكم} {}              1      0.07%
{نع ةرثأتملا نادلبلا دوهج معدب ، ةيقافتإلا هذه بجومب ، ومنلا} {}              1      0.07%
{نواعتلا راطإ يف ةيفاكلا ةينقتلاو ةيلاملا ةدعاسملا اهحنم قيرط} {}              1      0.07%
The remaining but hidden errors make up 99.32%

Thanks for your help

ChWick commented 3 years ago

The evaluation shows that the prediction result is empty (GT with content, PRED empty brackets).

There might be two reasons:

It seems that the default Arabic model is already capable of making a (bad) prediction. Try calling calamari-predict with the default model and run the evaluation afterwards. This should yield an error rate that is high, but certainly below 100%.
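
To make the 100% figure concrete, here is a minimal sketch of a normalized label error rate: total edit distance divided by total ground-truth characters. It is not Calamari's implementation; the function names and the sample strings are made up for illustration. With every prediction empty, each ground-truth character counts as one deletion, so the rate is 100%.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def label_error_rate(pairs):
    """Sum of edit distances divided by the total ground-truth length."""
    errs = sum(edit_distance(gt, pred) for gt, pred in pairs)
    total = sum(len(gt) for gt, _ in pairs)
    return errs / total if total else 0.0


# Hypothetical sample lines; every prediction is empty, as in the table above.
pairs = [("abc", ""), ("hello world", "")]
print(f"{label_error_rate(pairs):.2%}")
```

Any non-empty prediction that shares at least some characters with the ground truth would push this value below 100%, which is why a completely empty PRED column points to a problem other than plain recognition quality.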

Tailor2019 commented 3 years ago

Hi @ChWick, sorry for the delayed reply. I'm using the default number of epochs (100). After training Calamari on my dataset, I use this command for prediction:

calamari-predict --checkpoint 'directory to the best model/best.ckpt' --files "*.png" --output_dir "/directory to the output folder"

This is the output:

Prediction:   0% 0/6472 [00:00<?, ?it/s]2021-05-15 16:02:09.892058: W tensorflow/core/grappler/optimizers/implementation_selector.cc:310] Skipping optimization due to error while loading function libraries: Invalid argument: Functions '__inference_cudnn_lstm_with_fallback_1540' and '__inference_standard_lstm_1427_specialized_for_bidirectional_1_forward_lstm_1_StatefulPartitionedCall_at___inference_keras_scratch_graph_2503' both implement 'lstm_4ef3c18f-c348-4e14-acac-2d9096fc2ae6' but their signatures do not match.
Prediction: 100% 6472/6472 [3:23:15<00:00,  1.88s/it]
Prediction of 1 models took 12196.787104845047s
Average sentence confidence: 79.95%
All files written

When I run this command:

calamari-eval --gt *.gt.txt  --checkpoint 'directory to the best checkpoint/best.ckpt'            

This is the output:

`Loading GT:  15% 959/6472 [03:53<21:57,  4.19it/s]',
 'Loading GT:  15% 960/6472 [03:53<20:25,  4.50it/s]',
 'Loading GT:  15% 961/6472 [03:54<18:13,  5.04it/s]',
 'Loading GT:  15% 962/6472 [03:54<19:02,  4.82it/s]',
 'Loading GT:  15% 963/6472 [03:54<20:42,  4.43it/s]',
 'Loading GT:  15% 964/6472 [03:54<20:44,  4.43it/s]',
 'Loading GT:  15% 965/6472 [03:55<22:23,  4.10it/s]',
 'Loading GT:  15% 966/6472 [03:55<22:47,  4.03it/s]',
 'Loading GT:  15% 967/6472 [03:55<21:50,  4.20it/s]',
 'Loading GT:  15% 968/6472 [03:55<23:11,  3.96it/s]',
 'Loading GT:  15% 969/6472 [03:56<21:20,  4.30it/s]',
 'Loading GT:  15% 970/6472 [03:56<21:53,  4.19it/s]',
 'Loading GT:  15% 971/6472 [03:56<22:24,  4.09it/s]',
 'Loading GT:  15% 972/6472 [03:56<22:31,  4.07it/s]',
 'Loading GT:  15% 973/6472 [03:57<22:19,  4.10it/s]',
 'Loading GT:  15% 974/6472 [03:57<23:58,  3.82it/s]',
 'Loading GT:  15% 975/6472 [03:57<23:52,  3.84it/s]',
 'Loading GT:  15% 976/6472 [03:57<23:11,  3.95it/s]',
 'Loading GT:  15% 977/6472 [03:58<21:45,  4.21it/s]',
 'Loading GT:  15% 978/6472 [03:58<22:13,  4.12it/s]',
 'Loading GT:  15% 979/6472 [03:58<21:43,  4.21it/s]',
 'Loading GT:  15% 980/6472 [03:58<21:30,  4.26it/s]',
 'Loading GT:  15% 981/6472 [03:59<21:25,  4.27it/s]',
 'Loading GT:  15% 982/6472 [03:59<20:35,  4.44it/s]',
 'Loading GT:  15% 983/6472 [03:59<23:17,  3.93it/s]',
 'Loading GT:  15% 984/6472 [03:59<20:37,  4.43it/s]',
 'Loading GT:  15% 985/6472 [03:59<21:22,  4.28it/s]',
 'Loading GT:  15% 986/6472 [04:00<22:24,  4.08it/s]',
 'Loading GT:  15% 987/6472 [04:00<20:23,  4.48it/s]',
 'Loading GT:  15% 988/6472 [04:00<18:55,  4.83it/s]',
 'Loading GT:  15% 989/6472 [04:00<18:31,  4.93it/s]',
 'Loading GT:  15% 990/6472 [04:00<17:46,  5.14it/s]',
 'Loading GT:  15% 991/6472 [04:01<19:53,  4.59it/s]',
 'Loading GT:  15% 992/6472 [04:01<20:17,  4.50it/s]',
 'Loading GT:  15% 993/6472 [04:01<20:53,  4.37it/s]',
 'Loading GT:  15% 994/6472 [04:01<21:36,  4.22it/s]',
 'Loading GT:  15% 995/6472 [04:02<22:58,  3.97it/s]',
 'Loading GT:  15% 996/6472 [04:02<23:22,  3.90it/s]',
 'Loading GT:  15% 997/6472 [04:02<23:13,  3.93it/s]',
 'Loading GT:  15% 998/6472 [04:02<22:22,  4.08it/s]',
 'Loading GT:  15% 999/6472 [04:03<21:48,  4.18it/s]',
 ...]`

Thanks for your help. What can I change to get better recognition results?

ChWick commented 3 years ago

I see that you use the output_dir flag in the calamari-predict script. In this case you must manually specify the prediction files for calamari-eval:

calamari-eval --gt *.gt.txt --pred PATH_TO_PREDS/*.pred.txt --checkpoint 'directory to the best checkpoint/best.ckpt' 

I highly recommend dropping output_dir, since this simplifies the mapping of .gt.txt and .pred.txt files and you do not have to adapt the calamari-eval call.

The output of the calamari-eval script shows that the ground truth is still loading (by the way, 4 lines per second is quite slow). Maybe test a reduced set of ground truth files first to verify that everything is running (something like calamari-eval --gt 0001*.gt.txt).
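
Before running the evaluation, it can help to check that each ground truth file actually has a matching prediction file in the prediction output directory. The following is a minimal sketch under stated assumptions: the helper name `pair_gt_with_pred` is hypothetical, and it only assumes the `.gt.txt` / `.pred.txt` naming mentioned in this thread. It is not part of Calamari.

```python
from pathlib import Path


def pair_gt_with_pred(gt_dir, pred_dir):
    """Match each NAME.gt.txt in gt_dir with NAME.pred.txt in pred_dir.

    Returns (pairs, missing): pairs whose prediction file exists, and
    pairs whose prediction file was not found.
    """
    gt_dir, pred_dir = Path(gt_dir), Path(pred_dir)
    pairs, missing = [], []
    for gt in sorted(gt_dir.glob("*.gt.txt")):
        stem = gt.name[: -len(".gt.txt")]       # strip the double extension
        pred = pred_dir / f"{stem}.pred.txt"
        (pairs if pred.exists() else missing).append((gt, pred))
    return pairs, missing


if __name__ == "__main__":
    # Hypothetical directories; replace with your own paths.
    pairs, missing = pair_gt_with_pred(".", "predictions")
    print(f"{len(pairs)} matched, {len(missing)} without a prediction file")
```

If `missing` is non-empty, the evaluation would be comparing ground truth against nothing for those lines, which matches the empty PRED column seen earlier.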

P.S. You are also using an old version of Calamari (probably 1.0.3); consider updating to 2.1.1 at some point.

Tailor2019 commented 3 years ago

Thanks for your reply, @ChWick! I installed the new version of Calamari with "pip install calamari_ocr==2.0.2", but the same error occurred as in this issue: https://github.com/Calamari-OCR/calamari/issues/221 when I use this command: "!calamari-train --files '*.png' --weights 'directoryto/arabic_models/3.ckpt' --validation '/directoryto/Validate'".

I don't understand how to change this:

elif t == 'BIDI_NORMALIZER':

to

elif t == 'BIDI_NORMALIZER':
    conv = {'BIDI_AUTO': None, 'BIDI_LTR': 'L', 'BIDI_RTL': 'R'}
    flat.append(text_processor("BidiTextProcessor", args={'bidi_direction': conv[p.get('bidiDirection', 'BIDI_AUTO')]}))

as @andbue suggested in the issue linked above. Please help me resolve the problem with this version; it may work better for my case. I will try the other suggestions and report the results. Thank you so much!

andbue commented 3 years ago

This is fixed with 8f72bc3ac12c6c077e0e5936de8b114add2c1902.

For a tutorial on how to train different kinds of models (including Arabic), have a look at the Jupyter notebook here: https://github.com/andbue/calamari_demo/blob/main/calamari_train.ipynb. This can be run on Google Colab; just make sure you select a GPU in the runtime settings.

ChWick commented 3 years ago

Use version 2.1.1: pip install -U calamari-ocr or pip install calamari-ocr==2.1.1 (note: use a dash -, not an underscore _, in the package name).

As @andbue stated, this was already fixed.

Tailor2019 commented 3 years ago

@ChWick @andbue Thanks so much ! I will try your solutions and give you the results!

Tailor2019 commented 3 years ago

Hi @ChWick! After installing Calamari 2.1.1, I tried this command line:

!calamari-train --files *.png --weights "directoryto/arabic_models/3.ckpt" --output_dir "directory to the outputfolder"

(I'm in the directory that contains my .png images, and I'm using --output_dir because I train Calamari on the same dataset with different models.) But it returns this error:

`2021-05-18 16:48:01.791625: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
CRITICAL 2021-05-18 16:48:03,561             tfaip.util.logging: Uncaught exception
Traceback (most recent call last):
  File "/usr/local/bin/calamari-train", line 8, in <module>
    sys.exit(run())
  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/scripts/train.py", line 16, in run
    main(parse_args())
  File "/usr/local/lib/python3.7/dist-packages/calamari_ocr/scripts/train.py", line 39, in parse_args
    return parser.parse_args(args).trainer
  File "/usr/local/lib/python3.7/dist-packages/paiargparse/main_parser.py", line 94, in parse_args
    raise UnknownArgumentError(f"Unknown Arguments {' '.join(argv)}. Possible alternatives:{''.join(help_str)}")
paiargparse.dataclass_parser.UnknownArgumentError: Unknown Arguments --files name-1000.png .......
--files. Alternative: --codec
    name1000.png. Alternative: --version
    name1001.png. Alternative: --optimizer`

Thanks so much for your help

andbue commented 3 years ago

Most command line arguments have changed with Calamari 2.1. Have a look at the docs or at the Jupyter notebook I posted earlier.

ChWick commented 3 years ago

Basically it is

calamari-train --train.files *.png --warmstart.model directoryto/arabic_models/3.ckpt --trainer.output_dir "directory to the outputfolder"
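
Comparing the failing 1.x-style command with the 2.1 command above gives three flag renames. The sketch below only encodes those three pairs as they appear in this thread; the helper name `translate_args` is hypothetical, and the table is not a complete list of what changed in Calamari 2.1.

```python
# Old (1.x) flag -> new (2.1) flag, taken only from the commands in this thread.
OLD_TO_NEW = {
    "--files": "--train.files",
    "--weights": "--warmstart.model",
    "--output_dir": "--trainer.output_dir",
}


def translate_args(argv):
    """Replace known old-style flags; leave values and unknown flags untouched."""
    return [OLD_TO_NEW.get(arg, arg) for arg in argv]


old = ["--files", "*.png", "--weights", "directoryto/arabic_models/3.ckpt",
       "--output_dir", "out"]
print(" ".join(["calamari-train"] + translate_args(old)))
```

For any flag not listed here, the Calamari documentation remains the place to look up the new name.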

Tailor2019 commented 3 years ago

Hi @ChWick! During training, two values are reported as "loss": the loss and the ctc-loss. What formula do you use to calculate each one, and which lines of code calculate them? Why does this version of Calamari use 2 loss functions? Thanks a lot for your continuous help!

Tailor2019 commented 3 years ago

Hi @ChWick @andbue! When I was using calamari_ocr==1.0, I used this mapping to change the direction of the text: TextProcessorParams.BIDI_LTR: 'R', TextProcessorParams.BIDI_RTL: 'L', TextProcessorParams.BIDI_AUTO: None. For the latest version, how can I make Calamari read from right to left? Thanks so much!

andbue commented 3 years ago

According to the docs, you probably would set --data.pre_proc.processors.3.bidi_direction RTL for the preprocessor (i.e. the processor that is run on the text entering the training or evaluation process) and --data.post_proc.processors.2.bidi_direction RTL for the postprocessor (i.e. the processor that is run after text comes out of the model). The default option, AUTO, should also work in cases where there are only Arabic characters in the line. It might fail, however, if there are Arabic and non-Arabic characters combined.
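
To see which ground truth lines combine Arabic and non-Arabic characters (the case where AUTO might fail), one option is to inspect the Unicode bidirectional class of each character. This is a minimal sketch using Python's standard `unicodedata` module; the helper names are hypothetical, and it does not reproduce how Calamari's bidi processor decides the direction.

```python
import unicodedata


def strong_directions(line: str) -> set:
    """Return which strong directions ('RTL', 'LTR') occur in a line.

    Bidi classes 'R' and 'AL' (Arabic letter) are strongly right-to-left,
    'L' is strongly left-to-right. Digits, spaces and punctuation are
    neutral or weak and are ignored here.
    """
    found = set()
    for ch in line:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):
            found.add("RTL")
        elif cls == "L":
            found.add("LTR")
    return found


def is_mixed(line: str) -> bool:
    """True if the line contains both strong RTL and strong LTR characters."""
    return len(strong_directions(line)) > 1


for sample in ["مرحبا", "hello", "مرحبا hello 123"]:
    print(repr(sample), sorted(strong_directions(sample)), is_mixed(sample))
```

Lines flagged as mixed are the ones where setting the direction explicitly to RTL, rather than relying on AUTO, is more likely to matter.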

If you're training models on Arabic, don't be confused by the mixed up printout of examples after each epoch, this is just a bug that does not impact model performance (see #239).

Tailor2019 commented 3 years ago

Hi @ChWick @andbue, my dataset is already divided into 3 parts (training, evaluation, and test). Is it possible to use this command: calamari-train --train.files *.png --warmstart.model directoryto/arabic_models/3.ckpt --trainer.output_dir "directory to the outputfolder" without these 2 flags: --data.pre_proc.processors.3.bidi_direction RTL --data.post_proc.processors.2.bidi_direction RTL? Also, a best.ckpt model is generated in the middle of training. Is it possible to use this model for evaluation, or must I wait until training finishes before evaluating? Thank you for your help!

ChWick commented 3 years ago

@Tailor2019 For the first question: there is only one loss. Keras/TensorFlow automatically adds an additional "loss" as the sum of all losses. Therefore, here "ctc-loss" == "loss".

You can use any intermediate model (best.ckpt) for evaluation during training; you don't have to wait. Furthermore, you should consider using your validation split during training to automatically determine the best model. Add --val.files val_dir/*.png --train.files *.png
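
On the point about the two reported loss values, the relationship can be illustrated with a tiny sketch: the reported "loss" is the sum over all named losses, so with a single named loss the two numbers coincide. This is plain Python with a made-up value, not Keras or Calamari code, and `total_loss` is a hypothetical helper.

```python
def total_loss(named_losses: dict) -> float:
    """Aggregate 'loss' as the sum of all individually named losses."""
    return sum(named_losses.values())


# Made-up value for illustration; only one named loss exists.
named = {"ctc-loss": 12.5}
print("ctc-loss =", named["ctc-loss"], "| loss =", total_loss(named))
```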

Tailor2019 commented 3 years ago

Hi! Thanks for your reply, @ChWick! I added the flag for my validation images to the training command. The output contains the message "Better value of val_CER found. old", but there is no indication of which checkpoint is the best. In this case, how can I identify the best checkpoint in terms of CER? Also, training generates 3 folders: one for validation, one for train, and a third for checkpoints. How can I use these checkpoints to evaluate my system even though they don't have the .ckpt extension? Furthermore, best.ckpt is generated from the very start; is this model updated during training? Thanks for your help!

ChWick commented 3 years ago

The validation and train directories contain data for TensorBoard, which can be used to monitor the training. The checkpoints directory contains the checkpoints you need if you want to resume training. The current best model is tracked as best.ckpt (this is what you want), so you should use best.ckpt to evaluate the current best model.

Tailor2019 commented 3 years ago

Hello! Thanks very much, @ChWick! I would like to resume training from one of the 20 checkpoints (training on my Arabic dataset generated 20 checkpoints in the first stage). Each checkpoint folder contains trainer_params.json and a folder named "variables" that holds 3 files (variables.index, variables.data-00000-of-00001, checkpoint). But when I use this checkpoint with any command, it returns an error saying the file isn't a .ckpt (when I download the checkpoint, it is a plain file, not a .ckpt). What command can I use to resume training from one of these checkpoints? Thanks so much!

ChWick commented 3 years ago

(Sorry for the late reply) Training can be resumed by providing a valid checkpoint to calamari-resume-training:

calamari-resume-training OUTPUT_DIR/checkpoint/checkpoint_XXX/trainer_params.json

I added a section in the docs about that.
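
To pick the most recent checkpoint for the command above, a small helper can search the output directory for `trainer_params.json` files. This is a sketch under assumptions: it relies on the `checkpoint_XXX` folder layout shown in the command, it assumes that `XXX` is a number that grows with training progress, and `latest_trainer_params` is a hypothetical name rather than a Calamari function.

```python
import re
from pathlib import Path


def latest_trainer_params(output_dir):
    """Return the trainer_params.json of the highest-numbered checkpoint_XXX
    folder found one level below output_dir, or None if there is none."""
    candidates = []
    for params in Path(output_dir).glob("*/checkpoint_*/trainer_params.json"):
        match = re.search(r"checkpoint_(\d+)$", params.parent.name)
        if match:  # skip folders whose suffix is not numeric
            candidates.append((int(match.group(1)), params))
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


if __name__ == "__main__":
    # Hypothetical output directory; replace with your own.
    params = latest_trainer_params("OUTPUT_DIR")
    if params is not None:
        print(f"calamari-resume-training {params}")
```

The printed line is the same calamari-resume-training call as above, with the placeholder path filled in.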