Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
Apache License 2.0

prototype of synthetic data #298

Open Tailor2019 opened 2 years ago

Tailor2019 commented 2 years ago

Hello! @ChWick @andbue Could you please share a sample of the synthetic Arabic data you used for training, if you don't mind? Thanks in advance!

andbue commented 2 years ago

c2_arabic in calamari_models_experimental is only trained on real data (historical texts, sometimes idiosyncratic transcription guidelines, no normalization). If you want to create synthetic data, you could have a look at https://github.com/Belval/TextRecognitionDataGenerator – maybe this could be extended for Arabic language data?

Tailor2019 commented 2 years ago

Thanks for your help! @ChWick @andbue What about c1_arabic: what type of data was used? Can you explain which data augmentation strategy was used to obtain these models? Thanks a lot for your continuous help!

andbue commented 2 years ago

Sorry, totally forgot about c1_arabic: it's a bit older, the training set was a bit smaller and line contours were sometimes not segmented in an ideal way.

As you can see from the json, both models were trained without augmentation. Since I had far more than 100k lines, I did not think it was necessary back then. Now, I would set n_augmentations=5.

Tailor2019 commented 2 years ago

@andbue Thanks a lot! But what do you mean by 100k lines: 100*100 or something else? Will n_augmentations=5 multiply the size of the data by 5? Thanks in advance.

andbue commented 2 years ago

As in: 100,000 transcribed line images. For n_augmentations, see https://calamari-ocr.readthedocs.io/en/latest/doc.command-line-usage.html#data-augmentation.
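
As a rough illustration of the size question above, here is a minimal sketch. It assumes that n_augmentations is the number of augmented variants generated per original line and that the original lines stay in the training set; please check the linked documentation for the exact behavior in your Calamari version. The helper name augmented_dataset_size is hypothetical and not part of Calamari.

```python
def augmented_dataset_size(n_lines: int, n_augmentations: int,
                           keep_originals: bool = True) -> int:
    """Estimate the number of training samples after augmentation.

    Hypothetical helper (not part of Calamari). Assumes each original line
    yields `n_augmentations` augmented variants, and that the originals are
    kept alongside them when `keep_originals` is True.
    """
    if n_lines < 0 or n_augmentations < 0:
        raise ValueError("n_lines and n_augmentations must be non-negative")
    augmented = n_lines * n_augmentations
    return augmented + (n_lines if keep_originals else 0)


# 100k lines with n_augmentations=5, under the assumptions stated above
print(augmented_dataset_size(100_000, 5))
```

Under these assumptions, 100k lines with n_augmentations=5 would give 500k augmented samples in addition to the 100k originals.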

Tailor2019 commented 2 years ago

@andbue Excuse me, is this figure of 100k lines the same for training the Latin and the Arabic models, or is there a difference? Thanks

andbue commented 2 years ago

GT4HistOcr, for example, is more than 300k lines. The other datasets are smaller.

Tailor2019 commented 2 years ago

@andbue Thanks. How can I access the GT4HistOcr paper? https://zenodo.org/record/1344132#.YaeYiNDMLIU I couldn't find it via its DOI. Thanks in advance!

andbue commented 2 years ago

https://jlcl.org/content/2-allissues/2-heft1-2018/jlcl_2018-1_5.pdf

Tailor2019 commented 2 years ago

Thanks so much!

Tailor2019 commented 2 years ago

Hello! @andbue I tested TextRecognitionDataGenerator on Arabic text, but after generating 1000 images the Arabic text is not displayed correctly, as in the attached image. How can I solve this problem and render the content of the images correctly? Thanks in advance!

andbue commented 2 years ago

You have to provide a font that is able to render Arabic text. Also, you have to set --word_split in order to get ligatures instead of single letters. Finally, you'd have to change the code in https://github.com/Belval/TextRecognitionDataGenerator/blob/ab83b94fd10ecdace77c77fddb2727d8e4c85289/trdg/computer_text_generator.py#L41 to output the text from right to left instead of left to right (and possibly check for problems with bidirectional text).
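
For the last point, one way to find input lines that may cause bidirectional problems is to check each string for a mix of right-to-left and left-to-right characters before rendering. The following is only a sketch using Python's standard unicodedata module; it is not part of TextRecognitionDataGenerator, and the function names are hypothetical. It only flags candidate lines and does not reorder or reshape the text.

```python
import unicodedata

# Unicode bidirectional classes for strongly directional characters
RTL_CLASSES = {"R", "AL"}   # right-to-left (e.g. Hebrew) and Arabic letters
LTR_CLASSES = {"L"}         # left-to-right letters (e.g. Latin)


def direction_profile(text: str) -> tuple[bool, bool]:
    """Return (has_rtl, has_ltr) for the strong characters in `text`."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    has_rtl = bool(classes & RTL_CLASSES)
    has_ltr = bool(classes & LTR_CLASSES)
    return has_rtl, has_ltr


def is_mixed_direction(text: str) -> bool:
    """True if `text` contains both strong RTL and strong LTR characters."""
    has_rtl, has_ltr = direction_profile(text)
    return has_rtl and has_ltr


# Hypothetical sample lines: pure Arabic, and Arabic mixed with Latin
lines = ["مرحبا بالعالم", "مرحبا OCR"]
for line in lines:
    print(repr(line), "mixed" if is_mixed_direction(line) else "single direction")
```

Lines flagged as mixed are the ones worth inspecting by eye after rendering, since a simple right-to-left reversal would also reverse the embedded left-to-right runs.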

Tailor2019 commented 2 years ago

Thanks @andbue. But what does this generator contribute to the text, i.e. what specific characteristics do the generated images have when the tool is applied directly (without the background option, blurring, ...)? Did you use a generator to produce synthetic training data for your models? Thanks a lot in advance!

andbue commented 2 years ago

As I said before, I did not use synthetic data at all.