Open Tailor2019 opened 2 years ago
c2_arabic in calamari_models_experimental is only trained on real data (historical texts, sometimes idiosyncratic transcription guidelines, no normalization). If you want to create synthetic data, you could have a look at https://github.com/Belval/TextRecognitionDataGenerator – maybe this could be extended for Arabic language data?
Thanks for your help! @ChWick @andbue What about c1_arabic: what type of data was used? Can you explain which data augmentation strategy was used to obtain these models? Thanks a lot for your continuous help!
Sorry, totally forgot about c1_arabic: it's a bit older, the training set was a bit smaller and line contours were sometimes not segmented in an ideal way.
As you can see from the JSON, both models were trained without augmentation. Since I had far more than 100k lines, I did not think it was necessary back then. Now, I would set it to n_augmentations=5.
@andbue Thanks a lot! But what do you mean by 100k lines: 100*100, or something else? Will n_augmentations=5 multiply the size of the data by 5? Thanks in advance.
As in: 100,000 transcribed line images. For n_augmentations, see https://calamari-ocr.readthedocs.io/en/latest/doc.command-line-usage.html#data-augmentation.
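To make the effect on dataset size concrete, here is a rough sketch of the arithmetic. It assumes that every line yields n_augmentations additional variants and that the original lines are kept alongside them. Both points are assumptions on my part, and the function name is made up for illustration, so please check the linked documentation for the exact behavior.

```python
def augmented_dataset_size(n_lines: int, n_augmentations: int,
                           keep_originals: bool = True) -> int:
    """Estimate the number of training samples after augmentation.

    Assumption (not confirmed here): each line produces `n_augmentations`
    extra variants, and the originals are kept unless stated otherwise.
    """
    augmented = n_lines * n_augmentations
    originals = n_lines if keep_originals else 0
    return augmented + originals


if __name__ == "__main__":
    # 100k transcribed lines with n_augmentations=5
    print(augmented_dataset_size(100_000, 5))
    # Same setting, counting only the augmented variants
    print(augmented_dataset_size(100_000, 5, keep_originals=False))
```

Under these assumptions, 100k lines with n_augmentations=5 would give 500k augmented lines plus the 100k originals.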
@andbue Excuse me, is this number of 100k lines the same for training the Latin and the Arabic models, or is there a difference? Thanks
GT4HistOCR, for example, has more than 300k lines. The other datasets are smaller.
@andbue Thanks. How can I access the GT4HistOCR paper? https://zenodo.org/record/1344132#.YaeYiNDMLIU I couldn't find it via its DOI. Thanks in advance!
Thanks so much!
Hello!
@andbue
I tested the TextRecognitionDataGenerator on Arabic text, but after generating 1000 images the Arabic text is not displayed correctly in them.
How can I solve this problem so that the content of the images is rendered correctly?
Thanks in advance!
You have to provide a font that is able to render Arabic text. Also, you have to set --word_split in order to get ligatures instead of single letters. Finally, you'd have to change the code in https://github.com/Belval/TextRecognitionDataGenerator/blob/ab83b94fd10ecdace77c77fddb2727d8e4c85289/trdg/computer_text_generator.py#L41 to output the text from right to left instead of left to right (and possibly check for problems with bidirectional text).
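For the bidirectional check, a minimal sketch using only Python's standard library could look like the following. It is not part of TextRecognitionDataGenerator, and the function names are hypothetical. The idea is to sort the input lines into purely right-to-left ones and ones that mix directions before generating images, so that the mixed lines can be inspected separately. Digits and punctuation have weak or neutral direction and are not counted here, which is a simplification.

```python
import unicodedata

# Unicode bidirectional classes for strongly directional characters:
# "R" (e.g. Hebrew) and "AL" (Arabic letters) run right to left,
# "L" (e.g. Latin letters) runs left to right.
RTL_CLASSES = {"R", "AL"}
LTR_CLASSES = {"L"}


def direction_profile(text: str) -> dict:
    """Count strongly RTL and strongly LTR characters in a string."""
    counts = {"rtl": 0, "ltr": 0}
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            counts["rtl"] += 1
        elif bidi_class in LTR_CLASSES:
            counts["ltr"] += 1
    return counts


def is_mixed_direction(text: str) -> bool:
    """True if a line contains both strong RTL and strong LTR characters."""
    counts = direction_profile(text)
    return counts["rtl"] > 0 and counts["ltr"] > 0


def split_by_direction(lines):
    """Separate purely RTL lines from lines that need a manual look."""
    pure_rtl, needs_review = [], []
    for line in lines:
        counts = direction_profile(line)
        if counts["rtl"] > 0 and counts["ltr"] == 0:
            pure_rtl.append(line)
        else:
            needs_review.append(line)
    return pure_rtl, needs_review
```

Lines that end up in the review list (mixed Arabic and Latin text, or no Arabic at all) are the ones most likely to be rendered in the wrong order by a simple right-to-left reversal.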
Thanks @andbue. But what does this generator contribute to the text (I mean, what specific characteristics do the generated images have) when the tool is applied directly, without the background or blurring options? Did you use a generator to produce synthetic training data? Thanks a lot in advance!
As I said before, I did not use synthetic data at all.
Hello! @ChWick @andbue Could you share a sample of the synthetic Arabic data you used for training, if you don't mind? Thanks in advance!