Calamari-OCR / calamari

Line based ATR Engine based on OCRopy
Apache License 2.0

Sample of training files #161

Closed: vfbsilva closed this issue 4 years ago

vfbsilva commented 4 years ago

Where can I find samples or tutorials on how to create the training files? I want to use Calamari to recover data from ID cards such as the one in the attached image. Is it feasible? [attached image: sample]

ChWick commented 4 years ago

Calamari is an ATR engine only: its input is a single text line, which must be segmented in a previous step. Example files that can be used for training and prediction are located here: https://github.com/Calamari-OCR/calamari/tree/master/calamari_ocr/test/data. The simplest way is to use pairs of line images and text files (e.g. https://github.com/Calamari-OCR/calamari/tree/master/calamari_ocr/test/data/uw3_50lines/train).
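To make the "pairs of line images and text files" idea concrete, here is a minimal sketch that collects such pairs from a directory. It assumes that each line image and its transcription share a base name and that the transcription ends in `.gt.txt` (e.g. `0001.png` next to `0001.gt.txt`); check the linked example directory for the exact naming. The helper names `find_training_pairs` and `read_ground_truth` are hypothetical and not part of Calamari.

```python
from pathlib import Path


def find_training_pairs(data_dir, image_ext=".png", gt_ext=".gt.txt"):
    """Return sorted (image_path, text_path) pairs that share a base name.

    Assumption: the base name is everything before the first dot, so
    "0001.bin.png" and "0001.gt.txt" belong together.
    """
    pairs = []
    for image_path in sorted(Path(data_dir).glob(f"*{image_ext}")):
        base = image_path.name.split(".", 1)[0]
        text_path = image_path.with_name(base + gt_ext)
        # Skip images that have no transcription next to them.
        if text_path.is_file():
            pairs.append((image_path, text_path))
    return pairs


def read_ground_truth(text_path):
    """Read the one-line transcription that belongs to a line image."""
    return Path(text_path).read_text(encoding="utf-8").strip()
```

Running `find_training_pairs` on your own data before training is a cheap way to spot line images that have no transcription yet.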

vfbsilva commented 4 years ago

Does the background of the input text have to be white?

ChWick commented 4 years ago

In all of our use cases the background was white, but in general the color can be arbitrary. However, I expect that a significantly larger amount of ground truth (GT) would be required in that case. I therefore recommend binarizing your input, which should be straightforward for your ID cards: grayscale conversion followed by Otsu thresholding should suffice.
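As a rough illustration of "Grayscale -> Otsu", the sketch below implements both steps in plain Python on nested lists of pixels. It is not taken from Calamari; the function names are made up for this example, the grayscale weights are the common BT.601 luma coefficients, and it assumes dark text on a lighter background. In practice a library routine such as OpenCV's `cv2.threshold` with the `cv2.THRESH_OTSU` flag is the more usual choice.

```python
def to_grayscale(rgb_rows):
    """Convert rows of (r, g, b) tuples to rows of gray levels in 0..255."""
    return [
        [min(255, int(round(0.299 * r + 0.587 * g + 0.114 * b))) for r, g, b in row]
        for row in rgb_rows
    ]


def otsu_threshold(gray_rows):
    """Return the gray level that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray_rows:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        # Pixels with value <= t form the "background" class here.
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between_var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(gray_rows):
    """Map pixels above the Otsu threshold to 255 (white), the rest to 0."""
    t = otsu_threshold(gray_rows)
    return [[255 if value > t else 0 for value in row] for row in gray_rows]
```

A typical call chain would be `binarize(to_grayscale(pixels))`, where `pixels` is the line image as rows of RGB tuples.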