Calamari-OCR / calamari_models

Pretrained mixed models to be used with Calamari.
MIT License

fraktur_19th_century vs github.com/qurator-spk/train-calamari-gt4histocr #3

jbarth-ubhd closed this issue 3 years ago

jbarth-ubhd commented 4 years ago

Dear reader, do you have any details about the model in fraktur_19th_century?

Is it based on gt4histocr ground truth?

Kind regards.

chreul commented 4 years ago

It is mainly based on gt4histocr GT, but also on Fraktur19 data from other freely available sources (archiscribe and jze).

The following training pipeline was applied: each voter used a different out-of-domain mixed model as a starting point (trained on various subsets of gt4histocr). Then, after training on all available Fraktur19 data with data augmentation, a final refinement step was performed in which the number of lines per source was limited to a maximum of 50.

A padding of 3 rows of white pixels was added to the top and bottom of each line, if not already present. The effect on lines segmented without this padding has not yet been thoroughly evaluated. Training a new or additional ensemble with no padding or mixed padding might be sensible. Feedback would be greatly appreciated.
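For anyone who wants to prepare their own lines in a comparable way, the two preprocessing steps mentioned above (the padding and the per-source cap) could be sketched roughly as follows. This is only an illustration, not the code that was used for the model and not part of the Calamari API: the function names `pad_line` and `cap_lines_per_source` are hypothetical, line images are assumed to be 2D grayscale `uint8` arrays with white equal to 255, and the cap is assumed to pick lines at random, which the description above does not specify.

```python
import random
from collections import defaultdict

import numpy as np


def count_white_rows(rows, white=255):
    """Count consecutive all-white rows, starting from the first row given."""
    n = 0
    for row in rows:
        if not np.all(row >= white):
            break
        n += 1
    return n


def pad_line(img, pad=3, white=255):
    """Ensure at least `pad` all-white rows at the top and bottom of a line image.

    Assumption: `img` is a 2D grayscale array in which `white` marks background.
    Rows are only added where fewer than `pad` white rows are already present.
    """
    top = count_white_rows(img, white)
    bottom = count_white_rows(img[::-1], white)
    add_top = max(0, pad - top)
    add_bottom = max(0, pad - bottom)
    return np.pad(
        img,
        ((add_top, add_bottom), (0, 0)),
        mode="constant",
        constant_values=white,
    )


def cap_lines_per_source(lines, max_per_source=50, seed=0):
    """Keep at most `max_per_source` lines for each source.

    `lines` is an iterable of (source, line_id) pairs. Random selection is an
    assumption made for this sketch.
    """
    by_source = defaultdict(list)
    for source, line_id in lines:
        by_source[source].append(line_id)

    rng = random.Random(seed)
    capped = []
    for source, ids in by_source.items():
        if len(ids) > max_per_source:
            ids = rng.sample(ids, max_per_source)
        capped.extend((source, line_id) for line_id in ids)
    return capped
```

Applying `pad_line` to lines from a segmenter that crops tightly should bring them closer to what the model saw during training. Whether this actually improves recognition is the open question raised above.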