Closed by bertsky 3 years ago
One idea to get this configured would be a (partial) mapping from ISO 15924 (script) / ISO 639 (language) / OCR-D (font) identifiers to Tesseract models (in the usual notation), passed as a parameter. A mapping for the empty string could become the manual fall-back. Matches in multiple categories (language/script/font), or across multiple levels (page/region/line), could be mixed via `+` in the result.
(Mappings of `type: object` are allowed syntactically in OCR-D's parameter JSON now.)
So for example, I could first run `ocrd-typegroups-classifier` (for font detection) and/or `ocrd-tesserocr-deskew` (for script detection), and then call `ocrd-tesserocr-recognize -P model_map '{ "German": "deu+Latin", "deu": "deu+Latin", "Latin": "lat+Latin", "Latn": "Latin", "Latf": "GT4HistOCR+ONB+Fraktur+frk", "Greek": "grc+ell+Greek", "Grek": "Greek", "Hebr": "Hebrew", "": "eng" }'`.
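To make the proposed lookup concrete, here is a minimal Python sketch of how such a map could be evaluated. It is not the implementation in ocrd_tesserocr: the function name `select_models` is hypothetical, and the assumptions are that values are used verbatim as Tesseract model names (no alias resolution), that duplicates are dropped, and that the empty-string entry is used only when nothing matches.

```python
from typing import Dict, Iterable

# Example map, copied from the command line above.
MODEL_MAP: Dict[str, str] = {
    "German": "deu+Latin",
    "deu": "deu+Latin",
    "Latin": "lat+Latin",
    "Latn": "Latin",
    "Latf": "GT4HistOCR+ONB+Fraktur+frk",
    "Greek": "grc+ell+Greek",
    "Grek": "Greek",
    "Hebr": "Hebrew",
    "": "eng",
}


def select_models(annotations: Iterable[str], model_map: Dict[str, str]) -> str:
    """Combine the models of all matching annotations with '+'.

    `annotations` holds whatever script/language/font identifiers were
    found for a segment (hypothetical input format). Unknown identifiers
    are ignored; if nothing matches, the entry for "" is the fall-back.
    """
    models = []
    for key in annotations:
        for model in model_map.get(key, "").split("+"):
            # Keep first-seen order and avoid loading a model twice.
            if model and model not in models:
                models.append(model)
    if not models:
        return model_map.get("", "")
    return "+".join(models)
```

For instance, a line annotated with both `deu` and `Latn` would resolve to `deu+Latin` under these assumptions, and a line without any known annotation would resolve to `eng`.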
We should probably also introduce some `model_conf` threshold here.
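One possible reading of such a threshold, sketched in Python: detections below the threshold are dropped before the map is consulted, so that uncertain script or font guesses fall through to the fall-back model. Only the name `model_conf` comes from the proposal; the `(label, confidence)` input format and the function name are assumptions for illustration.

```python
from typing import Iterable, List, Tuple


def confident_annotations(
    detections: Iterable[Tuple[str, float]], model_conf: float
) -> List[str]:
    """Keep only the labels whose confidence reaches `model_conf`.

    `detections` is an assumed format: (identifier, confidence in 0..1).
    """
    return [label for label, conf in detections if conf >= model_conf]
```

With `model_conf=0.5`, the detections `[("Latf", 0.9), ("Grek", 0.2)]` would be reduced to `["Latf"]`.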
Unfortunately, due to Tesseract's API, the implementation would need to re-initialize Tesseract each time a segment has a different script/language/font annotation than the previous one. However, one could control this performance/quality trade-off by running detection on regions or pages only.
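The cost of re-initialization can be illustrated with a small Python sketch. It does not call Tesseract: `init_engine` is a hypothetical stand-in for whatever (expensive) initialization the real API requires, and `fake_init` only records how often it was called.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Recognizer = Callable[[str], str]


def recognize_segments(
    segments: Iterable[Tuple[str, str]],
    init_engine: Callable[[str], Recognizer],
) -> List[str]:
    """Process (segment_id, model) pairs in document order.

    The engine is re-initialized only when the selected model differs
    from the one used for the previous segment.
    """
    results: List[str] = []
    current_model: Optional[str] = None
    engine: Optional[Recognizer] = None
    for segment_id, model in segments:
        if engine is None or model != current_model:
            engine = init_engine(model)  # the expensive step
            current_model = model
        results.append(engine(segment_id))
    return results


# Stand-in engine factory that records every initialization.
init_calls: List[str] = []


def fake_init(model: str) -> Recognizer:
    init_calls.append(model)
    return lambda segment_id: f"{model}:{segment_id}"
```

Calling `recognize_segments` with four lines annotated `deu`, `deu`, `Greek`, `deu` triggers three initializations in this sketch. If detection ran on regions or pages only, all lines in one region would share a model, and the number of switches would drop accordingly.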
(A problem that first needs to be addressed though is the formalization of script and language identifications in PAGE.)
Fixed by #175 (completely)
In the current state, the OCR model has to be selected in the fixed parameter JSON for the whole pipeline (all pages, all regions, all lines). We should at least offer a setting like `dynamic` that instead looks into ...
- `mods:language` of the workspace's METS file
- `@primaryScript` and `@secondaryScript` of the elements to be processed (or their parents), depending on `textequiv_level`
- `TextStyle/@fontFamily` of the elements to be processed (or their parents), depending on `textequiv_level`, as described by the spec

...and combines this information somehow to select one of the predefined models. (Predefined could include custom built models, though. So maybe this must be more than a single new value in the parameter file.)
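The "(or their parents)" lookup could be sketched as follows in Python with the standard library's ElementTree. This is an illustration under assumptions, not the processor's code: the PAGE snippet is abridged (no `Coords`, no image dimensions), and the helper names `inherited_attribute` and `font_family` are hypothetical. The METS `mods:language` lookup is left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Abridged PAGE-XML sample for illustration only (not schema-valid).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="p1.tif" primaryScript="Latn">
    <TextRegion id="r1" primaryScript="Latf">
      <TextStyle fontFamily="Fraktur"/>
      <TextLine id="l1"/>
      <TextLine id="l2" primaryScript="Grek"/>
    </TextRegion>
  </Page>
</PcGts>"""

Parents = Dict[ET.Element, ET.Element]


def inherited_attribute(element: ET.Element, name: str, parents: Parents) -> Optional[str]:
    """Return the attribute from the element or its nearest ancestor."""
    node: Optional[ET.Element] = element
    while node is not None:
        value = node.get(name)
        if value:
            return value
        node = parents.get(node)
    return None


def font_family(element: ET.Element, parents: Parents) -> Optional[str]:
    """Return TextStyle/@fontFamily from the element or its nearest ancestor."""
    node: Optional[ET.Element] = element
    while node is not None:
        style = node.find("pc:TextStyle", NS)  # direct child only
        if style is not None and style.get("fontFamily"):
            return style.get("fontFamily")
        node = parents.get(node)
    return None


root = ET.fromstring(SAMPLE)
# ElementTree has no parent pointers, so build a child -> parent map once.
parents: Parents = {child: parent for parent in root.iter() for child in parent}
```

In this sample, line `l1` carries no annotation of its own and therefore inherits `Latf` and `Fraktur` from its region, while line `l2` overrides the script with `Grek`.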