OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces

recognize: use primaryScript or TextStyle to load model #69

Closed: bertsky closed this issue 3 years ago

bertsky commented 5 years ago

In the current state, the OCR model has to be selected in the fixed parameter JSON for the whole pipeline (all pages, all regions, all lines). We should at least offer a setting like dynamic that instead looks into the PAGE annotation itself (e.g. primaryScript or TextStyle) ...

...and combines this information somehow to select one of the predefined models. (Predefined could include custom-built models, though. So maybe this must be more than a single new value in the parameter file.)

bertsky commented 3 years ago

> ...and combines this information somehow to select one of the predefined models. (Predefined could include custom-built models, though. So maybe this must be more than a single new value in the parameter file.)

One idea to get this configured would be a (partial) mapping from ISO 15924 codes (script) / ISO 639 codes (language) / OCR-D font labels to Tesseract models (in the usual notation), passed as a parameter. A mapping entry for the empty string could serve as the manual fallback. Matches in multiple categories (language/script/font), or across multiple levels (page/region/line), could be mixed via + in the result.

(Mappings with "type": "object" are now allowed syntactically in OCR-D's parameter JSON.)
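
For illustration only, a declaration of such a parameter in the tool's ocrd-tool.json could look roughly like this; the name model_map and the default value are assumptions here, nothing settled:

```json
"model_map": {
  "type": "object",
  "description": "partial mapping from ISO 15924 script / ISO 639 language / font label to Tesseract model(s); the empty-string key is the manual fallback",
  "default": {"": "eng"}
}
```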

So for example, I could first run ocrd-typegroups-classifier (for font detection) and/or ocrd-tesserocr-deskew (for script detection), and then call:

```sh
ocrd-tesserocr-recognize -P model_map '{"German": "deu+Latin", "deu": "deu+Latin", "Latin": "lat+Latin", "Latn": "Latin", "Latf": "GT4HistOCR+ONB+Fraktur+frk", "Greek": "grc+ell+Greek", "Grek": "Greek", "Hebr": "Hebrew", "": "eng"}'
```

We should probably also introduce some model_conf threshold here.
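
To make the mixing and fallback rules concrete, here is a minimal sketch of the resolution step, assuming each segment comes with a list of detected (value, confidence) pairs; the function name, signature and the exact model_conf semantics are hypothetical:

```python
# Sketch only (not the merged implementation): resolve a Tesseract model
# string for one segment from its annotated language/script/font values,
# mixing matches via "+" and falling back to the empty-string entry.
# model_conf gates low-confidence detections; all names are illustrative.

def resolve_model(model_map, annotations, model_conf=0.0):
    """annotations: list of (value, confidence) pairs,
    e.g. [("deu", 0.9), ("Latf", 0.8)] for language and script."""
    models = []
    for value, conf in annotations:
        if conf < model_conf:
            continue  # detection too uncertain, skip this category
        model = model_map.get(value)
        if model and model not in models:
            models.append(model)
    # fall back to the empty-string entry if nothing matched
    return "+".join(models) or model_map.get("", "eng")

# example: region annotated as German language in Fraktur script
model_map = {"deu": "deu+Latin", "Latf": "GT4HistOCR+ONB+Fraktur+frk", "": "eng"}
print(resolve_model(model_map, [("deu", 0.9), ("Latf", 0.8)]))
# -> "deu+Latin+GT4HistOCR+ONB+Fraktur+frk"
```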

Unfortunately, due to Tesseract's API, the implementation would need to re-initialize Tesseract each time a segment has a different script/language/font annotation than the previous one. But one could control this performance/quality trade-off by running detection on regions or pages only.
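
One way to soften this trade-off, sketched here under the assumption that memory permits holding several initialized models at once, would be to cache one tesserocr API instance per resolved model string instead of re-initializing a single instance on every switch:

```python
from tesserocr import PyTessBaseAPI

# Sketch: keep one initialized API object per model string, so switching
# between segments with different script/language/font annotations does
# not force a full re-initialization each time. Assumes enough memory;
# real code would also End() the cached instances on shutdown.
_apis = {}

def api_for(model):
    if model not in _apis:
        _apis[model] = PyTessBaseAPI(lang=model)  # init once per model
    return _apis[model]

def recognize(image, model):
    api = api_for(model)
    api.SetImage(image)  # PIL.Image of the segment
    return api.GetUTF8Text()
```

Whether caching actually pays off depends on how many distinct models occur in a document and on available memory.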

(A problem that first needs to be addressed though is the formalization of script and language identifications in PAGE.)

bertsky commented 3 years ago

Fixed by #175 (completely)