bertsky commented 3 years ago

In partial fulfillment of #69 – but I'm afraid osd.traineddata is just too poor at script detection to make this work reliably. Latin vs. Cyrillic vs. Arabic vs. Chinese etc. might be easy, but Greek, for example, does not work on the line level...

Nevertheless, once we do have script / font detectors on the line level, we could make this work.

- [x] update the PR to the current master
- [x] ~~implement style_model (map from TextStyle alists to model name)~~
- [x] implement xpath_model / xpath_parameters (map from PAGE-XML XPath queries to model / variable settings)
- [x] implement auto_model (try each loaded model and pick best scoring one)

codecov[bot] commented 3 years ago

Codecov Report

Merging #175 (858e968) into master (a3647ea) will decrease coverage by 0.64%. The diff coverage is 30.35%.

:exclamation: Current head 858e968 differs from pull request most recent head 27219ac. Consider uploading reports for the commit 27219ac to get more accurate results

@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
- Coverage   31.38%   30.74%   -0.65%     
==========================================
  Files          12       12              
  Lines        1252     1376     +124     
  Branches      289      319      +30     
==========================================
+ Hits          393      423      +30     
- Misses        784      869      +85     
- Partials       75       84       +9

| Impacted Files | Coverage Δ | |
|---|---|---|
| ocrd_tesserocr/deskew.py | `13.39% <0.00%> (-1.61%)` | :arrow_down: |
| ocrd_tesserocr/recognize.py | `30.19% <33.33%> (-0.54%)` | :arrow_down: |
| ocrd_tesserocr/config.py | `81.81% <0.00%> (ø)` | |


bertsky commented 3 years ago

> The mapping can only be as good as osd is in detecting script and language, of course,

It now goes beyond that: you can use e.g. `-P script_model '{ "Latn - Latin": "lat+Latin", "Grek - Greek": "grc+Greek" }' -P auto_model true -P model lat+Latin+grc+Greek` to have it try:

- if `@primaryScript="Latn - Latin"`, then activate `lat+Latin`
- elif `@primaryScript="Grek - Greek"`, then activate `grc+Greek`
- otherwise try each loaded model individually and pick the best-scoring one per segment
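
To make the cascade above concrete, here is a minimal sketch in plain Python. It is not the processor's actual code: `select_model`, the `Recognizer` callback and the behaviour when `auto_model` is off are assumptions made for illustration only; just the parameter names `script_model`, `auto_model` and `model` come from the example above.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical callback: runs a single model on the current segment and
# returns (text, confidence). This is NOT part of ocrd_tesserocr's API.
Recognizer = Callable[[str], Tuple[str, float]]


def select_model(primary_script: Optional[str],
                 script_model: Dict[str, str],
                 model: str,
                 recognize: Recognizer,
                 auto_model: bool = True) -> str:
    """Choose which model(s) to activate for one segment (illustrative only)."""
    # 1. if/elif chain: an explicit mapping from the annotated script wins.
    if primary_script in script_model:
        return script_model[primary_script]
    # 2. Fallback: try each loaded model on its own, keep the best-scoring one.
    if auto_model:
        candidates = model.split("+")
        return max(candidates, key=lambda name: recognize(name)[1])
    # 3. Assumption: without auto_model, all loaded models stay active jointly.
    return model
```

With the mapping from the example, a segment annotated as `Grek - Greek` would get `grc+Greek` without any trial recognition, while a segment with no or an unmapped script would trigger one recognition pass per loaded model. That per-segment cost is the price of the fallback.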

> but it's a step towards dynamic model selection and we can use the patterns here in other processors.

Yes, but that kind of dynamics will look quite different for each processor implementation (and no other engine has the model variety of Tesseract). I wish we could express any of that on the workflow level...

bertsky commented 3 years ago

Note: depends on OCR-D/core#699

bertsky commented 3 years ago

@kba, I still had to update our deployment rules for the resmgr changes. CI is now only failing because of the dependency on https://github.com/OCR-D/core/pull/699 – should I wait for your merge (and rebuild of ocrd/core on DH), or rather update to core's etree branch here?

OCR-D / ocrd_tesserocr

OSD on line level, recognition by loading script or lang from PAGE #175
