OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

OSD on line level, recognition by loading script or lang from PAGE #175

bertsky closed this 3 years ago

bertsky commented 3 years ago

In partial fulfillment of #69 – but I'm afraid osd.traineddata is just too bad for script detection to make this work reliably. Latin vs Cyrillic vs Arabic vs Chinese etc might be easy, but Greek for example does not work on the line level...

Nevertheless, once we do have script / font detectors on the line level, we could make this work.

codecov[bot] commented 3 years ago

Codecov Report

Merging #175 (858e968) into master (a3647ea) will decrease coverage by 0.64%. The diff coverage is 30.35%.

Current head 858e968 differs from pull request most recent head 27219ac. Consider uploading reports for the commit 27219ac to get more accurate results.

@@            Coverage Diff             @@
##           master     #175      +/-   ##
==========================================
- Coverage   31.38%   30.74%   -0.65%     
==========================================
  Files          12       12              
  Lines        1252     1376     +124     
  Branches      289      319      +30     
==========================================
+ Hits          393      423      +30     
- Misses        784      869      +85     
- Partials       75       84       +9     
Impacted Files                 Coverage Δ
ocrd_tesserocr/deskew.py       13.39% <0.00%> (-1.61%)
ocrd_tesserocr/recognize.py    30.19% <33.33%> (-0.54%)
ocrd_tesserocr/config.py       81.81% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a3647ea...27219ac.

bertsky commented 3 years ago

> The mapping can only be as good as osd is in detecting script and language, of course,

This now goes beyond that: you can use e.g. -P script_model '{ "Latn - Latin": "lat+Latin", "Grek - Greek": "grc+Greek" }' -P auto_model true -P model lat+Latin+grc+Greek to have it try:

  1. if @primaryScript="Latn - Latin", then activate lat+Latin
  2. elif @primaryScript="Grek - Greek", then activate grc+Greek
  3. otherwise try each loaded model individually and pick the best-scoring one per segment
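
The three steps above can be sketched in plain Python. This is only an illustration of the fallback order, not the code in this PR: select_model, the score callback and the fake confidences are hypothetical names, and splitting the -P model value on "+" to get the individual candidates is an assumption.

```python
from typing import Callable, Dict, Optional


def select_model(
    primary_script: Optional[str],
    script_model: Dict[str, str],
    loaded_models: str,
    score: Callable[[str], float],
) -> str:
    """Pick a model for one segment (hypothetical helper).

    primary_script -- value of @primaryScript on the segment, or None
    script_model   -- script-to-model mapping, as passed via -P script_model
    loaded_models  -- "+"-joined model names, as passed via -P model
    score          -- assumed callback: recognizes the segment with a single
                      model and returns a confidence (higher is better)
    """
    # Steps 1 and 2: an annotated script that has a mapping entry wins.
    if primary_script in script_model:
        return script_model[primary_script]
    # Step 3: otherwise try each loaded model on its own and keep the best.
    # Assumption: "each loaded model" means each "+"-separated component.
    candidates = loaded_models.split("+")
    return max(candidates, key=score)


# Illustration with made-up confidences (not measured results).
script_model = {"Latn - Latin": "lat+Latin", "Grek - Greek": "grc+Greek"}
fake_conf = {"lat": 0.61, "Latin": 0.83, "grc": 0.40, "Greek": 0.55}


def fake_score(model: str) -> float:
    return fake_conf[model]


by_script = select_model("Grek - Greek", script_model, "lat+Latin+grc+Greek", fake_score)
by_score = select_model(None, script_model, "lat+Latin+grc+Greek", fake_score)
```

Here by_script comes from the mapping, while by_score falls through to step 3 because the segment carries no @primaryScript annotation.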

> but it's a step towards dynamic model selection and we can use the patterns here in other processors.

Yes, but that kind of dynamics will look quite different for each processor implementation (and no other engine has the model variety of Tesseract). I wish we could express any of that on the workflow level...

bertsky commented 3 years ago

Note: depends on OCR-D/core#699

bertsky commented 3 years ago

@kba, I still had to update our deployment rules for the resmgr changes. CI is now only failing because of the dependency on https://github.com/OCR-D/core/pull/699 – should I wait for your merge (and rebuild of ocrd/core on DH), or rather update to core's etree branch here?