Closed bertsky closed 3 years ago
Merging #175 (858e968) into master (a3647ea) will decrease coverage by 0.64%. The diff coverage is 30.35%.
:exclamation: Current head 858e968 differs from pull request most recent head 27219ac. Consider uploading reports for the commit 27219ac to get more accurate results.
@@ Coverage Diff @@
## master #175 +/- ##
==========================================
- Coverage 31.38% 30.74% -0.65%
==========================================
Files 12 12
Lines 1252 1376 +124
Branches 289 319 +30
==========================================
+ Hits 393 423 +30
- Misses 784 869 +85
- Partials 75 84 +9
Impacted Files | Coverage Δ | |
---|---|---|
ocrd_tesserocr/deskew.py | 13.39% <0.00%> (-1.61%) | :arrow_down: |
ocrd_tesserocr/recognize.py | 30.19% <33.33%> (-0.54%) | :arrow_down: |
ocrd_tesserocr/config.py | 81.81% <0.00%> (ø) | |
Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
The mapping can only be as good as `osd` is in detecting script and language, of course,
Now goes beyond that: you can use e.g. `-P script_model '{ "Latn - Latin": "lat+Latin", "Grek - Greek": "grc+Greek" }' -P auto_model true -P model lat+Latin+grc+Greek` to have it try:
- `@primaryScript="Latn - Latin"`, then activate `lat+Latin`
- `@primaryScript="Grek - Greek"`, then activate `grc+Greek`
but it's a step towards dynamic model selection and we can use the patterns here in other processors.
Yes, but that kind of dynamics will look quite different for each processor implementation (and no other engine has the model variety of Tesseract). I wish we could express any of that on the workflow level...
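For readers following along, the `script_model` lookup discussed above can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in this PR: `select_model` is a hypothetical helper, and falling back to the full `model` value when the detected script is missing or unmapped is an assumption about the intended behaviour. The JSON string and the default model are copied from the example command line.

```python
import json

# Parameter values copied from the example command line in this thread.
SCRIPT_MODEL_JSON = '{ "Latn - Latin": "lat+Latin", "Grek - Greek": "grc+Greek" }'
DEFAULT_MODEL = "lat+Latin+grc+Greek"


def select_model(primary_script, script_model, default_model):
    """Return the model mapped to the detected script, else the default.

    Hypothetical helper: name, signature and fallback rule are
    illustrative assumptions, not taken from ocrd_tesserocr.
    """
    if primary_script is None:
        # No @primaryScript annotation on the segment: keep the full model.
        return default_model
    return script_model.get(primary_script, default_model)


script_model = json.loads(SCRIPT_MODEL_JSON)
print(select_model("Grek - Greek", script_model, DEFAULT_MODEL))
print(select_model("Cyrl - Cyrillic", script_model, DEFAULT_MODEL))
print(select_model(None, script_model, DEFAULT_MODEL))
```

The quality of the result still depends entirely on how reliable the `@primaryScript` annotation is, which is the concern raised about `osd` in this thread.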
Note: depends on OCR-D/core#699
@kba, I still had to update our deployment rules for the resmgr changes. CI is now only failing because of the dependency on https://github.com/OCR-D/core/pull/699 – should I wait for your merge (and rebuild of `ocrd/core` on DH), or rather update to core's etree branch here?
In partial fulfillment of #69 – but I'm afraid osd.traineddata is just too bad for script detection to make this work reliably. Latin vs Cyrillic vs Arabic vs Chinese etc might be easy, but Greek for example does not work on the line level...
Nevertheless, once we do have script / font detectors on the line level, we could make this work.
implement:
- `style_model` (map from `TextStyle` alists to model name)
- `xpath_model` / `xpath_parameters` (map from PAGE-XML XPath queries to model / variable settings)
- `auto_model` (try each loaded model and pick best scoring one)
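The `auto_model` idea in the list above (try each loaded model and keep the best scoring result) might look roughly like the following sketch. Everything here is hypothetical: the `Recognizer` type, `pick_best_model`, the stub recognizers and their confidence values are invented for illustration and do not reflect how the processor talks to Tesseract.

```python
from typing import Callable, Dict, Tuple

# Hypothetical recognizer type: takes a segment image and returns
# (recognized text, mean confidence in [0, 1]).
Recognizer = Callable[[object], Tuple[str, float]]


def pick_best_model(image, recognizers: Dict[str, Recognizer]) -> Tuple[str, str, float]:
    """Run every recognizer on the image and return the best-scoring
    (model name, text, confidence). On ties, the first model wins."""
    if not recognizers:
        raise ValueError("no models loaded")
    best = None
    for name, recognize in recognizers.items():
        text, conf = recognize(image)
        if best is None or conf > best[2]:
            best = (name, text, conf)
    return best


# Stub recognizers with made-up results, standing in for loaded models.
stubs = {
    "lat+Latin": lambda image: ("Aoyos", 0.42),
    "grc+Greek": lambda image: ("λόγος", 0.91),
}
print(pick_best_model(None, stubs))
```

Running every model on every segment multiplies recognition cost by the number of models, so a sketch like this would presumably only be used where no reliable script annotation is available.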