OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License
38 stars 11 forks source link

override instead of wrap TesserocrRecognize in other processors #191

Closed bertsky closed 1 year ago

bertsky commented 1 year ago

fixes #190

Perhaps we should also add a test case for the workspace mechanics. I have a feeling this should have failed long ago even without the Processing Server's instance caching.

bertsky commented 1 year ago

Perhaps we should also add a test case for the workspace mechanics. I have a feeling this should have failed long ago even without the Processing Server's instance caching.

Ah, it actually did! (We just had no event on this repo trigger the CI since the last get_processor / run_processor changes in core.)

bertsky commented 1 year ago

Turns out I had to do a lot more to fix the CI:

bertsky commented 1 year ago

Ok, I ran into https://github.com/OCR-D/core/issues/998. Now, I could wait for the fix to be merged, or avoid the kwargs.setdefault('ocrd_tool', ...) pattern. But I don't want either, so let's do a workaround for now.

codecov[bot] commented 1 year ago

Codecov Report

Merging #191 (148d015) into master (c10f94d) will decrease coverage by 26.98%. The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #191       +/-   ##
==========================================
- Coverage   26.97%   0.00%   -26.98%     
==========================================
  Files          11      12        +1     
  Lines        1416    1377       -39     
  Branches      333     346       +13     
==========================================
- Hits          382       0      -382     
- Misses        981    1371      +390     
+ Partials       53       6       -47     
Impacted Files Coverage Δ
ocrd_tesserocr/binarize.py 0.00% <0.00%> (-15.12%) :arrow_down:
ocrd_tesserocr/config.py 0.00% <ø> (-100.00%) :arrow_down:
ocrd_tesserocr/crop.py 0.00% <0.00%> (-14.29%) :arrow_down:
ocrd_tesserocr/deskew.py 0.00% <0.00%> (-13.52%) :arrow_down:
ocrd_tesserocr/fontshape.py 0.00% <0.00%> (-17.83%) :arrow_down:
ocrd_tesserocr/recognize.py 0.00% <0.00%> (-25.38%) :arrow_down:
ocrd_tesserocr/segment.py 0.00% <0.00%> (-36.00%) :arrow_down:
ocrd_tesserocr/segment_line.py 0.00% <0.00%> (-96.16%) :arrow_down:
ocrd_tesserocr/segment_region.py 0.00% <0.00%> (-96.43%) :arrow_down:
ocrd_tesserocr/segment_table.py 0.00% <0.00%> (-34.62%) :arrow_down:
... and 1 more

... and 1 file with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

bertsky commented 1 year ago

@kba if you make a release out of https://github.com/OCR-D/core/pull/999 today, I'll revert b357217 here and require that new core version.

kba commented 1 year ago

@kba if you make a release out of OCR-D/core#999 today, I'll revert b357217 here and require that new core version.

v2.47.0 has been released now.

stweil commented 1 year ago

v2.47.0 has been released now.

Too early, because we need https://github.com/OCR-D/core/pull/1004.

bertsky commented 1 year ago

v2.47.0 has been released now.

Thanks! I'll adapt...

Too early, because we need OCR-D/core#1004.

This is not about Docker, but Python API.

kba commented 1 year ago

v2.47.0 has been released now.

Too early, because we need OCR-D/core#1004.

Hotfix incoming.

bertsky commented 1 year ago

Ok, I ran into OCR-D/core#998. Now, I could wait for the fix to be merged, or avoid the kwargs.setdefault('ocrd_tool', ...) pattern. But I don't want either, so let's do a workaround for now.

v2.47.0 has been released now.

Thanks! I'll adapt...

Odd. CI failure did come back:

  File "/home/circleci/project/ocrd_tesserocr/cli.py", line 23, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 116, in ocrd_cli_wrap_processor
    run_processor(processorClass, mets_url=mets, workspace=workspace, **kwargs)
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 95, in run_processor
    instance_caching=instance_caching
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 329, in get_processor
    parameter=parameter
  File "/home/circleci/project/ocrd_tesserocr/segment_region.py", line 17, in __init__
    self.parameter['overwrite_segments'] = self.parameter['overwrite_regions']
KeyError: 'overwrite_regions'

That's despite ocrd==2.47.1. Where does the empty ocrd_tool come from???

bertsky commented 1 year ago

That's despite ocrd==2.47.1. Where does the empty ocrd_tool come from???

Aha! It turns out that my https://github.com/OCR-D/core/pull/999 was premature: we also pass an unnecessary ocrd_tool in other places:

Now, the big story here is: none of these places is needed (or used) – neither currently, nor for OCR-D/core#974 nor OCR-D/core#884!

joschrew commented 1 year ago

Other questions (hopefully it is ok to ask it here):

bertsky commented 1 year ago

@joschrew

  • What is the reason to do this kwargs.setdefault('ocrd_tool', OCRD_TOOL['tools'][TOOL]) instead of doing this: kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL] (this is the way it was done before). Why shouldn't the ocrd_tool always be set in the constructor?

The reason is that here in ocrd_tesserocr many processors are basically just re-parameterizations (simplifications) of the "goliath" ocrd-tesserocr-recognize. This helps the user cope with complexity (and avoids code duplication while sustaining the older single-step processors) – see README.

To implement this, you have to know that ocrd.Processor does parameter parsing/instantiation/validation on the constructor: https://github.com/OCR-D/core/blob/cff8ddaefec72451795a5e0b4bdb7efc18223d99/ocrd/ocrd/processor/base.py#L146-L150

So you need to

  1. enter with the (possibly empty) params for the wrapper processor,
  2. let them validate (instantiate) against the wrapper processor's tool json,
  3. translate them to the TesserocrRecognize's tool json,
  4. re-validate (instantiate) against that tool json.

Now, in the old pattern, where every ocrd.Processor's subclass constructor just overwrites the kwarg for ocrd_tool with its own local tool json (from pkg_resources), we cannot bring TesserocrRecognize to accept any foreign tool json. Hence the setdefault pattern, which IMO is also better in general.

  • I tried to get the idea behind the ocrd-tool-setting. Is it right that a processor-constructor is supposed to set the ocrd_tool always before calling the super-constructor?

Yes, it must. Otherwise, the parameter validator in the superclass won't see the actual cmdline values. Also, the non-processing contexts (help, dump json, show resources) in the superclass constructor need to see the actual tool json.

I have only viewed a few example processors and I am wondering if this (setting ocrd-tool in a processors __init__() before calling the super constructor) is something like a rule

We could of course change the API in core to separate the parameter instantiation, but that would entail changing all processors (lots of diverse codebases with various maintainers). We have to do that for the new process_page API soon anyway, so it's a good time to think about alternatives...

(In Java you would probably use something like an abstract method in the processor-base-class?)?

Yes. We could define a method (say) init which (in the subclasses) loads the tool json, and gets called (by the superclass) before most of the work done in the current superclass constructor.

For the processor delegation pattern then, since in the new API, we will also have a setup for the processing context of a processor instance, we could do parameter re-instantiation for the new superclass there. So the subclasses would simply have to call super().setup() and then do their own ParameterValidator.validate.

mikegerber commented 1 year ago

With this PR (and ocrd 2.48.0) I seem to have problems relating to $TESSDATA_PREFIX:

$ ocrd-tesserocr-segment-region -I OCR-D-IMG-BINPAGE-sauvola -O OCR-D-TEST-TEST-TEST
18:37:10.199 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-tesserocr-segment-region'
Traceback (most recent call last):
  File "/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 128, in run_processor
    processor.process()
[... shortened traceback, so remove noise (turned out to be notabug)]
RuntimeError: Failed to init API, possibly an invalid tessdata path:
$ echo $TESSDATA_PREFIX
/home/mike.gerber/.local/share/ocrd-resources/ocrd-tesserocr-recognize
$ ls $TESSDATA_PREFIX
Fraktur_GT4HistOCR.traineddata  deu.traineddata  frk.traineddata
kba commented 1 year ago
ls $TESSDATA_PREFIX
Fraktur_GT4HistOCR.traineddata  deu.traineddata  frk.traineddata

You need osd.traineddata I think. Can you try ocrd resmgr download ocrd-tesserocr-recognize osd.traineddata and check if that fixes it?

mikegerber commented 1 year ago

Ah. Never had it working on this system, and I wrongly assumed it was a new bug.

I copied everything from /usr/share/.../tessdata for now, that fixed ocrd-tesserocr-segment-region! This PR's ocrd-tesserocr-recognize also works for me. OCR-D 2.48.0.

bertsky commented 1 year ago

@kba this time, codecov broke – IIUC it says it now has 0% coverage, because I have added more processors to the test :roll_eyes: – do you have any idea what's going on?

kba commented 1 year ago

@kba this time, codecov broke – IIUC it says it now has 0% coverage, because I have added more processors to the test 🙄 – do you have any idea what's going on?

Really confusing, I think the problem is that the last time code coverage was calculated on master, only four of the five python versions uploaded the results to codecov successfully. Now it cannot match the coverage reports to the four it last had in master. I cannot verify this because CircleCI does not retain logs of the last run on master.

In other words, I think this will clear up once we merge to master. The code coverage calculation looks right:

======================= 4 passed, 38 warnings in 22.72s ========================
make[1]: Leaving directory '/home/circleci/project'
coverage report
Name                               Stmts   Miss Branch BrPart  Cover
--------------------------------------------------------------------
ocrd_tesserocr/__init__.py            10      0      0      0   100%
ocrd_tesserocr/binarize.py            85     72     34      0    11%
ocrd_tesserocr/config.py               3      0      0      0   100%
ocrd_tesserocr/crop.py               113     99     28      0    10%
ocrd_tesserocr/deskew.py             109     95     48      0     9%
ocrd_tesserocr/fontshape.py           96     79     34      0    13%
ocrd_tesserocr/recognize.py          861    515    500     79    35%
ocrd_tesserocr/segment.py             18      8      2      0    50%
ocrd_tesserocr/segment_line.py        19      0      2      1    95%
ocrd_tesserocr/segment_region.py      21      0      2      1    96%
ocrd_tesserocr/segment_table.py       19      9      2      0    48%
ocrd_tesserocr/segment_word.py        19      0      2      1    95%
--------------------------------------------------------------------
TOTAL                               1373    877    654     82    31%

(for python3.9)

bertsky commented 1 year ago

Understood. Thanks!

I'll merge and release then.