Closed: bertsky closed this 1 year ago
Perhaps we should also add a test case for the workspace mechanics. I have a feeling this should have failed long ago even without the Processing Server's instance caching.
Ah, it actually did! (We just had no event on this repo trigger the CI since the last get_processor / run_processor changes in core.)
Turns out I had to do a lot more to fix the CI:
`apt-get update`

Ok, I ran into https://github.com/OCR-D/core/issues/998. Now, I could wait for the fix to be merged, or avoid the `kwargs.setdefault('ocrd_tool', ...)` pattern. But I don't want either, so let's do a workaround for now.
Merging #191 (148d015) into master (c10f94d) will decrease coverage by 26.98%. The diff coverage is 0.00%.
```diff
@@            Coverage Diff            @@
##           master     #191       +/-   ##
==========================================
- Coverage   26.97%    0.00%   -26.98%
==========================================
  Files          11       12        +1
  Lines        1416     1377       -39
  Branches      333      346       +13
==========================================
- Hits          382        0      -382
- Misses        981     1371      +390
+ Partials       53        6       -47
```
| Impacted Files | Coverage Δ | |
|---|---|---|
| ocrd_tesserocr/binarize.py | 0.00% <0.00%> (-15.12%) | :arrow_down: |
| ocrd_tesserocr/config.py | 0.00% <ø> (-100.00%) | :arrow_down: |
| ocrd_tesserocr/crop.py | 0.00% <0.00%> (-14.29%) | :arrow_down: |
| ocrd_tesserocr/deskew.py | 0.00% <0.00%> (-13.52%) | :arrow_down: |
| ocrd_tesserocr/fontshape.py | 0.00% <0.00%> (-17.83%) | :arrow_down: |
| ocrd_tesserocr/recognize.py | 0.00% <0.00%> (-25.38%) | :arrow_down: |
| ocrd_tesserocr/segment.py | 0.00% <0.00%> (-36.00%) | :arrow_down: |
| ocrd_tesserocr/segment_line.py | 0.00% <0.00%> (-96.16%) | :arrow_down: |
| ocrd_tesserocr/segment_region.py | 0.00% <0.00%> (-96.43%) | :arrow_down: |
| ocrd_tesserocr/segment_table.py | 0.00% <0.00%> (-34.62%) | :arrow_down: |
| ... and 1 more | | |

... and 1 file with indirect coverage changes
@kba if you make a release out of https://github.com/OCR-D/core/pull/999 today, I'll revert b357217 here and require that new core version.
v2.47.0 has been released now.
Too early, because we need https://github.com/OCR-D/core/pull/1004.
Thanks! I'll adapt...
This is not about Docker, but Python API.
Hotfix incoming.
Odd. The CI failure did come back:

```
  File "/home/circleci/project/ocrd_tesserocr/cli.py", line 23, in ocrd_tesserocr_segment_region
    return ocrd_cli_wrap_processor(TesserocrSegmentRegion, *args, **kwargs)
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/decorators/__init__.py", line 116, in ocrd_cli_wrap_processor
    run_processor(processorClass, mets_url=mets, workspace=workspace, **kwargs)
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 95, in run_processor
    instance_caching=instance_caching
  File "/home/circleci/.pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 329, in get_processor
    parameter=parameter
  File "/home/circleci/project/ocrd_tesserocr/segment_region.py", line 17, in __init__
    self.parameter['overwrite_segments'] = self.parameter['overwrite_regions']
KeyError: 'overwrite_regions'
```
That's despite `ocrd==2.47.1`. Where does the empty `ocrd_tool` come from???
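For illustration, here is a toy reconstruction of why an empty `ocrd_tool` produces exactly this KeyError. The helper below is a simplified stand-in, not the actual `ocrd.Processor` code: parameter defaults are filled in from the tool json, so with an empty tool json there is no `overwrite_regions` default for the subclass constructor to read.

```python
def instantiate_parameters(ocrd_tool, parameter=None):
    # simplified stand-in for the parameter instantiation that
    # ocrd.Processor performs in its constructor: start from the
    # caller-supplied values, then fill in defaults from the tool json
    params = dict(parameter or {})
    for name, spec in ocrd_tool.get('parameters', {}).items():
        params.setdefault(name, spec.get('default'))
    return params

# with a proper tool json, the default is present:
tool = {'parameters': {'overwrite_regions': {'type': 'boolean', 'default': True}}}
params = instantiate_parameters(tool)
params['overwrite_segments'] = params['overwrite_regions']  # works

# with an empty tool json (as in the bug above), the key is simply missing:
params = instantiate_parameters({})
# params['overwrite_regions']  -> KeyError: 'overwrite_regions'
```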
Aha! It turns out that my https://github.com/OCR-D/core/pull/999 was premature: we also pass an unnecessary `ocrd_tool` in other places:
Now, the big story here is: none of these places is needed (or used) – neither currently, nor for OCR-D/core#974 nor OCR-D/core#884!
Other questions (hopefully it is ok to ask them here):

- What is the reason to do this: `kwargs.setdefault('ocrd_tool', OCRD_TOOL['tools'][TOOL])` instead of `kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]` (the way it was done before)? Why shouldn't the ocrd_tool always be set in the constructor?
- I tried to get the idea behind the ocrd-tool setting. Is it right that a processor constructor is supposed to set the ocrd_tool before calling the super constructor? I have only viewed a few example processors, and I am wondering if this (setting the ocrd-tool in a processor's `__init__()` before calling the super constructor) is something like a rule. (In Java you would probably use something like an abstract method in the processor base class?)

@joschrew
> What is the reason to do this `kwargs.setdefault('ocrd_tool', OCRD_TOOL['tools'][TOOL])` instead of doing this: `kwargs['ocrd_tool'] = OCRD_TOOL['tools'][TOOL]` (this is the way it was done before)? Why shouldn't the ocrd_tool always be set in the constructor?
The reason is that here in ocrd_tesserocr many processors are basically just re-parameterizations (simplifications) of the "goliath" ocrd-tesserocr-recognize. This helps the user cope with complexity (and avoids code duplication while sustaining the older single-step processors) – see README.
To implement this, you have to know that `ocrd.Processor` does parameter parsing/instantiation/validation in the constructor: https://github.com/OCR-D/core/blob/cff8ddaefec72451795a5e0b4bdb7efc18223d99/ocrd/ocrd/processor/base.py#L146-L150

So you need to be able to pass a foreign tool json into that constructor.
Now, in the old pattern, where every `ocrd.Processor` subclass constructor just overwrites the kwarg for `ocrd_tool` with its own local tool json (from pkg_resources), we cannot bring TesserocrRecognize to accept any foreign tool json. Hence the `setdefault` pattern, which IMO is also better in general.
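The delegation idea can be sketched as follows. This is a minimal sketch, not the actual ocrd_tesserocr or core code: the class names follow the project, but the constructors are reduced to just the `ocrd_tool` handling, and the `OCRD_TOOL` contents are placeholders.

```python
# Placeholder tool jsons (in the real package these come from ocrd-tool.json
# via pkg_resources).
OCRD_TOOL = {
    'tools': {
        'ocrd-tesserocr-recognize': {'executable': 'ocrd-tesserocr-recognize'},
        'ocrd-tesserocr-segment-region': {'executable': 'ocrd-tesserocr-segment-region'},
    }
}

class TesserocrRecognize:
    def __init__(self, *args, **kwargs):
        # setdefault: only fall back to our own tool json if the caller
        # (e.g. a delegating subclass) has not already provided one.
        # With plain assignment (kwargs['ocrd_tool'] = ...) the subclass's
        # tool json would be clobbered here.
        kwargs.setdefault('ocrd_tool', OCRD_TOOL['tools']['ocrd-tesserocr-recognize'])
        self.ocrd_tool = kwargs['ocrd_tool']

class TesserocrSegmentRegion(TesserocrRecognize):
    def __init__(self, *args, **kwargs):
        # the thin re-parameterizing subclass injects its own tool json,
        # which survives the superclass constructor thanks to setdefault
        kwargs['ocrd_tool'] = OCRD_TOOL['tools']['ocrd-tesserocr-segment-region']
        super().__init__(*args, **kwargs)
```

With this, `TesserocrSegmentRegion()` ends up with its own tool json, while a bare `TesserocrRecognize()` still gets the recognize tool json by default.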
> I tried to get the idea behind the ocrd-tool setting. Is it right that a processor constructor is supposed to set the ocrd_tool always before calling the super-constructor?
Yes, it must. Otherwise, the parameter validator in the superclass won't see the actual cmdline values. Also, the non-processing contexts (help, dump json, show resources) in the superclass constructor need to see the actual tool json.
> I have only viewed a few example processors and I am wondering if this (setting the ocrd-tool in a processor's `__init__()` before calling the super constructor) is something like a rule
We could of course change the API in core to separate the parameter instantiation, but that would entail changing all processors (lots of diverse codebases with various maintainers). We have to do that for the new `process_page` API soon anyway, so it's a good time to think about alternatives...
> (In Java you would probably use something like an abstract method in the processor-base-class?)
Yes. We could define a method (say) `init` which (in the subclasses) loads the tool json, and which gets called (by the superclass) before most of the work done in the current superclass constructor.

For the processor delegation pattern then, since in the new API we will also have a `setup` for the processing context of a processor instance, we could do parameter re-instantiation for the new superclass there. So the subclasses would simply have to call `super().setup()` and then do their own `ParameterValidator.validate`.
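A minimal sketch of that alternative, with entirely hypothetical names: `init` and `setup` here are only the proposals from this discussion, not the actual core API, and the base class is reduced to the calling sequence.

```python
class Processor:
    """Hypothetical future base class (names are illustrative only)."""

    def __init__(self, **kwargs):
        # the superclass drives the sequence: first ask the subclass for
        # its tool json via the hook, then do parameter parsing/validation,
        # help/dump-json handling etc. (elided here)
        self.ocrd_tool = self.init()

    def init(self):
        # abstract-method-style hook: subclasses load their tool json here,
        # instead of mutating kwargs before calling super().__init__()
        raise NotImplementedError("subclasses must load their tool json here")

    def setup(self):
        # hypothetical processing-context hook of the new API; a delegating
        # subclass would call super().setup() and then re-validate its own
        # parameters against its own tool json
        pass
```

A subclass would then just implement `init` (returning its tool json) and the ordering problem disappears, because the superclass guarantees the hook runs before validation.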
With this PR (and ocrd 2.48.0) I seem to have problems relating to $TESSDATA_PREFIX:
```
$ ocrd-tesserocr-segment-region -I OCR-D-IMG-BINPAGE-sauvola -O OCR-D-TEST-TEST-TEST
18:37:10.199 ERROR ocrd.processor.helpers.run_processor - Failure in processor 'ocrd-tesserocr-segment-region'
Traceback (most recent call last):
  File "/usr/local/share/pyenv/versions/3.7.16/lib/python3.7/site-packages/ocrd/processor/helpers.py", line 128, in run_processor
    processor.process()
  [... traceback shortened to remove noise (turned out to be not a bug)]
RuntimeError: Failed to init API, possibly an invalid tessdata path:
$ echo $TESSDATA_PREFIX
/home/mike.gerber/.local/share/ocrd-resources/ocrd-tesserocr-recognize
$ ls $TESSDATA_PREFIX
Fraktur_GT4HistOCR.traineddata  deu.traineddata  frk.traineddata
```
You need `osd.traineddata`, I think. Can you try `ocrd resmgr download ocrd-tesserocr-recognize osd.traineddata` and check if that fixes it?
Ah. I never had it working on this system and wrongly assumed it was a new bug.

I copied everything from /usr/share/.../tessdata for now, and that fixed `ocrd-tesserocr-segment-region`! This PR's `ocrd-tesserocr-recognize` also works for me. OCR-D 2.48.0.
@kba this time, codecov broke – IIUC it says it now has 0% coverage, because I have added more processors to the test :roll_eyes: – do you have any idea what's going on?
Really confusing. I think the problem is that the last time code coverage was calculated on master, only four of the five Python versions uploaded their results to codecov successfully. Now codecov cannot match the new coverage reports to the four it last had on master. I cannot verify this because CircleCI does not retain logs of the last run on master.
In other words, I think this will clear up once we merge to master. The code coverage calculation looks right:
```
======================= 4 passed, 38 warnings in 22.72s ========================
make[1]: Leaving directory '/home/circleci/project'
coverage report
Name                               Stmts   Miss Branch BrPart  Cover
--------------------------------------------------------------------
ocrd_tesserocr/__init__.py            10      0      0      0   100%
ocrd_tesserocr/binarize.py            85     72     34      0    11%
ocrd_tesserocr/config.py               3      0      0      0   100%
ocrd_tesserocr/crop.py               113     99     28      0    10%
ocrd_tesserocr/deskew.py             109     95     48      0     9%
ocrd_tesserocr/fontshape.py           96     79     34      0    13%
ocrd_tesserocr/recognize.py          861    515    500     79    35%
ocrd_tesserocr/segment.py             18      8      2      0    50%
ocrd_tesserocr/segment_line.py        19      0      2      1    95%
ocrd_tesserocr/segment_region.py      21      0      2      1    96%
ocrd_tesserocr/segment_table.py       19      9      2      0    48%
ocrd_tesserocr/segment_word.py        19      0      2      1    95%
--------------------------------------------------------------------
TOTAL                               1373    877    654     82    31%
```

(for python3.9)
Understood. Thanks!
I'll merge and release then.
fixes #190